PySpark dependencies
Often, your Python Spark job will need dependencies. On the Terrascope platform, we offer the following Python interpreters with pre-installed dependencies:
- python3.5
- python3.6
All these Python interpreters come with additional packages that are typically used in image processing. Although this provides a good starting point for prototyping, it is better to manage dependencies yourself, as described in the following sections.
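To check exactly which packages a given interpreter provides, you can list them on the platform; this assumes pip is available for these interpreters:
python3.6 -m pip list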
Conda
In order to run a script with the same custom environment that was used to develop it, you can ship an entire conda environment to YARN. First, make sure you have conda installed, then create a gzipped archive of your conda environment using conda-pack.
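If the environment does not exist yet, it can be created and conda-pack installed first; the environment name and package list below are only examples:
conda create -n myenv python=3.6 numpy
conda install -n myenv -c conda-forge conda-pack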
conda activate myenv
conda pack -o /path/to/folder/myenv.tar.gz
Then start the Spark job.
PYSPARK_PYTHON=./myenv/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./myenv/bin/python \
--master yarn \
--deploy-mode cluster \
--archives /path/to/folder/myenv.tar.gz#myenv \
script.py
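For completeness, a minimal script.py that exercises a package shipped in the packed environment could look like the sketch below; the numpy dependency is an assumption and only serves to show that imports resolve against the shipped environment:
import numpy as np
from pyspark.sql import SparkSession

# Driver and executors both run ./myenv/bin/python, so numpy is available everywhere.
spark = SparkSession.builder.appName("conda-pack-example").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
total = rdd.map(lambda x: float(np.sqrt(x))).sum()
print(total)

spark.stop()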
Docker
In addition to conda, you can also use Docker to stage dependencies. We have prepared a sample job to get you started. Docker can be configured separately for the driver and the executors, and it is possible to use a different image for each.
The three required parameters for running Spark on Docker are listed below; a complete spark-submit example is shown further below. The values can differ between driver and executor: for the driver, the parameter should be prefixed by spark.yarn.appMasterEnv., for the executor by spark.executorEnv.
- YARN_CONTAINER_RUNTIME_TYPE=docker enables running in Docker containers.
- YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image selects which Docker image to use; the example uses vito-docker-private.artifactory.vgt.vito.be/dockerspark-quickstart.
- YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro specifies which folders are mounted inside your Docker container. The mounts listed here are always required for Spark on Docker to work; you can add your own required volumes.
- /var/lib/sss/pipes and /etc/krb5.conf are for user-related settings.
- /etc/hadoop/conf and /usr/hdp/current are for Hadoop-related settings.
The format of a mount is /path/on/host/os/:/where/to/mount/in/container/:read-write-permissions.
Paths outside /data/MTDA will have to be requested, as they need to be allowed by an administrator. If you are using your own installation of Python in your Docker container, you will also need to make sure that you set PYSPARK_PYTHON to point to your Python binary before submitting the application.
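Putting the three parameters together, a spark-submit invocation could look like the sketch below; the image, mounts, and script name are taken from the example values above and may need to be adapted to your own setup:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=vito-docker-private.artifactory.vgt.vito.be/dockerspark-quickstart \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=vito-docker-private.artifactory.vgt.vito.be/dockerspark-quickstart \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
script.py
If your image ships its own Python installation, set PYSPARK_PYTHON through both prefixes in the same way, for example --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 and --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3 (the path is an example).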
Docker images
Images need to be created from our base image: vito-docker-private.artifactory.vgt.vito.be/dockerspark-centos
This image has some of the required environment variables and packages pre-installed. If you do not wish to use this base image, the requirements for a Docker image to work with Spark are listed below.
- java8 (for Hadoop)
- sssd-client (for authentication)
- Environment variables (for configuration)
JAVA_HOME=/usr/lib/jvm/java-1.8.0
HADOOP_HOME=/usr/hdp/current/hadoop-client
HADOOP_CONF_DIR=/etc/hadoop/conf
YARN_CONF_DIR=/etc/hadoop/conf
SPARK_HOME=/usr/hdp/current/spark2-client
In a Dockerfile, this would result in the following:
FROM vito-docker-private.artifactory.vgt.vito.be/centos7.4
RUN yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel sssd-client && \
yum clean all
ENV JAVA_HOME=/usr/lib/jvm/java-1.8.0 \
    HADOOP_HOME=/usr/hdp/current/hadoop-client \
    HADOOP_CONF_DIR=/etc/hadoop/conf \
    YARN_CONF_DIR=/etc/hadoop/conf \
    SPARK_HOME=/usr/hdp/current/spark2-client
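Once the image is built, it has to be pushed to a registry that the cluster can reach before it can be referenced in YARN_CONTAINER_RUNTIME_DOCKER_IMAGE; the image name and tag below are only an example:
docker build -t vito-docker-private.artifactory.vgt.vito.be/my-spark-job:latest .
docker push vito-docker-private.artifactory.vgt.vito.be/my-spark-job:latest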