PySpark dependencies
Often, your Python Spark job will need dependencies. On the Terrascope platform, we offer the following Python interpreters with pre-installed dependencies:
- python3.5
- python3.6
All these Python interpreters come with additional packages that are typically used in image processing. Although this provides a good starting point for prototyping, it is better to manage dependencies yourself, as described in the following sections.
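To check exactly which packages a given interpreter provides, you can list them on the platform; this assumes pip is available for these interpreters:
python3.6 -m pip list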
Conda
In order to run a script with the same custom environment that was used to develop it, you can ship an entire conda environment to YARN. First, make sure you have conda installed, then create a gzipped archive of your conda environment using conda-pack.
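If the environment does not exist yet, it can be created and conda-pack installed first; the environment name and package list below are only examples:
conda create -n myenv python=3.6 numpy
conda install -n myenv -c conda-forge conda-pack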
conda activate myenv
conda pack -o /path/to/folder/myenv.tar.gz
Then start the Spark job.
PYSPARK_PYTHON=./myenv/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./myenv/bin/python \
--master yarn \
--deploy-mode cluster \
--archives /path/to/folder/myenv.tar.gz#myenv \
script.py
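For completeness, a minimal script.py that exercises a package shipped in the packed environment could look like the sketch below; the numpy dependency is an assumption and only serves to show that imports resolve against the shipped environment:
import numpy as np
from pyspark.sql import SparkSession

# Driver and executors both run ./myenv/bin/python, so numpy is available everywhere.
spark = SparkSession.builder.appName("conda-pack-example").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
total = rdd.map(lambda x: float(np.sqrt(x))).sum()
print(total)

spark.stop()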
Docker
In addition to conda, you can also use Docker to stage dependencies. We have prepared a sample job to get you started. Docker can be configured separately for the driver and the executors, and it is possible to use a different image for each.
The three required parameters for running Spark on Docker are listed below; a complete spark-submit example is shown further below. The values can differ between driver and executor: for the driver, the parameter should be prefixed by spark.yarn.appMasterEnv., for the executor by spark.executorEnv.
- YARN_CONTAINER_RUNTIME_TYPE=docker enables running in Docker containers.
- YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image selects which Docker image to use; the example uses vito-docker-private.artifactory.vgt.vito.be/dockerspark-quickstart.
- YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro specifies which folders are mounted inside your Docker container. The mounts listed here are always required for Spark on Docker to work; you can add your own required volumes.
- /var/lib/sss/pipes and /etc/krb5.conf are for user-related settings.
- /etc/hadoop/conf and /usr/hdp/current are for Hadoop-related settings.
The format of a mount is /path/on/host/os/:/where/to/mount/in/container/:read-write-permissions.
Paths outside /data/MTDA will have to be requested, as they need to be allowed by an administrator. If you are using your own installation of Python in your Docker container, you will also need to make sure that you set PYSPARK_PYTHON to point to your Python binary before submitting the application.
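Putting the three parameters together, a spark-submit invocation could look like the sketch below; the image, mounts, and script name are taken from the example values above and may need to be adapted to your own setup:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=vito-docker-private.artifactory.vgt.vito.be/dockerspark-quickstart \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=vito-docker-private.artifactory.vgt.vito.be/dockerspark-quickstart \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
script.py
If your image ships its own Python installation, set PYSPARK_PYTHON through both prefixes in the same way, for example --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 and --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3 (the path is an example).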
Docker images
Images need to be created from our base image: vito-docker-private.artifactory.vgt.vito.be/dockerspark-centos
This image has some of the required environment variables and packages pre-installed. If you do not wish to use this base image, the requirements for a Docker image to work with Spark are listed below.
- java8 (for Hadoop)
- sssd-client (for authentication)
- Environment variables (for configuration)
JAVA_HOME=/usr/lib/jvm/java-1.8.0
HADOOP_HOME=/usr/hdp/current/hadoop-client
HADOOP_CONF_DIR=/etc/hadoop/conf
YARN_CONF_DIR=/etc/hadoop/conf
SPARK_HOME=/usr/hdp/current/spark2-client
In a Dockerfile, this would result in the following:
FROM vito-docker-private.artifactory.vgt.vito.be/centos7.4
RUN yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel sssd-client && \
yum clean all
ENV JAVA_HOME=/usr/lib/jvm/java-1.8.0 \
    HADOOP_HOME=/usr/hdp/current/hadoop-client \
    HADOOP_CONF_DIR=/etc/hadoop/conf \
    YARN_CONF_DIR=/etc/hadoop/conf \
    SPARK_HOME=/usr/hdp/current/spark2-client
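Once the image is built, it has to be pushed to a registry that the cluster can reach before it can be referenced in YARN_CONTAINER_RUNTIME_DOCKER_IMAGE; the image name and tag below are only an example:
docker build -t vito-docker-private.artifactory.vgt.vito.be/my-spark-job:latest .
docker push vito-docker-private.artifactory.vgt.vito.be/my-spark-job:latest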