Using Docker on Hadoop
Terrascope users can use Docker on Hadoop to run specific Spark or Python versions that are unavailable on the cluster. Docker images can serve as base images for installing required packages or for incorporating a Python conda environment.
Required Parameters
Docker can be enabled separately for the driver and the executors, and a different image can be used for each.
Driver: --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
Executor: --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
The three required parameters (values can differ for driver/executor) for running Spark on Docker are:

YARN_CONTAINER_RUNTIME_TYPE=docker
: Specifies docker as the runtime environment for running Spark applications.

YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image
: Specifies the Docker image for the driver or executor.

YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro
: Defines the folders to mount into the Docker container. Required paths include:
  - /var/lib/sss/pipes and /etc/krb5.conf for user-related settings.
  - /etc/hadoop/conf and /usr/hdp/current for Hadoop-related configurations.
  Users can add additional volumes as needed, using the format /path/on/host/os/:/where/to/mount/in/container/:read-write-permissions. Paths outside /data need administrator approval.

If using a custom Python installation in the Docker container, set PYSPARK_PYTHON to the Python binary before running spark-submit to ensure compatibility and correct execution.
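For illustration, a submission that enables Docker for both the driver and the executors could look like the sketch below. The image name (your/image), the Python path inside the container, and the application script are placeholders; the mount list is the required set described above.

# Illustrative sketch only: your/image, the Python path and myscript.py are placeholders.
MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro

# Point PySpark at the Python binary inside the container (only needed for a custom Python installation).
export PYSPARK_PYTHON=/usr/bin/python3.11

spark-submit \
  --master yarn \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="$MOUNTS" \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="$MOUNTS" \
  myscript.py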
Spark and Python in Docker
When a Spark or Python version not available on the cluster is needed, a Docker container with the required versions can be used. Currently, a base Docker image for Python 3.11 and Spark 3.2 is available on VITO’s Docker repository: vito-docker.artifactory.vgt.vito.be.
The Dockerfile of this base image is as follows:
FROM vito-docker-private.artifactory.vgt.vito.be/almalinux8.5
ARG SPARK_VERSION
ARG PYTHON_PACKAGE
COPY lib/repos/alma_8.5/vito-alma.repo /etc/yum.repos.d/vito.repo
WORKDIR /opt/spark/work-dir
RUN dnf clean all && dnf install -y epel-release && \
    dnf install -y spark-bin-${SPARK_VERSION} \
                   ${PYTHON_PACKAGE} \
                   java-11-openjdk-headless \
                   krb5-workstation \
                   krb5-libs \
                   sssd-client \
                   ipa-client \
                   nss && \
    python3.11 -m pip install --upgrade --target /usr/lib64/python3.11/site-packages/ pip setuptools wheel && \
    rm -r /root/.cache && \
    yum clean all && \
    rm -rf /var/cache/yum/*
ENV SPARK_HOME /usr/local/spark
ENV JAVA_HOME /usr/lib/jvm/jre
ENV PYSPARK_PYTHON=python3
ENV HADOOP_HOME=/usr/hdp/current/hadoop-client
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV YARN_CONF_DIR=/etc/hadoop/conf
The image can be used as a base when installing required packages or mounting a Python conda environment. An important consideration is that the Spark version used must also be available on the machine where the job is submitted. For instance, Spark 3.2 must be installed and configured on the submit machine (Spark 3.2 is available on the user VMs by default).
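As an illustration of extending the base image, a derived image can install extra Python packages on top of it. The sketch below assumes a hypothetical base image name/tag, derived image name, and example packages; only the registry (vito-docker.artifactory.vgt.vito.be) and the pip target path come from the Dockerfile above.

# Sketch only: the base image name/tag, the derived image name and the packages are placeholders.
cat > Dockerfile.custom <<'EOF'
FROM vito-docker.artifactory.vgt.vito.be/your-base-image:tag
# Install the extra Python packages the application needs next to the base installation.
RUN python3.11 -m pip install --target /usr/lib64/python3.11/site-packages/ numpy requests
EOF
docker build -f Dockerfile.custom -t your-registry/your-image:latest .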
The spark-defaults.conf file of this Spark installation on the submit machine looks like this:
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.fs.numReplayThreads 3
spark.history.fs.update.interval 10s
spark.history.kerberos.enabled true
spark.history.kerberos.keytab /etc/security/keytabs/spark.headless.keytab
spark.history.kerberos.principal spark@VGT.VITO.BE
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.yarn.historyServer.address epod-ha.vgt.vito.be:18481
spark.hadoop.yarn.timeline-service.enabled false
spark.yarn.queue default
spark.hadoop.yarn.client.failover-proxy-provider org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider
spark.driver.extraJavaOptions -Dhdp.version=3.1.4.0-315
spark.yarn.am.extraJavaOptions -Dhdp.version=3.1.4.0-315
spark.shuffle.service.name=spark_shuffle_320
spark.shuffle.service.port=7557
And the spark-env.sh file looks like this:
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/3.1.4.0-315/hadoop/conf}
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/3.1.4.0-315/hadoop}
export SPARK_HOME=/opt/spark3_2_0/
export SPARK_MAJOR_VERSION=3
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH"
export PYTHONPATH="/opt/spark3_2_0/python:/opt/spark3_2_0/python/lib/py4j-*zip"
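With this configuration in place, a quick check on the submit machine confirms that the Spark 3.2 installation pointed to by SPARK_HOME will perform the submission:

# Sanity check: should report Spark 3.2.x, matching the Spark version used in the container.
/opt/spark3_2_0/bin/spark-submit --version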
A sample of a Spark submission using the base Docker container can be found here.