Using Docker on Hadoop

Terrascope users can use Docker on Hadoop to run specific Spark or Python versions that are unavailable on the cluster. Docker images can serve as base images for installing required packages or for incorporating a Python conda environment.

Required Parameters

Docker can be enabled separately for the driver and the executors, and a different image can be used for each. The runtime environment variables are passed via the spark.yarn.appMasterEnv.* properties for the driver and the spark.executorEnv.* properties for the executors:


Driver: --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker

Executor: --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker

The three required parameters (values can differ for driver/executor) for running Spark on Docker are:

  • YARN_CONTAINER_RUNTIME_TYPE=docker: Specifies docker as the runtime environment for running Spark applications.

  • YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image: Specifies the docker image for the driver or executor.

  • YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro: Defines the folders to mount into the Docker container.

    Required paths include:

    • /var/lib/sss/pipes and /etc/krb5.conf for user-related settings.
    • /etc/hadoop/conf and /usr/hdp/current for Hadoop-related configurations.

    Users can add additional volumes as needed.

    Format: /path/on/host/os/:/where/to/mount/in/container/:read-write-permissions. Paths outside /data need administrator approval.

If using a custom Python installation in the Docker container, set PYSPARK_PYTHON to the Python binary before running spark-submit to ensure compatibility and correct execution.
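
As an illustration, a spark-submit invocation that combines these settings for both the driver and the executors could look like the sketch below. The image name (your/image), the application script (my_job.py) and the Python path are placeholders for this example and should be replaced with your own values.

# Point PYSPARK_PYTHON to the Python binary inside the Docker image (placeholder path)
export PYSPARK_PYTHON=/usr/bin/python3.11

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/hdp/current/:/usr/hdp/current/:ro,/etc/hadoop/conf/:/etc/hadoop/conf/:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
  my_job.py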

Spark and Python in Docker

When a Spark or Python version not available on the cluster is needed, a Docker container with the required versions can be used. Currently, a base Docker image for Python 3.11 and Spark 3.2 is available on VITO’s Docker repository: vito-docker.artifactory.vgt.vito.be.

The Dockerfile of this base image is as follows:


FROM vito-docker-private.artifactory.vgt.vito.be/almalinux8.5

ARG SPARK_VERSION
ARG PYTHON_PACKAGE

COPY lib/repos/alma_8.5/vito-alma.repo /etc/yum.repos.d/vito.repo

WORKDIR /opt/spark/work-dir

RUN dnf clean all && dnf install -y epel-release && \
    dnf install -y spark-bin-${SPARK_VERSION} \
    ${PYTHON_PACKAGE} \
    java-11-openjdk-headless \
    krb5-workstation \
    krb5-libs \
    sssd-client \
    ipa-client \
    nss && \
    python3.11 -m pip install --upgrade --target /usr/lib64/python3.11/site-packages/ pip setuptools wheel && \
    rm -r /root/.cache && \
    yum clean all && \
    rm -rf /var/cache/yum/*

ENV SPARK_HOME /usr/local/spark
ENV JAVA_HOME /usr/lib/jvm/jre
ENV PYSPARK_PYTHON=python3
ENV HADOOP_HOME=/usr/hdp/current/hadoop-client
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV YARN_CONF_DIR=/etc/hadoop/conf

The image can be used as a base image for installing required packages or for mounting a Python conda environment. Note that the Spark version used in the container must also be available on the machine where the job is submitted: for instance, Spark 3.2 must be installed and configured on the submit machine (Spark 3.2 is available on the user VMs by default).
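
As an illustration, extending the base image with extra Python packages could look like the minimal Dockerfile sketch below. The image name and tag in the FROM line are assumptions made for this example; check the vito-docker.artifactory.vgt.vito.be repository for the exact name and tag of the published base image.

# Hypothetical image name/tag: replace with the published Python 3.11 / Spark 3.2 base image
FROM vito-docker.artifactory.vgt.vito.be/spark3.2-python3.11:latest

# Install the additional Python packages the Spark job needs
RUN python3.11 -m pip install --no-cache-dir numpy pandas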

On the submit machine, the spark-defaults.conf file of this Spark 3.2 installation looks like this:


spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.fs.numReplayThreads 3
spark.history.fs.update.interval 10s
spark.history.kerberos.enabled true
spark.history.kerberos.keytab /etc/security/keytabs/spark.headless.keytab
spark.history.kerberos.principal spark@VGT.VITO.BE
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.yarn.historyServer.address epod-ha.vgt.vito.be:18481
spark.hadoop.yarn.timeline-service.enabled false
spark.yarn.queue default
spark.hadoop.yarn.client.failover-proxy-provider org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider
spark.driver.extraJavaOptions -Dhdp.version=3.1.4.0-315
spark.yarn.am.extraJavaOptions -Dhdp.version=3.1.4.0-315
spark.shuffle.service.name=spark_shuffle_320
spark.shuffle.service.port=7557

And the spark-env.sh file looks like this:

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/3.1.4.0-315/hadoop/conf}
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/3.1.4.0-315/hadoop}
export SPARK_HOME=/opt/spark3_2_0/
export SPARK_MAJOR_VERSION=3
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH"
export PYTHONPATH="/opt/spark3_2_0/python:/opt/spark3_2_0/python/lib/py4j-*zip"

A sample Spark submission using the base Docker container can be found here.
