
Using Docker on Hadoop

Terrascope users can use Docker on Hadoop to run specific Spark or Python versions that are unavailable on the cluster. Docker images can serve as base images for installing required packages or for incorporating a Python conda environment. This is the preferred method for production workloads, as it provides the highest level of reproducibility and isolation.

Overview

Docker allows you to package your application, Python libraries, and system-level dependencies into a container image. YARN runs your Spark job inside these containers, ensuring a consistent environment across all nodes. Docker can be used separately for the driver and the executors, allowing you to use different images for each if needed.

Required Parameters

The three required parameters for running Spark on Docker are:

1. YARN_CONTAINER_RUNTIME_TYPE=docker

This is the main switch that tells YARN to use the Docker runtime for the container. Without this, the process would run directly on the host node’s operating system.

For the driver:

--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker

For executors:

--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker

2. YARN_CONTAINER_RUNTIME_DOCKER_IMAGE

This specifies which Docker image to use for the container. The image must be accessible from all cluster nodes.

For the driver:

--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image:tag

For executors:

--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image:tag

3. YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS

This defines the volume mounts, creating a bridge between the host’s filesystem and the container’s filesystem. This is necessary for services like Kerberos and Hadoop to work correctly inside the container.

For the new cluster, the required mounts are:

/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/local/hadoop/:/usr/local/hadoop/:ro,/etc/krb5.conf:/etc/krb5.conf:ro

Required paths explained:

  • /var/lib/sss/pipes (read-write): Required for SSSD (System Security Services Daemon) authentication to work inside the container.
  • /etc/krb5.conf (read-only): Contains Kerberos configuration needed for authentication with Hadoop services.
  • /usr/local/hadoop/ (read-only): Contains Hadoop client libraries and configuration files needed for HDFS and YARN communication.

Mount format: /path/on/host:/path/in/container:permissions

Users can add additional volumes as needed (e.g. for data access to specific dataset directories). Paths outside /data require administrator approval. The hadoop-spark-samples repository contains examples that add extra mounts for data.
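Extra volumes are appended to the required mounts with commas, following the same /path/on/host:/path/in/container:permissions format. A small sketch (the /data/MTDA path is only an example; substitute the dataset directory you actually need):

```shell
# Build the mounts value: the three required mounts plus one extra
# read-only data volume (example path; adjust to your dataset).
REQUIRED="/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/local/hadoop/:/usr/local/hadoop/:ro,/etc/krb5.conf:/etc/krb5.conf:ro"
EXTRA="/data/MTDA:/data/MTDA:ro"
MOUNTS="$REQUIRED,$EXTRA"
echo "$MOUNTS"
```

The resulting $MOUNTS string is what you pass as the value of YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS.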

Additional Configuration

Python Path

If using a custom Python installation in the Docker container, set PYSPARK_PYTHON to the Python binary path:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3.11
--conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3.11

Java Home (Spark 4.0.1)

Spark 4.0.1 requires Java 17. The new cluster has Java 8, 11, and 17 installed, so you must set JAVA_HOME explicitly to Java 17 when using Spark 4.0.1 (both in your environment and in the Spark config passed to the driver and executors). If using Spark 4.0.1, add:

--conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-17-openjdk
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-17-openjdk

Spark Configuration Files

When building Docker images for Spark, you may need to configure Spark environment variables and defaults. These configurations are typically set in spark-env.sh and spark-defaults.conf files within your Docker image. These configuration files are already available on the userVM for the new cluster: Spark 3.5.0 uses /opt/spark3_5_0/conf2/, and Spark 4.0.1 uses /opt/spark4_0_1/conf/ (both point to the new cluster). You can copy or mount these files from the userVM when building your image; the snippets below are for reference or when using custom images.

spark-env.sh

The spark-env.sh file sets environment variables that Spark uses at runtime. For the new Hadoop cluster, use the following configuration:

export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/jre}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop/}
export HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop/}
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH"
export PYTHONPATH="/opt/spark3_5_0/python:/opt/spark3_5_0/python/lib/py4j-*zip"

Notes:

  • JAVA_HOME defaults to /usr/lib/jvm/jre but can be overridden (e.g. to /usr/lib/jvm/java-17-openjdk for Spark 4.0.1).
  • HADOOP_CONF_DIR and HADOOP_HOME point to the mounted /usr/local/hadoop/ directory.
  • SPARK_DIST_CLASSPATH is set dynamically using hadoop classpath to include all necessary Hadoop libraries.
  • PYTHONPATH should match your Spark version (e.g. /opt/spark4_0_1/python for Spark 4.0.1).

spark-defaults.conf

The spark-defaults.conf file contains Spark configuration properties. For the new Hadoop cluster, use:

spark.eventLog.dir hdfs:///spark3-history/
spark.eventLog.enabled true
spark.yarn.historyServer.address history.hadoop.rscluster.vito.be:18481
spark.yarn.queue default
spark.yarn.maxAppAttempts 1
spark.hadoop.yarn.client.failover-proxy-provider org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.shuffle.service.name spark_shuffle_350
spark.shuffle.service.port 7668
spark.yarn.appMasterEnv.JAVA_HOME /usr/lib/jvm/jre
spark.executorEnv.JAVA_HOME /usr/lib/jvm/jre

Configuration details:

  • Event Logging: spark.eventLog.dir and spark.eventLog.enabled enable Spark event logging to HDFS for job history tracking.
  • History Server: spark.yarn.historyServer.address points to the Spark History Server on the new cluster.
  • Resource Management: spark.dynamicAllocation.enabled and spark.shuffle.service.enabled enable dynamic resource allocation for better cluster utilization.
  • Shuffle Service: spark.shuffle.service.name and spark.shuffle.service.port configure the shuffle service (use spark_shuffle_401 and port 7668 for Spark 4.0.1).
  • Java Configuration: JAVA_HOME is set for both the Application Master and executors (use /usr/lib/jvm/java-17-openjdk for Spark 4.0.1).

For Spark 4.0.1, update the following values:

spark.shuffle.service.name spark_shuffle_401
spark.yarn.appMasterEnv.JAVA_HOME /usr/lib/jvm/java-17-openjdk
spark.executorEnv.JAVA_HOME /usr/lib/jvm/java-17-openjdk

These configuration files should be placed in your Docker image at $SPARK_HOME/conf/spark-env.sh and $SPARK_HOME/conf/spark-defaults.conf, or you can mount them from the host if they’re available in the mounted /usr/local/hadoop/ directory.
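As an illustration, a minimal Dockerfile could copy these files into the image. This is a sketch, not a maintained image definition: the base image, package names, and the assumption that the conf files were copied from the userVM into the build context beforehand are all examples to adapt.

```dockerfile
# Sketch only: base image and packages are examples, not Terrascope-maintained.
FROM registry.access.redhat.com/ubi9/ubi:latest

# Python and Java for the container (Java 17 shown, as needed for Spark 4.0.1)
RUN dnf install -y python3.11 python3.11-pip java-17-openjdk-headless

# Spark configuration prepared beforehand, e.g. copied from the userVM
# (/opt/spark3_5_0/conf2/ or /opt/spark4_0_1/conf/) into the build context
ENV SPARK_HOME=/opt/spark
COPY conf/spark-env.sh        $SPARK_HOME/conf/spark-env.sh
COPY conf/spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf
```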

Complete Example

Full, up-to-date examples for submitting Spark jobs with Docker on the new cluster are available in the hadoop-spark-samples repository (e.g. the docker folder and related scripts). The main ingredients are:

  • Set the required environment variables (SPARK_HOME, SPARK_CONF_DIR, HADOOP_HOME, HADOOP_CONF_DIR, PATH).
  • Configure the Docker image and YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS.
  • Set PYSPARK_PYTHON to the Python binary inside the container.
  • For Spark 4.0.1, also set JAVA_HOME to Java 17.

Refer to the samples for complete, maintained commands.
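For orientation only (not a maintained command; your/image:tag and my_job.py are placeholders), a minimal submission combining the required Docker parameters might look like:

```shell
# Illustrative sketch: image, mounts and script name are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image:tag \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=your/image:tag \
  --conf "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/local/hadoop/:/usr/local/hadoop/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
  --conf "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/var/lib/sss/pipes:/var/lib/sss/pipes:rw,/usr/local/hadoop/:/usr/local/hadoop/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3.11 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3.11 \
  my_job.py
```

For Spark 4.0.1, add the two JAVA_HOME --conf lines shown earlier.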

Authenticated Docker Repositories

If your Docker image is stored in a private repository that requires authentication, the cluster nodes need credentials to pull the image. The best practice is to store the Docker configuration file in HDFS and restrict access using HDFS ACLs so that only you and the YARN service account can read it.

Obtaining config.json

  1. Log in to your private Docker registry:

    docker login <your-registry-host>

    You will be prompted for your username and password (or an API key).

  2. Locate the generated configuration file. Docker stores credentials in:

    ~/.docker/config.json

    Check that it exists: cat ~/.docker/config.json. It will contain entries for the registries you have logged into.

Securely Storing config.json in HDFS with ACLs

Each user places their Docker authentication file at /user/<username>/config.json. Because this file contains credentials, it must be readable only by you and the YARN service account. Use HDFS ACLs to grant read access to YARN without making the file readable to other users or groups. The file should belong to the standard group hdfs, not a user group such as vito.

1. Upload the file to HDFS

Ensure your shell is configured for the new cluster (see the Hadoop Cluster page, section “Connecting to the New Hadoop Cluster”) before running any hdfs dfs commands, so the file is uploaded to the correct cluster.

# Set environment variables for the new cluster (if not already set)
export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH

hdfs dfs -put ~/.docker/config.json /user/$USER/config.json
# or
hdfs dfs -copyFromLocal ~/.docker/config.json /user/$USER/config.json

2. Verify owner and group

After uploading, verify that the file has the correct owner and group:

hdfs dfs -ls /user/$USER/config.json

The file should belong to your user and the group hdfs (e.g. tibodc hdfs).

3. Set file permissions

Restrict access so that only the owner can read and write, and the group can read. Other users must not have access:

hdfs dfs -chmod 640 /user/$USER/config.json

4. Set ACLs for YARN access

Grant the yarn user read access to the config file, and execute permission on your home directory so YARN can resolve the path to the file:

hdfs dfs -setfacl -m user:yarn:r-- /user/$USER/config.json
hdfs dfs -setfacl -m user:yarn:--x /user/$USER

5. Verify ACLs

hdfs dfs -getfacl /user/$USER
hdfs dfs -getfacl /user/$USER/config.json

Example expected output for config.json:

# file: /user/<username>/config.json
# owner: <username>
# group: hdfs
user::rw-
user:yarn:r--
group::r--
mask::r--
other::---

The + in the permission string (e.g. -rw-r-----+) indicates that ACLs are set.

Why this matters

Spark and YARN use this file as the Docker client configuration when pulling images. If the file is world-readable, any other user on the cluster could read your registry tokens. In addition, some secured configurations refuse to use credential files with overly permissive permissions (e.g. 644).

Add Configuration to spark-submit

Point Spark to your config.json on HDFS by adding the following to both appMasterEnv and executorEnv:

--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=hdfs:///user/$USER/config.json
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=hdfs:///user/$USER/config.json

This ensures all nodes use the same credentials when pulling the Docker image.

Important Considerations

  • The new cluster supports Spark 3.5.0 and Spark 4.0.1. Ensure your Docker image uses a compatible Spark version.
  • Spark 4.0.1 requires Java 17. The new cluster has Java 8, 11, and 17; set JAVA_HOME explicitly to Java 17 in your Docker image and in your Spark submission so the correct version is used.
  • The Spark version used in the Docker container must also be available on the machine where the job is submitted (Spark 3.5.0 and 4.0.1 are available on user VMs).
  • Always set PYSPARK_PYTHON to point to the Python executable inside your Docker container.
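One way to catch version mismatches early, sketched here with a hypothetical image name and on the assumption that Docker is available on your local machine, is to inspect the interpreter and JVM inside the image before submitting:

```shell
# Hypothetical image name; run on a machine with Docker installed.
IMAGE=your/image:tag
docker run --rm "$IMAGE" /usr/bin/python3.11 --version
docker run --rm "$IMAGE" java -version    # expect 17.x when targeting Spark 4.0.1
```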

Sample Code

Complete Docker-based Spark submission examples for the new cluster can be found in the hadoop-spark-samples repository. The repository includes working examples for both Spark 3.5.0 and Spark 4.0.1.


Copyright 2018 - 2025 VITO NV All Rights reserved

 