Hadoop Cluster

In Terrascope, we harness the power of Hadoop, Spark, and YARN to manage and process our extensive Earth observation data efficiently. Our platform leverages these technologies to ensure robust data handling, fast processing, and optimal resource management.

Hadoop

At the core of our data storage system is Hadoop’s HDFS (Hadoop Distributed File System). It enables us to distribute vast amounts of satellite imagery across multiple nodes, ensuring reliability and accessibility.

Figure 1: Hadoop distinction between file storage and processing components (from https://luminousmen.com/post/hadoop-yarn-spark).

Spark

Spark integrates seamlessly with Hadoop, running on top of Hadoop YARN within Terrascope. This integration enables us to utilize Spark’s potent batch and stream processing features to tackle various big data challenges.

YARN

YARN (Yet Another Resource Negotiator) plays a vital role in our cluster management by overseeing resource allocation and task scheduling. Similar to an operating system, YARN manages our computing cluster’s storage (via HDFS) and processing resources. This ensures that computational tasks are allocated and scheduled optimally, maintaining the efficiency and performance of our platform. By leveraging YARN, we can dynamically manage resources, improving the overall utilisation and throughput of our data processing workflows.

Figure 2: YARN architecture (from https://luminousmen.com/post/hadoop-yarn-spark).

More details on the resource management by YARN can be found here.

Cluster overview

Terrascope currently operates two Hadoop clusters:

  • an old (legacy) Hadoop cluster, which many existing jobs still use and which is the default target for the userVM configuration;
  • a new Hadoop cluster (backed by the *.hadoop.rscluster.vito.be endpoints) that provides the upgraded environment described on this page.

While the old cluster remains available during a transition period, it is strongly recommended to start using the new cluster for all new workloads and to migrate existing jobs where feasible. The old cluster is planned to be phased out by the end of June 2026.

Connecting to the New Hadoop Cluster

Terrascope has migrated to a new, upgraded Hadoop cluster. By default, the userVM is configured to interact with the old cluster. To use the new cluster, you must set the following environment variables before running any commands.

For hdfs and yarn Commands

If you need to interact directly with HDFS or YARN on the new cluster, run these commands first:

export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH
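
With these variables set, a quick sanity check confirms that the `hdfs` and `yarn` clients now resolve to the new cluster's binaries and configuration. This is a sketch, not an exhaustive check; the HDFS paths and node output depend on your account and the cluster state:

```shell
# The clients should now come from /usr/local/hadoop/bin/
which hdfs
which yarn

# List your HDFS home directory and the cluster's active nodes
hdfs dfs -ls "/user/$USER"
yarn node -list
```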

For spark-submit Commands

To ensure your Spark jobs are submitted to the new cluster, you must set these variables in addition to the above. The new cluster supports both Spark 3.5.0 and Spark 4.0.1.

For Spark 3.5.0:

On the old cluster, Spark uses the configuration directory conf (e.g. /opt/spark/conf). On the new cluster, Spark 3.5.0 uses a separate directory: conf2. Set:

export SPARK_HOME=/opt/spark3_5_0/
export SPARK_CONF_DIR=/opt/spark3_5_0/conf2/

For Spark 4.0.1:

Spark 4.0.1 uses a single configuration directory and requires Java 17. The new cluster has Java 8, 11, and 17 installed, so you must set JAVA_HOME explicitly to Java 17 when using Spark 4.0.1:

export SPARK_HOME=/opt/spark4_0_1/
export SPARK_CONF_DIR=/opt/spark4_0_1/conf/
export JAVA_HOME=/usr/lib/jvm/jre-17

The Java 17 path may differ on your system (e.g. /usr/lib/jvm/java-17-openjdk or /usr/lib/jvm/jre-17). Use the path that matches your installation.
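
Because the Java 17 location varies between systems, a small shell snippet can pick whichever candidate directory exists. The candidate paths below are the examples mentioned on this page; extend the list to match your distribution:

```shell
# Try the Java 17 locations mentioned above, in order, and export the first match.
for candidate in /usr/lib/jvm/jre-17 /usr/lib/jvm/java-17-openjdk*; do
    if [ -d "$candidate" ]; then
        export JAVA_HOME="$candidate"
        break
    fi
done
echo "JAVA_HOME=${JAVA_HOME:-<not set>}"
```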

To make this easier, helper scripts are available in the hadoop-spark-samples repository. If you have cloned this repository, you can source these scripts:

source scripts/source_new_cluster      # For Spark 3.5.0
# or
source scripts/source_spark4.sh        # For Spark 4.0.1

Note: The source_spark4.sh script automatically sets JAVA_HOME to Java 17, which is required for Spark 4.0.1.

Tips:

  • If spark-submit is not found when you run it manually, use ${SPARK_HOME}/bin/spark-submit (the source scripts set SPARK_HOME).
  • In the hadoop-spark-samples repository, many submission scripts choose the Spark version via the SPARK_VERSION environment variable. Set SPARK_VERSION=4.0.1 or SPARK_VERSION=4 before running such a script to use Spark 4.0.1.
  • To verify that Java 17 is available: ls -d /usr/lib/jvm/java-17-openjdk* 2>/dev/null or run java -version after setting JAVA_HOME.
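
Putting the pieces together, a minimal submission to the new cluster looks like the following. The application file and job name are placeholders; SPARK_HOME is assumed to have been set by the exports or source scripts above:

```shell
# Authenticate with Kerberos first (required for all cluster jobs)
kinit

# Submit in yarn cluster deploy mode, as required on the Hadoop cluster
${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --name my-first-terrascope-job \
    my_job.py
```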

Submission methods

You can submit Spark jobs to the new cluster in three main ways. Docker is the preferred method for production workloads; Python virtual environments and Conda are also supported:

  1. Docker containers (preferred) — Run the driver and executors inside Docker images for full control over Spark and Python versions. See Using Docker on Hadoop for setup and configuration.
  2. Python virtual environment (archives) — Bundle your Python environment as an archive, upload it to HDFS once, and reference it in spark-submit with --archives. See Starting your first Spark job for a minimal example.
  3. Conda — Use Conda to create and pack an environment, then ship it to the cluster (similar to venv). Suited for complex scientific dependencies. Examples are in the hadoop-spark-samples repository.

Sample code and advanced examples (venv, conda, Docker) are available in the hadoop-spark-samples repository.
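
As an illustration of the venv-archive method (option 2), the general pattern looks roughly like this. The package names, archive name, and job file are placeholders; the authoritative, tested examples are in the hadoop-spark-samples repository:

```shell
# Build and pack a virtual environment on the userVM
python3 -m venv myenv
source myenv/bin/activate
pip install numpy venv-pack        # venv-pack bundles the env into a relocatable archive
venv-pack -o myenv.tar.gz

# Upload the archive to HDFS once, then reference it at submit time
hdfs dfs -put myenv.tar.gz "/user/$USER/myenv.tar.gz"
${SPARK_HOME}/bin/spark-submit \
    --master yarn --deploy-mode cluster \
    --archives "hdfs:///user/$USER/myenv.tar.gz#environment" \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    my_job.py
```

The `#environment` suffix tells YARN to unpack the archive under the alias `environment` in each container's working directory, which is why the Python paths reference `environment/bin/python`.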

Additional Resources

Various guides and best practices are available to help users set up a Spark job, manage resources, and access logs. These resources cover the following topics:

  • Start a Spark Job:

    To speed up processing in Terrascope, it is recommended to distribute tasks using the Apache Spark framework. The new cluster runs Spark 3.5.0 and 4.0.1. Spark is pre-installed on the Virtual Machine, so spark-submit can be run from the command line. Jobs on the Hadoop cluster must use the yarn cluster deploy mode, and you must authenticate with Kerberos by running kinit; klist and kdestroy help manage your authentication status. Make sure to set the appropriate environment variables (as described above) to connect to the new cluster. A Python Spark example is available to help users get started.

  • Manage Spark resources:

    Optimizing Spark resource management is crucial for efficient data processing. This guide explains how to configure Spark resources effectively to ensure optimal performance and utilisation; for an in-depth treatment, refer to the Spark Resource Management guide.

  • Advanced Kerberos:

    To use Hadoop services, including Spark, YARN, and HDFS, users must have a valid Ticket Granting Ticket (TGT). This page summarises insights on Kerberos.

  • Access Spark Logs:

    Spark logs are essential for monitoring and troubleshooting Spark jobs. Refer to the Accessing Spark Logs section for detailed instructions.

  • Using Docker on Hadoop:

    For users interested in running Docker on Hadoop, this page provides detailed setup instructions. This is useful when you need a Spark or Python version that is not available on the cluster.

  • Manage permissions and ownership:

    Managing file permissions and ownership is crucial for ensuring data security and integrity. See the Managing Spark file permissions and ownership page for guidance.

  • Spark best practices:

    To deepen your understanding of Spark best practices, refer to the Spark best practices guide, which provides insights into optimizing Spark workflows and enhancing data processing efficiency.
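
The Kerberos commands mentioned under "Start a Spark Job" above fit together as a simple session workflow:

```shell
kinit       # obtain a Ticket Granting Ticket (prompts for your password)
klist       # inspect the ticket cache to confirm the TGT is valid
# ... run spark-submit, hdfs, or yarn commands here ...
kdestroy    # discard the ticket when you are done
```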


Copyright 2018 - 2025 VITO NV All Rights reserved

 