
Starting your first Spark job

To speed up processing, it is often desirable to distribute the work over a number of machines. The easiest way to do this is with the Apache Spark processing framework; the fastest way to get started is to read the Spark Quick Start.

On the new Hadoop cluster, there are three main submission methods:

  • Python virtual environment (archives) — Use a bundled Python environment on HDFS (see below). Suitable when your dependencies are pure Python or you already have a venv archive.
  • Conda — Use Conda to create and pack an environment, then ship it to the cluster (similar to venv). Suited for complex scientific dependencies. Examples are in the hadoop-spark-samples repository.
  • Docker containers — Use a Docker image for the driver and executors. Preferred for production and when you need specific Spark or Python versions. See Using Docker on Hadoop.

New Hadoop Cluster

The new Terrascope Hadoop cluster supports Spark 3.5.0 and Spark 4.0.1. To connect to the new cluster, you must set the appropriate environment variables before running spark-submit. See the Hadoop Cluster page for detailed instructions on setting up your environment.

Important: Terrascope currently operates two clusters: an old (legacy) cluster and a new, upgraded cluster. By default, the userVM is configured to interact with the old cluster, but we recommend switching to the new cluster now, as the old cluster will be phased out by the end of June 2026. To use the new cluster, export the following environment variables:

For Spark 3.5.0:

export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH
export SPARK_HOME=/opt/spark3_5_0/
export SPARK_CONF_DIR=/opt/spark3_5_0/conf2/

For Spark 4.0.1:

Spark 4.0.1 requires Java 17. The new cluster has Java 8, 11, and 17 installed, so set JAVA_HOME explicitly:

export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH
export SPARK_HOME=/opt/spark4_0_1/
export SPARK_CONF_DIR=/opt/spark4_0_1/conf/
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk

Spark is also installed on your Virtual Machine. If the spark-submit command is not found on your PATH, use ${SPARK_HOME}/bin/spark-submit instead. In the hadoop-spark-samples scripts you can set SPARK_VERSION=4.0.1 (or 4) before running to select Spark 4.0.1.
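The version switch can be sketched as a small shell helper. The installation paths below are taken from this page; the SPARK_VERSION handling mirrors the convention used in the hadoop-spark-samples scripts, but this is a sketch, not the repository's exact script:

```shell
# Select a Spark installation based on SPARK_VERSION (default: 3.5.0).
# Paths match the new-cluster layout described above; the case logic is a
# sketch of the convention used in the hadoop-spark-samples scripts.
SPARK_VERSION="${SPARK_VERSION:-3.5.0}"
case "$SPARK_VERSION" in
  4|4.*)
    export SPARK_HOME=/opt/spark4_0_1/
    export SPARK_CONF_DIR=/opt/spark4_0_1/conf/
    export JAVA_HOME=/usr/lib/jvm/java-17-openjdk   # Spark 4.0.1 requires Java 17
    ;;
  *)
    export SPARK_HOME=/opt/spark3_5_0/
    export SPARK_CONF_DIR=/opt/spark3_5_0/conf2/
    ;;
esac
echo "spark-submit: ${SPARK_HOME}bin/spark-submit"
```

Running the snippet with SPARK_VERSION unset selects the Spark 3.5.0 installation.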

On the new cluster, dynamic allocation and the shuffle service are already set in the Spark defaults of the installed Spark libraries, so you do not need to add them to your spark-submit command.

To run jobs on the Hadoop cluster, you must use the cluster deploy mode (formerly called yarn-cluster) and authenticate with Kerberos. To authenticate, run kinit on the command line; you will be prompted for your password. Two other useful commands are klist, which shows whether you are currently authenticated, and kdestroy, which clears all authentication information. Your Kerberos ticket expires after some time, so you will need to run kinit again.
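Putting the pieces together, a submission on the new cluster could look like the sketch below. It assumes the environment variables from this page are exported and that kinit has already been run; the application name and the job.py script are hypothetical placeholders:

```shell
# Sketch: submit a PySpark job to YARN in cluster deploy mode.
# Assumes a valid Kerberos ticket (kinit) and the new-cluster environment
# variables. "my-first-spark-job" and job.py are placeholder examples.
SPARK_HOME="${SPARK_HOME:-/opt/spark3_5_0/}"
SUBMIT_CMD="${SPARK_HOME}bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my-first-spark-job \
  job.py"
echo "$SUBMIT_CMD"
```

Dynamic allocation and shuffle-service options are deliberately absent: as noted above, they are already set in the Spark defaults on the new cluster.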

Submitting with a Python virtual environment (archives)

This method uses a Python environment packaged as an archive (e.g. a tarball) that you upload to HDFS once; jobs then reference it via --archives. Set the environment variables described above for the new cluster, set PYSPARK_PYTHON to the Python interpreter inside the archive, and use the correct HDFS path for your archive.

For a complete workflow (creating the venv, packing with venv-pack, runtime vs. HDFS staging) and ready-to-run scripts, see the hadoop-spark-samples repository, in particular the venv and conda folders.
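As an illustration, a venv-archive submission could look like the following sketch. The HDFS path, archive name, and job.py are hypothetical placeholders; the alias after # follows Spark's standard --archives syntax, under which the archive is unpacked on each node:

```shell
# Sketch: reference a packed venv on HDFS via --archives.
# The archive path and job script are placeholder examples; Spark unpacks
# the archive on each node under the alias "environment".
VENV_ARCHIVE="hdfs:///user/${USER:-me}/venvs/myenv.tar.gz"
export PYSPARK_PYTHON=./environment/bin/python   # Python inside the unpacked archive
SUBMIT_CMD="${SPARK_HOME:-/opt/spark3_5_0/}bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives ${VENV_ARCHIVE}#environment \
  job.py"
echo "$SUBMIT_CMD"
```

Because PYSPARK_PYTHON points at a path relative to the container working directory, the same archive serves both the driver and the executors in cluster deploy mode.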


Copyright 2018 - 2025 VITO NV All Rights reserved

 