Starting your first Spark job
Processing is often much faster when the work is distributed across a number of machines. The easiest way to do this is with the Apache Spark processing framework; the fastest way to get started is to read the Spark Quick Start.
On the new Hadoop cluster, there are three main submission methods:
- Python virtual environment (archives) — Use a bundled Python environment on HDFS (see below). Suitable when your dependencies are pure Python or you already have a venv archive.
- Conda — Use Conda to create and pack an environment, then ship it to the cluster (similar to venv). Suited for complex scientific dependencies. Examples are in the hadoop-spark-samples repository.
- Docker containers — Use a Docker image for the driver and executors. Preferred for production and when you need specific Spark or Python versions. See Using Docker on Hadoop.
New Hadoop Cluster
The new Terrascope Hadoop cluster supports Spark 3.5.0 and Spark 4.0.1. To connect to the new cluster, you must set the appropriate environment variables before running spark-submit. See the Hadoop Cluster page for detailed instructions on setting up your environment.
Important: Terrascope currently operates two clusters: an old (legacy) cluster and a new, upgraded cluster. By default, the userVM is configured to interact with the old cluster, but it is preferred to start using the new cluster, as the old cluster will be phased out by the end of June 2026. To use the new cluster, you must export the following environment variables:
For Spark 3.5.0:
export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH
export SPARK_HOME=/opt/spark3_5_0/
export SPARK_CONF_DIR=/opt/spark3_5_0/conf2/

For Spark 4.0.1:
Spark 4.0.1 requires Java 17. The new cluster has Java 8, 11, and 17 installed, so set JAVA_HOME explicitly:
export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH
export SPARK_HOME=/opt/spark4_0_1/
export SPARK_CONF_DIR=/opt/spark4_0_1/conf/
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk

Spark is also installed on your Virtual Machine. If spark-submit is not found, use ${SPARK_HOME}/bin/spark-submit. In the hadoop-spark-samples scripts you can set SPARK_VERSION=4.0.1 (or 4) before running to select Spark 4.0.1.
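The version selection above can be wrapped in a small script. This is a sketch of the SPARK_VERSION convention used in the hadoop-spark-samples scripts; the case logic here is illustrative, not the repository's exact code:

```shell
#!/usr/bin/env bash
# Select a Spark install on the new cluster via SPARK_VERSION (default 3.5.0).
SPARK_VERSION="${SPARK_VERSION:-3.5.0}"

export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH

case "$SPARK_VERSION" in
  4|4.0.1)
    export SPARK_HOME=/opt/spark4_0_1/
    export SPARK_CONF_DIR=/opt/spark4_0_1/conf/
    export JAVA_HOME=/usr/lib/jvm/java-17-openjdk  # Spark 4.0.1 requires Java 17
    ;;
  *)
    export SPARK_HOME=/opt/spark3_5_0/
    export SPARK_CONF_DIR=/opt/spark3_5_0/conf2/
    ;;
esac

echo "Using Spark at ${SPARK_HOME}"
```

Run it with `SPARK_VERSION=4` (or `4.0.1`) in the environment to get the Spark 4.0.1 layout; any other value falls back to Spark 3.5.0.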
On the new cluster, dynamic allocation and the shuffle service are already set in the Spark defaults of the installed Spark libraries, so you do not need to add them to your spark-submit command.
To run jobs on the Hadoop cluster, you must use the cluster deploy mode (formerly yarn-cluster) and authenticate with Kerberos. To authenticate, run kinit on the command line; you will be asked to provide your password. Two other useful commands are klist, which shows whether you are authenticated, and kdestroy, which clears all authentication information. After some time your Kerberos ticket will expire, so you'll need to run kinit again.
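Put together, a typical session looks like the sketch below. The application name and script are placeholders, not commands from this documentation:

```shell
# Authenticate with Kerberos (prompts for your password)
kinit
klist   # confirm you hold a valid ticket

# Submit to YARN in cluster deploy mode.
# "my-first-job" and my_job.py are placeholders for your own name and script.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my-first-job \
  my_job.py

# Optionally clear your credentials when done
kdestroy
```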
Submitting with a Python virtual environment (archives)
This method uses a Python environment packaged as an archive (e.g. a tarball) that you upload to HDFS once; jobs then reference it via --archives. Set the environment variables described above for the new cluster, point PYSPARK_PYTHON at the Python interpreter inside the archive, and use the correct HDFS path for your archive.
For a complete workflow (creating the venv, packing with venv-pack, runtime vs. HDFS staging) and ready-to-run scripts, see the hadoop-spark-samples repository, in particular the venv and conda folders.
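As a sketch of the submission step (the archive path, the `environment` alias, and the script name are all placeholders; the full workflow lives in the hadoop-spark-samples repository):

```shell
# The archive was packed once (e.g. with venv-pack) and uploaded to HDFS.
# The '#environment' suffix is the alias under which YARN unpacks the archive
# in each container's working directory.
export PYSPARK_PYTHON=./environment/bin/python

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives "hdfs:///path/to/your/venv.tar.gz#environment" \
  my_job.py
```

Because PYSPARK_PYTHON is a path relative to the container working directory, the same value works on the driver and every executor once the archive is distributed.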