Starting your first Spark job
Processing is often much faster when the work is distributed across a number of machines. The easiest way to do this is with the Apache Spark processing framework; the fastest way to get started is to read the Spark Quick Start.
On the new Hadoop cluster, there are three main submission methods:
- Python virtual environment (archives) — Use a bundled Python environment on HDFS (see below). Suitable when your dependencies are pure Python or you already have a venv archive.
- Conda — Use Conda to create and pack an environment, then ship it to the cluster (similar to venv). Suited for complex scientific dependencies. Examples are in the hadoop-spark-samples repository.
- Docker containers — Use a Docker image for the driver and executors. Preferred for production and when you need specific Spark or Python versions. See Using Docker on Hadoop.
New Hadoop Cluster
The new Terrascope Hadoop cluster supports Spark 3.5.0 and Spark 4.0.1. To connect to the new cluster, you must set the appropriate environment variables before running spark-submit. See the Hadoop Cluster page for detailed instructions on setting up your environment.
Important: Terrascope currently operates two clusters: an old (legacy) cluster and a new, upgraded cluster. By default, the userVM is configured to interact with the old cluster, but it is preferred to start using the new cluster, as the old cluster will be phased out by the end of June 2026. To use the new cluster, you must export the following environment variables:
For Spark 3.5.0:
export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH
export SPARK_HOME=/opt/spark3_5_0/
export SPARK_CONF_DIR=/opt/spark3_5_0/conf2/

For Spark 4.0.1:
Spark 4.0.1 requires Java 17. The new cluster has Java 8, 11, and 17 installed, so set JAVA_HOME explicitly:
export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH
export SPARK_HOME=/opt/spark4_0_1/
export SPARK_CONF_DIR=/opt/spark4_0_1/conf/
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk

Spark is also installed on your Virtual Machine. If spark-submit is not found, use ${SPARK_HOME}/bin/spark-submit. In the hadoop-spark-samples scripts you can set SPARK_VERSION=4.0.1 (or 4) before running to select Spark 4.0.1.
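The version selection above can be wrapped in a small script. This is a sketch of the SPARK_VERSION convention used in the hadoop-spark-samples scripts; the case logic here is illustrative, not the repository's exact code:

```shell
#!/usr/bin/env bash
# Select a Spark install on the new cluster via SPARK_VERSION (default 3.5.0).
SPARK_VERSION="${SPARK_VERSION:-3.5.0}"

export HADOOP_HOME=/usr/local/hadoop/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export PATH=/usr/local/hadoop/bin/:$PATH

case "$SPARK_VERSION" in
  4|4.0.1)
    export SPARK_HOME=/opt/spark4_0_1/
    export SPARK_CONF_DIR=/opt/spark4_0_1/conf/
    export JAVA_HOME=/usr/lib/jvm/java-17-openjdk  # Spark 4.0.1 requires Java 17
    ;;
  *)
    export SPARK_HOME=/opt/spark3_5_0/
    export SPARK_CONF_DIR=/opt/spark3_5_0/conf2/
    ;;
esac

echo "Using Spark at ${SPARK_HOME}"
```

Run it with `SPARK_VERSION=4` (or `4.0.1`) in the environment to get the Spark 4.0.1 layout; any other value falls back to Spark 3.5.0.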
On the new cluster, dynamic allocation and the shuffle service are already set in the Spark defaults of the installed Spark libraries, so you do not need to add them to your spark-submit command.
To run jobs on the Hadoop cluster, you must use the cluster deploy mode (formerly yarn-cluster) and authenticate with Kerberos. To authenticate, run kinit on the command line; you will be asked to provide your password. Two other useful commands are klist, which shows whether you are authenticated, and kdestroy, which clears all authentication information. After some time your Kerberos ticket will expire, so you'll need to run kinit again.
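Put together, a typical session looks like the sketch below. The application name and script are placeholders, not commands from this documentation:

```shell
# Authenticate with Kerberos (prompts for your password)
kinit
klist   # confirm you hold a valid ticket

# Submit to YARN in cluster deploy mode.
# "my-first-job" and my_job.py are placeholders for your own name and script.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my-first-job \
  my_job.py

# Optionally clear your credentials when done
kdestroy
```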
Submitting with a Python virtual environment (archives)
This method uses a Python environment packaged as an archive (e.g. a tarball) that you upload to HDFS once; jobs then reference it via --archives. Set the environment variables described above for the new cluster, point PYSPARK_PYTHON at the Python interpreter inside the archive, and use the correct HDFS path for your archive.
For a complete workflow (creating the venv, packing with venv-pack, runtime vs. HDFS staging) and ready-to-run scripts, see the hadoop-spark-samples repository, in particular the venv and conda folders.
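As a sketch of the submission step (the archive path, the `environment` alias, and the script name are all placeholders; the full workflow lives in the hadoop-spark-samples repository):

```shell
# The archive was packed once (e.g. with venv-pack) and uploaded to HDFS.
# The '#environment' suffix is the alias under which YARN unpacks the archive
# in each container's working directory.
export PYSPARK_PYTHON=./environment/bin/python

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives "hdfs:///path/to/your/venv.tar.gz#environment" \
  my_job.py
```

Because PYSPARK_PYTHON is a path relative to the container working directory, the same value works on the driver and every executor once the archive is distributed.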