Spark resource management
Spark jobs run on a shared processing cluster where YARN manages resources such as CPU and memory. The cluster allocates these resources among all running jobs according to the queue configuration and each job's requests. Resources are organized into queues, each with a minimum guaranteed share of the cluster's total CPU and memory. By default, jobs are placed in the 'default' queue, though other queues are also available.
On the new Hadoop cluster, dynamic allocation and the shuffle service are already enabled in the Spark defaults (spark-defaults.conf) of the installed Spark libraries. You do not need to set spark.dynamicAllocation.enabled or spark.shuffle.service.enabled again in your submission.
Partitioning: Job performance depends strongly on how data is partitioned. Too few partitions underuse the cluster; too many add scheduling overhead. A common rule of thumb is to size partitions so that each task runs for a few minutes. See the sample scripts in the hadoop-spark-samples repository for examples.
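The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is an illustration, not cluster policy: the 128 MB target partition size and the function name are assumptions you should tune for your own workload.

```python
# Rough partition-count estimate: total input size divided by a target
# partition size. The 128 MB default here is an assumption, not a cluster
# requirement -- tune it so each task runs for a few minutes.
def estimate_num_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Return a partition count that keeps partitions near the target size."""
    return max(1, -(-total_bytes // target_partition_bytes))  # ceiling division

# Example: a 10 GB input with 128 MB target partitions yields 80 partitions.
num = estimate_num_partitions(10 * 1024**3)
```

The resulting number can then be passed to, for example, rdd.repartition(num) or used as a value for spark.sql.shuffle.partitions.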
Memory allocation
Executor memory is set with --executor-memory, for example --executor-memory 1G.
During PySpark job execution, a large share of memory is consumed off-heap by the Python worker processes. It is therefore advisable to request extra overhead memory explicitly (the value is in MiB unless a unit is given):
--conf spark.executor.memoryOverhead=4000
To make jobs easier to schedule, keep memory requirements modest. Partitioning the input data into smaller blocks is one way to achieve this.
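To see why an explicit override matters, it helps to look at how Spark derives the default overhead on YARN: 10% of the executor memory, with a floor of 384 MiB. The sketch below models that default; the function name is ours, and the figures reflect the documented defaults of spark.executor.memoryOverheadFactor.

```python
# Sketch of Spark's default executor memory overhead on YARN:
# max(384 MiB, 10% of executor memory). Setting
# spark.executor.memoryOverhead explicitly overrides this default.
MIN_OVERHEAD_MIB = 384
OVERHEAD_FACTOR = 0.10  # documented default of spark.executor.memoryOverheadFactor

def default_overhead_mib(executor_memory_mib):
    """Approximate the default overhead Spark would request, in MiB."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FACTOR))

# A 1 GiB executor gets only the 384 MiB floor by default -- often too
# little for PySpark jobs whose Python workers live off-heap, hence the
# explicit --conf spark.executor.memoryOverhead=4000 above.
```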
Number of parallel jobs
Spark can dynamically determine the number of executors used in parallel. As noted above, dynamic allocation and the shuffle service are already enabled in the cluster's Spark defaults, so you do not need to pass these options. Only if you are using another environment, or need to override the defaults, should you set spark.dynamicAllocation.enabled=true and spark.shuffle.service.enabled=true yourself.
Optionally, upper and lower bounds can be set:
--conf spark.dynamicAllocation.maxExecutors=30
--conf spark.dynamicAllocation.minExecutors=10
If a fixed number of executors is preferred, use:
--num-executors 10
However, this approach is not recommended: it prevents the cluster resource manager from optimizing resource allocation, and executors allocated this way remain reserved even while no tasks are being processed.
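Putting the options above together, a submission with dynamic-allocation bounds might look like the following sketch (the script name and memory figures are placeholders, not cluster requirements):

```shell
spark-submit \
  --executor-memory 4G \
  --conf spark.executor.memoryOverhead=4000 \
  --conf spark.dynamicAllocation.minExecutors=10 \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  my_job.py
```

With these bounds, YARN can scale the job between 10 and 30 executors depending on the workload and on what other jobs in the queue are requesting.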
Using multiple cores per task
By default, each Spark executor uses one CPU core and each task occupies one core. To run tasks that require multiple cores, raise the --executor-cores parameter and set the spark.task.cpus option to match. For tasks needing two cores each:
spark-submit ... --executor-cores 2 --conf spark.task.cpus=2 ...
Spark blacklisting
If tasks consistently fail on the same node(s), enabling Spark blacklisting can help. This feature temporarily excludes a misbehaving worker node from scheduling (by default for one hour). To activate blacklisting, use --conf spark.blacklist.enabled=true. Further blacklisting options are detailed in the Spark documentation.
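As an illustration, a submission that enables blacklisting and lengthens the exclusion window might look like the following (the script name is a placeholder, and the 2h timeout is an example value for the spark.blacklist.timeout option, which defaults to 1h):

```shell
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.timeout=2h \
  my_job.py
```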