Managing ownership and permissions
When running a Spark job that generates data on disk, it is important to set the correct user/group ownership and file permissions. User ownership is determined by the user for whom the Kerberos TGT was generated.
Managing group ownership
To determine the group owner, several options are available:
- By default, the user's primary group is used. Each user typically has a private group: for example, the user mep_<project> has a default group mep_<project>, so files are created with mep_<project>:mep_<project> ownership. If needed, the default group can be changed to a shared project group, allowing multiple users to read/write the files created by the mep_<project> user.
- Alternatively, the setgid bit can be set on a folder (chmod g+s <root_folder>). This ensures that files and directories created under this folder inherit the group ownership of the parent folder.
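The setgid mechanism above can be illustrated with a short Python sketch. The folder path is a temporary stand-in for <root_folder>; the os.chmod call is the programmatic equivalent of chmod g+s:

```python
import os
import stat
import tempfile

# Hypothetical shared project folder (stand-in for <root_folder>)
root = tempfile.mkdtemp()

# Equivalent of `chmod g+s <root_folder>`: add the setgid bit
os.chmod(root, os.stat(root).st_mode | stat.S_ISGID)

# Entries created below `root` now inherit its group ownership
sub = os.path.join(root, "output")
os.mkdir(sub)

has_setgid = bool(os.stat(root).st_mode & stat.S_ISGID)
same_group = os.stat(sub).st_gid == os.stat(root).st_gid
print(has_setgid, same_group)
```

With the setgid bit in place, new files and subdirectories take the folder's group instead of the creating process's primary group, which is what makes a shared project group workable.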
Managing permissions
On Linux systems, default permissions are set by the umask, which is usually inherited from the parent process or the OS. When running Spark jobs, the umask of the YARN container process is used, as this process acts as the parent of the Spark executor.
In the YARN code, the umask is hardcoded to 0027 (see line 2435). This generates files with -rw-r----- permissions. This setting cannot be managed at the OS level, as the setting in the YARN container overrides it.
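The effect of the hardcoded 0027 umask can be reproduced locally with a small Python sketch (the file name is illustrative):

```python
import os
import stat
import tempfile

# Reproduce the effect of YARN's hardcoded 0027 umask on a new file
old_umask = os.umask(0o027)
try:
    path = os.path.join(tempfile.mkdtemp(), "part-00000")
    with open(path, "w") as f:
        f.write("row\n")
    mode = stat.S_IMODE(os.stat(path).st_mode)
finally:
    os.umask(old_umask)  # restore the previous umask

# 0o666 & ~0o027 == 0o640, i.e. -rw-r-----
print(oct(mode), stat.filemode(os.stat(path).st_mode))
```

Files are created with mode 0666 masked by the umask, so 0027 removes group write and all "other" access, yielding -rw-r-----.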
If a user wants to override the default umask
, several options exist depending on the type of Spark job:
For Java/Scala Spark jobs that use Hadoop FileSystem classes to generate files, the spark.hadoop.fs.permissions.umask-mode=<umask>
configuration option can be set when submitting the Spark job. For example:
spark-submit ... --conf spark.hadoop.fs.permissions.umask-mode=022
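The same option can also be set programmatically when building the session. A minimal PySpark configuration sketch, assuming a working pyspark installation (the app name is illustrative; the option only affects files written through Hadoop FileSystem classes):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("umask-demo")  # hypothetical application name
    .config("spark.hadoop.fs.permissions.umask-mode", "022")
    .getOrCreate()
)
```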
For other jobs (e.g., PySpark jobs) where Hadoop FileSystem classes are typically not used to generate files, the umask
should be managed in the Python code. For instance:
import os

oldmask = os.umask(0o022)

# write files ...

# reset the previous umask
os.umask(oldmask)