Managing ownership and permissions
When running a Spark job that generates data on disk, it is important to set the correct user/group ownership and file permissions. User ownership is determined by the user for whom the Kerberos TGT was generated.
Managing group ownership
To determine the group owner, several options are available:
- By default, the user's primary group is used. Each user typically has a private group: for example, the user mep_<project> has a default group mep_<project>, so files are created with mep_<project>:mep_<project> ownership. If needed, the default group can be changed to a shared project group, allowing multiple users to read/write the files created by the mep_<project> user.
- Alternatively, the setgid bit can be set on a folder (chmod g+s <root_folder>). This ensures that files and directories created under this folder inherit the group ownership of the parent folder.
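The setgid mechanism above can be illustrated with a short Python sketch. The folder path is a temporary stand-in for <root_folder>; the os.chmod call is the programmatic equivalent of chmod g+s:

```python
import os
import stat
import tempfile

# Hypothetical shared project folder (stand-in for <root_folder>)
root = tempfile.mkdtemp()

# Equivalent of `chmod g+s <root_folder>`: add the setgid bit
os.chmod(root, os.stat(root).st_mode | stat.S_ISGID)

# Entries created below `root` now inherit its group ownership
sub = os.path.join(root, "output")
os.mkdir(sub)

has_setgid = bool(os.stat(root).st_mode & stat.S_ISGID)
same_group = os.stat(sub).st_gid == os.stat(root).st_gid
print(has_setgid, same_group)
```

With the setgid bit in place, new files and subdirectories take the folder's group instead of the creating process's primary group, which is what makes a shared project group workable.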
Managing permissions
On Linux systems, default permissions are set by the umask, which is usually inherited from the parent process or the OS. When running Spark jobs, the umask of the YARN container process is used, as this process acts as the parent of the Spark executor.
In the YARN code, the umask is hardcoded to 0027 (see line 2435). This generates files with -rw-r----- permissions. This setting cannot be managed at the OS level, as the setting in the YARN container overrides it.
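The effect of the hardcoded 0027 umask can be reproduced locally with a small Python sketch (the file name is illustrative):

```python
import os
import stat
import tempfile

# Reproduce the effect of YARN's hardcoded 0027 umask on a new file
old_umask = os.umask(0o027)
try:
    path = os.path.join(tempfile.mkdtemp(), "part-00000")
    with open(path, "w") as f:
        f.write("row\n")
    mode = stat.S_IMODE(os.stat(path).st_mode)
finally:
    os.umask(old_umask)  # restore the previous umask

# 0o666 & ~0o027 == 0o640, i.e. -rw-r-----
print(oct(mode), stat.filemode(os.stat(path).st_mode))
```

Files are created with mode 0666 masked by the umask, so 0027 removes group write and all "other" access, yielding -rw-r-----.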
If a user wants to override the default umask
, several options exist depending on the type of Spark job:
For Java/Scala Spark jobs that use Hadoop FileSystem classes to generate files, the spark.hadoop.fs.permissions.umask-mode=<umask>
configuration option can be set when submitting the Spark job. For example:
spark-submit ... --conf spark.hadoop.fs.permissions.umask-mode=022
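The same option can also be set programmatically when building the session. A minimal PySpark configuration sketch, assuming a working pyspark installation (the app name is illustrative; the option only affects files written through Hadoop FileSystem classes):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("umask-demo")  # hypothetical application name
    .config("spark.hadoop.fs.permissions.umask-mode", "022")
    .getOrCreate()
)
```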
For other jobs (e.g., PySpark jobs) where Hadoop FileSystem classes are typically not used to generate files, the umask
should be managed in the Python code. For instance:
import os

oldmask = os.umask(0o022)

# write files ...

# reset the previous umask
os.umask(oldmask)