Managing ownership and permissions

When running a Spark job that generates data on disk, it is important to set the correct user/group ownership and file permissions. User ownership is determined by the user for whom the Kerberos TGT was generated.

Managing group ownership

To determine the group owner, several options are available:

  • By default, the user's primary group is used. Each user typically has a private group: for example, the user mep_<project> has a default group mep_<project>, so files are created with mep_<project>:mep_<project> ownership. If needed, the default group can be changed to a shared project group, allowing multiple users to read/write the files created by the mep_<project> user.
  • Alternatively, the setgid bit can be set on a folder (chmod g+s <root_folder>). This ensures that files and directories created under that folder inherit the group ownership of the parent folder.
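The setgid approach can also be applied programmatically. As a small illustration (the folder below is a hypothetical stand-in for a shared project directory, not a Terrascope path):

```python
import os
import stat
import tempfile

# Hypothetical folder standing in for a shared project directory.
root = tempfile.mkdtemp()

# Equivalent of `chmod g+s <root_folder>`: add the setgid bit while
# keeping the existing permission bits.
mode = stat.S_IMODE(os.stat(root).st_mode)
os.chmod(root, mode | stat.S_ISGID)

# Files and directories created under `root` now inherit its group.
```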

Managing permissions

On Linux systems, default file permissions are determined by the umask, which a process normally inherits from its parent (or from the OS defaults). When running Spark jobs, the umask of the YARN container process applies, as that process is the parent of the Spark executor.

In the YARN code, the umask is hardcoded to 0027 (see line 2435). This generates files with -rw-r----- permissions. This setting can't be managed at the OS level, as the value hardcoded in the YARN container overrides it.
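The effect of that umask can be reproduced locally (a small illustration, not YARN itself): with a umask of 0027, a newly created file gets mode 0666 & ~0027 = 0640, i.e. -rw-r-----:

```python
import os
import stat
import tempfile

old = os.umask(0o027)  # mimic the umask hardcoded in the YARN container
try:
    path = os.path.join(tempfile.mkdtemp(), "example.txt")
    with open(path, "w") as f:
        f.write("data")
    # 0o666 & ~0o027 == 0o640, i.e. -rw-r-----
    print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # 0o640
finally:
    os.umask(old)  # restore the original umask
```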

If a user wants to override the default umask, several options exist depending on the type of Spark job:

For Java/Scala Spark jobs that use Hadoop FileSystem classes to generate files, the spark.hadoop.fs.permissions.umask-mode=<umask> configuration option can be set when submitting the Spark job. For example:


spark-submit ... --conf spark.hadoop.fs.permissions.umask-mode=022

For other jobs (e.g., PySpark jobs) where Hadoop FileSystem classes are typically not used to generate files, the umask should be managed in the Python code. For instance:


import os

# temporarily relax the umask for file creation
oldmask = os.umask(0o022)
try:
    pass  # write files ...
finally:
    # restore the previous umask, even if writing fails
    os.umask(oldmask)
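To make the restore step harder to forget, the same pattern can be wrapped in a small context manager (a convenience sketch, not part of Spark or Hadoop):

```python
import os
from contextlib import contextmanager

@contextmanager
def temporary_umask(mask):
    """Set the process umask for the duration of the block, then restore it."""
    old = os.umask(mask)
    try:
        yield old
    finally:
        os.umask(old)

# Files written inside the block are created with mode 0o666 & ~0o022 == 0o644.
with temporary_umask(0o022):
    pass  # write files ...
```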

Copyright 2018 - 2024 VITO NV All Rights reserved

 