This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Support SPARK_USER for specifying usernames to HDFS #408

Description

@kimoonkim

Sub-issue of #128.

@weiting-chen, @foxish, @ifilonenko

When the driver and executor pods access HDFS, their usernames currently appear as root to HDFS, because k8s pods do not have Linux user accounts other than root.

Both the Spark and Hadoop libraries support environment variables that can override the username presented to HDFS. Spark supports the SPARK_USER env var, whose value is stored in a field of the active SparkContext:

From SparkContext.scala (code):

  // Set SPARK_USER for user who is running SparkContext.
  val sparkUser = Utils.getCurrentUserName()

From Utils.scala (code):

  /**
   * Returns the current user name. This is the currently logged in user, unless that's been
   * overridden by the `SPARK_USER` environment variable.
   */
  def getCurrentUserName(): String = {
    Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
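
The override behavior above can be sketched in isolation. This is a self-contained illustration, not Spark's actual code: the `UserGroupInformation` fallback is replaced by a plain `fallback` parameter, and the class and method names here are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the SPARK_USER lookup: the env var, when present, wins over
// whatever user the process would otherwise report. The Hadoop
// UserGroupInformation fallback is stubbed out as a plain String so this
// compiles without Hadoop on the classpath.
public class UserResolution {
    static String resolveUser(Map<String, String> env, String fallback) {
        return Optional.ofNullable(env.get("SPARK_USER")).orElse(fallback);
    }
}
```

So in a pod whose only distinguishing configuration is the env var, `resolveUser(Map.of("SPARK_USER", "alice"), "root")` yields "alice", while an empty environment falls back to "root".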

The username from getCurrentUserName is then used by SparkHadoopUtil.runAsSparkUser.

From SparkHadoopUtil.scala (code):

  /**
   * Runs the given function with a Hadoop UserGroupInformation as a thread local variable
   * (distributed to child threads), used for authenticating HDFS and YARN calls.
   *
   * IMPORTANT NOTE: If this function is going to be called repeatedly in the same process
   * you need to look at https://issues.apache.org/jira/browse/HDFS-3545 and possibly
   * do a FileSystem.closeAllForUGI in order to avoid leaking Filesystems
   */
  def runAsSparkUser(func: () => Unit) {
    val user = Utils.getCurrentUserName()
    logDebug("running as user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      def run: Unit = func()
    })
  }
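
The wrapping pattern in runAsSparkUser can be shown with only java.security. This is a hedged sketch, not Spark's implementation: `FakeUgi` is a hypothetical stand-in for Hadoop's UserGroupInformation, which in the real code installs itself as the thread-local Hadoop user and carries credentials.

```java
import java.security.PrivilegedExceptionAction;

// Sketch of the doAs wrapping pattern: build a UGI for the resolved user,
// then run the supplied function inside ugi.doAs so all Hadoop calls made
// by the function see that user. FakeUgi is illustrative only.
public class RunAsSketch {
    static final class FakeUgi {
        final String userName;
        FakeUgi(String userName) { this.userName = userName; }
        <T> T doAs(PrivilegedExceptionAction<T> action) throws Exception {
            // The real UGI sets the thread-local Hadoop user here before running.
            return action.run();
        }
    }

    static String runAsUser(String user, PrivilegedExceptionAction<String> func)
            throws Exception {
        FakeUgi ugi = new FakeUgi(user);
        return ugi.doAs(func);
    }
}
```

The point of the pattern is that the function, not the caller, decides nothing about identity: whatever user the UGI was built with is what HDFS sees for every call made inside doAs.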

But it might be the case that the K8s driver or executor code does not call the runAsSparkUser method properly. We should look into this and make sure it works end to end. Note that SPARK_USER is supposed to override the username for both secure and non-secure HDFS.

The Hadoop library has another env/property variable, HADOOP_USER_NAME. But it appears to be redundant when SPARK_USER is specified, and it doesn't seem to work for secure HDFS. So we should probably focus on SPARK_USER support first.

From UserGroupInformation.java ([code](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L204)):

      //If we don't have a kerberos user and security is disabled, check
      //if user is specified in the environment or properties
      if (!isSecurityEnabled() && (user == null)) {
        String envUser = System.getenv(HADOOP_USER_NAME);
        if (envUser == null) {
          envUser = System.getProperty(HADOOP_USER_NAME);
        }
        user = envUser == null ? null : new User(envUser);
      }
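
The precedence in that insecure-mode check can be sketched as follows. The class and method names are illustrative, not Hadoop's:

```java
import java.util.Optional;

// Sketch of the HADOOP_USER_NAME precedence shown above (security
// disabled, no existing user): the environment variable wins, then the
// JVM system property, else no override at all.
public class HadoopUserResolution {
    static Optional<String> resolveHadoopUser(String envValue, String propValue) {
        if (envValue != null) {
            return Optional.of(envValue);
        }
        return Optional.ofNullable(propValue);
    }
}
```

Note this path is only reached when isSecurityEnabled() is false, which matches the observation above that HADOOP_USER_NAME doesn't work for secure HDFS.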
