This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Support SPARK_USER for specifying usernames to HDFS #408

Description

@kimoonkim

Sub-issue of #128.

@weiting-chen, @foxish, @ifilonenko

When the driver and executor pods access HDFS, their usernames currently appear as root to HDFS, because k8s pods do not have Linux user accounts other than root.

Both the Spark and Hadoop libraries support environment variables that can override the username presented to HDFS. Spark supports the SPARK_USER env var, whose value is stored in a field of the active SparkContext:

From SparkContext.scala (code):

  // Set SPARK_USER for user who is running SparkContext.
  val sparkUser = Utils.getCurrentUserName()

From Utils.scala (code):

  /**
   * Returns the current user name. This is the currently logged in user, unless that's been
   * overridden by the `SPARK_USER` environment variable.
   */
  def getCurrentUserName(): String = {
    Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
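
The override behavior above can be sketched in isolation. This is a self-contained illustration, not Spark's actual code: the `UserGroupInformation` fallback is replaced by a plain `fallback` parameter, and the class and method names here are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the SPARK_USER lookup: the env var, when present, wins over
// whatever user the process would otherwise report. The Hadoop
// UserGroupInformation fallback is stubbed out as a plain String so this
// compiles without Hadoop on the classpath.
public class UserResolution {
    static String resolveUser(Map<String, String> env, String fallback) {
        return Optional.ofNullable(env.get("SPARK_USER")).orElse(fallback);
    }
}
```

So in a pod whose only distinguishing configuration is the env var, `resolveUser(Map.of("SPARK_USER", "alice"), "root")` yields "alice", while an empty environment falls back to "root".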

The username from getCurrentUserName is then used by SparkHadoopUtil.runAsSparkUser.

From SparkHadoopUtil.scala (code):

  /**
   * Runs the given function with a Hadoop UserGroupInformation as a thread local variable
   * (distributed to child threads), used for authenticating HDFS and YARN calls.
   *
   * IMPORTANT NOTE: If this function is going to be called repeatedly in the same process
   * you need to look at https://issues.apache.org/jira/browse/HDFS-3545 and possibly
   * do a FileSystem.closeAllForUGI in order to avoid leaking Filesystems
   */
  def runAsSparkUser(func: () => Unit) {
    val user = Utils.getCurrentUserName()
    logDebug("running as user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      def run: Unit = func()
    })
  }
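
The wrapping pattern in runAsSparkUser can be shown with only java.security. This is a hedged sketch, not Spark's implementation: `FakeUgi` is a hypothetical stand-in for Hadoop's UserGroupInformation, which in the real code installs itself as the thread-local Hadoop user and carries credentials.

```java
import java.security.PrivilegedExceptionAction;

// Sketch of the doAs wrapping pattern: build a UGI for the resolved user,
// then run the supplied function inside ugi.doAs so all Hadoop calls made
// by the function see that user. FakeUgi is illustrative only.
public class RunAsSketch {
    static final class FakeUgi {
        final String userName;
        FakeUgi(String userName) { this.userName = userName; }
        <T> T doAs(PrivilegedExceptionAction<T> action) throws Exception {
            // The real UGI sets the thread-local Hadoop user here before running.
            return action.run();
        }
    }

    static String runAsUser(String user, PrivilegedExceptionAction<String> func)
            throws Exception {
        FakeUgi ugi = new FakeUgi(user);
        return ugi.doAs(func);
    }
}
```

The point of the pattern is that the function, not the caller, decides nothing about identity: whatever user the UGI was built with is what HDFS sees for every call made inside doAs.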

But it might be the case that the K8s driver or executor code does not call the runAsSparkUser method properly. We should look into this and make sure it works end to end. Note that SPARK_USER is supposed to override the username for both secure and non-secure HDFS.

The Hadoop library has another env/property variable, HADOOP_USER_NAME. But it appears to be redundant when SPARK_USER is specified, and it doesn't seem to work for secure HDFS. So we should probably focus on SPARK_USER support first.

From UserGroupInformation.java ([code](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L204)):

      //If we don't have a kerberos user and security is disabled, check
      //if user is specified in the environment or properties
      if (!isSecurityEnabled() && (user == null)) {
        String envUser = System.getenv(HADOOP_USER_NAME);
        if (envUser == null) {
          envUser = System.getProperty(HADOOP_USER_NAME);
        }
        user = envUser == null ? null : new User(envUser);
      }
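
The precedence in that insecure-mode check can be sketched as follows. The class and method names are illustrative, not Hadoop's:

```java
import java.util.Optional;

// Sketch of the HADOOP_USER_NAME precedence shown above (security
// disabled, no existing user): the environment variable wins, then the
// JVM system property, else no override at all.
public class HadoopUserResolution {
    static Optional<String> resolveHadoopUser(String envValue, String propValue) {
        if (envValue != null) {
            return Optional.of(envValue);
        }
        return Optional.ofNullable(propValue);
    }
}
```

Note this path is only reached when isSecurityEnabled() is false, which matches the observation above that HADOOP_USER_NAME doesn't work for secure HDFS.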
