Support SPARK_USER for specifying usernames to HDFS #408
Description
Sub-issue of #128.
@weiting-chen, @foxish, @ifilonenko
When the driver and executor pods access HDFS, the usernames currently appear to HDFS as root, because k8s pods do not have Linux user accounts other than root.
Both the Spark and Hadoop libraries support environment variables that can override the username presented to HDFS. Spark supports the SPARK_USER env var, whose value is stored in a field of the active SparkContext:
From SparkContext.scala (code):
// Set SPARK_USER for user who is running SparkContext.
val sparkUser = Utils.getCurrentUserName()
From Utils.scala (code):
/**
* Returns the current user name. This is the currently logged in user, unless that's been
* overridden by the `SPARK_USER` environment variable.
*/
def getCurrentUserName(): String = {
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
}
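As a concrete (hypothetical) illustration of this override, the same lookup can be reproduced outside of Spark, assuming hadoop-common is on the classpath:
import org.apache.hadoop.security.UserGroupInformation

object SparkUserOverrideSketch {
  def main(args: Array[String]): Unit = {
    // With SPARK_USER=alice exported in the pod environment, "alice" is returned;
    // otherwise this falls back to the OS/Kerberos user, which is root in a pod.
    val user = Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
    println(s"HDFS calls will be attributed to: $user")
  }
}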
The sparkUser field is then used by SparkHadoopUtil.runAsSparkUser.
From SparkHadoopUtil.scala (code):
/**
* Runs the given function with a Hadoop UserGroupInformation as a thread local variable
* (distributed to child threads), used for authenticating HDFS and YARN calls.
*
* IMPORTANT NOTE: If this function is going to be called repeated in the same process
* you need to look https://issues.apache.org/jira/browse/HDFS-3545 and possibly
* do a FileSystem.closeAllForUGI in order to avoid leaking Filesystems
*/
def runAsSparkUser(func: () => Unit) {
  val user = Utils.getCurrentUserName()
  logDebug("running as user: " + user)
  val ugi = UserGroupInformation.createRemoteUser(user)
  transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    def run: Unit = func()
  })
}
However, it might be the case that the K8s driver or executor code is not calling runAsSparkUser properly. We should look into this and make sure it works end-to-end; a rough sketch of the expected wiring follows below. Note that SPARK_USER is supposed to override the username for both secure and non-secure HDFS.
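As an illustration only (the entry-point name and wiring below are hypothetical, not the fork's actual code), the K8s driver/executor entry point would need to wrap its main logic in runAsSparkUser so that HDFS calls pick up the SPARK_USER identity:
import org.apache.spark.deploy.SparkHadoopUtil

// Hypothetical entry point; the real K8s backend classes may differ.
object KubernetesEntryPointSketch {
  def main(args: Array[String]): Unit = {
    // SPARK_USER is expected to be injected into the pod environment, so
    // Utils.getCurrentUserName() inside runAsSparkUser resolves to it instead
    // of the container's Linux user (root).
    SparkHadoopUtil.get.runAsSparkUser { () =>
      // ... start the driver / executor backend here ...
    }
  }
}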
The Hadoop library has another env/system-property variable, HADOOP_USER_NAME, but it appears to be redundant when SPARK_USER is specified, and it does not seem to work for secure HDFS. So we should probably focus on SPARK_USER support first.
From UserGroupInformation.java (code: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L204):
//If we don't have a kerberos user and security is disabled, check
//if user is specified in the environment or properties
if (!isSecurityEnabled() && (user == null)) {
  String envUser = System.getenv(HADOOP_USER_NAME);
  if (envUser == null) {
    envUser = System.getProperty(HADOOP_USER_NAME);
  }
  user = envUser == null ? null : new User(envUser);
}
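For reference, a minimal sketch (hypothetical, assuming hadoop-common on the classpath and fs.defaultFS pointing at a non-secure HDFS) of the HADOOP_USER_NAME path:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HadoopUserNameSketch {
  def main(args: Array[String]): Unit = {
    // Per the check above, this only takes effect when security is disabled and
    // no Kerberos user is present, and it must be set before UserGroupInformation
    // resolves the login user for the first time.
    System.setProperty("HADOOP_USER_NAME", "alice") // or export HADOOP_USER_NAME=alice
    val fs = FileSystem.get(new Configuration())
    // On a non-secure HDFS, the directory below is created as user "alice".
    fs.mkdirs(new Path("/tmp/hadoop-user-name-test"))
  }
}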