
Conversation

sunchao (Member) commented Nov 17, 2021

What changes were proposed in this pull request?

This adds a new config `spark.yarn.am.tokenConfRegex`, similar to `mapreduce.job.send-token-conf` introduced via YARN-5910. It lets the YARN AM pass Hadoop configs, such as `dfs.nameservices`, `dfs.ha.namenodes.*`, and `dfs.namenode.rpc-address.*`, to the RM for renewing delegation tokens.

Why are the changes needed?

YARN-5910 introduced a new config `mapreduce.job.send-token-conf`, which can be used to pass a job's local configuration to the RM, which then uses it when renewing delegation tokens. A typical use case is a YARN cluster that needs to talk to multiple HDFS clusters, where the RM may not have all the configs (e.g., `dfs.nameservices`, `dfs.ha.namenodes.<nameservice>.*`, `dfs.namenode.rpc-address.*`) needed to connect to those clusters when renewing delegation tokens. In this case, clients can use this feature to pass their local HDFS configs to the RM.

Does this PR introduce any user-facing change?

Yes, this introduces a new config, `spark.yarn.am.tokenConfRegex`, to Spark users. It is disabled by default.

How was this patch tested?

It seems difficult to write a unit test for this. I manually tested it against a YARN cluster running Hadoop 3.x and it worked as expected.

```
$SPARK_HOME/bin/spark-shell --master yarn \
            --deploy-mode client \
            --conf spark.driver.extraClassPath="${HADOOP_CONF_DIR}" \
            --conf spark.executor.extraClassPath="${HADOOP_CONF_DIR}" \
            --conf spark.yarn.am.tokenConfRegex="^dfs.nameservices$|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$|^dfs.namenode.kerberos.principal|^dfs.namenode.kerberos.principal.pattern" \
            --conf spark.yarn.access.hadoopFileSystems="<HDFS_URI>"
```
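For illustration, the key-filtering this config drives can be sketched in Python (the sample config keys and the `select_token_conf` helper are hypothetical; the actual implementation lives in Spark's YARN client, in Scala):

```python
import re

# The regex from the spark-shell invocation above.
TOKEN_CONF_REGEX = (
    r"^dfs.nameservices$|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$"
    r"|^dfs.client.failover.proxy.provider.*$|^dfs.namenode.kerberos.principal"
    r"|^dfs.namenode.kerberos.principal.pattern"
)

def select_token_conf(conf: dict, pattern: str) -> dict:
    """Keep only entries whose key matches the regex; conceptually, these
    are the configs the AM would ship to the RM for token renewal."""
    rx = re.compile(pattern)
    return {k: v for k, v in conf.items() if rx.search(k)}

# Hypothetical client-side Hadoop configuration.
local_conf = {
    "dfs.nameservices": "ns1",
    "dfs.ha.namenodes.ns1": "nn1,nn2",
    "dfs.namenode.rpc-address.ns1.nn1": "nn1.example.com:8020",
    "dfs.replication": "3",  # not needed by the RM; filtered out
}

print(select_token_conf(local_conf, TOKEN_CONF_REGEX))
```

Only the nameservice, HA, and RPC-address keys survive the filter; unrelated client settings like `dfs.replication` are not sent.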

github-actions bot added the YARN label Nov 17, 2021
sunchao (Member, Author) commented Nov 17, 2021

cc @gaborgsomogyi @xkrogen

SparkQA commented Nov 17, 2021

Test build #145343 has finished for PR 34635 at commit b2411ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 17, 2021

Test build #145345 has finished for PR 34635 at commit 8c6e5b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49814/

SparkQA commented Nov 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49816/

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49814/

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49816/

"'mapreduce.job.send-token-conf'. Please check YARN-5910 for more details.")
.version("3.3.0")
.stringConf
.createWithDefault("")
Member:
Since this is a regex expression, what does empty string regex mean here as a default value?

Member:
If it's not clear, shall we use .createOptional?

Member Author:
Good suggestion. I think createOptional is better.
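As a quick aside on why the empty-string default was ambiguous: in most regex engines an empty pattern matches (a zero-width prefix of) every string, so `""` could plausibly be read as "send everything" rather than "disabled". A minimal Python check:

```python
import re

# An empty pattern matches any input at position 0, so an empty-string
# default is ambiguous; an absent (createOptional) config avoids this.
print(re.search("", "dfs.nameservices") is not None)  # True
print(re.search("", "") is not None)                  # True
```

Making the config optional sidesteps the question entirely: unset means the feature is off, no regex semantics involved.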

"needs to talk to multiple downstream HDFS clusters, where the YARN RM may not have " +
"configs (e.g., dfs.nameservices, dfs.ha.namenodes.*, dfs.namenode.rpc-address.*)" +
"to connect to these clusters. This config is very similar to " +
"'mapreduce.job.send-token-conf'. Please check YARN-5910 for more details.")
Member:

We had better mention explicitly that this config is ignored in Hadoop 2.7, because we still ship a Hadoop 2.7 distribution.

Member Author:
Yea I missed that, added.

```
.createWithDefault(false)

private[spark] val AM_SEND_TOKEN_CONF =
ConfigBuilder("spark.yarn.am.sendTokenConf")
```
Member:

nit: `spark.yarn.am.tokenConf` instead of `spark.yarn.am.sendTokenConf`? `sendTokenConf` sounds like a boolean config (send or not send).

Member Author:

I wasn't sure what a good name would be, so I just followed the Hadoop-side config name. `send` here is supposed to mean that the token conf is sent from the AM to the RM.

mridulm (Contributor) Nov 19, 2021:

Use regexConf? Also, add .regex to the config name? (in addition to @dongjoon-hyun's suggestion for the rename).

Take a look at spark.redaction.string.regex for an example.

Member Author:
Thanks, I updated the config name to spark.yarn.am.tokenConfRegex. Let me know if this looks better.

```
}
}
copy.write(dob);
amContainer.setTokensConf(ByteBuffer.wrap(dob.getData))
```
Member:
Just a question. The compilation works with Hadoop 2.7, right?

Member Author:

Oops, it won't work for Hadoop 2.7. Hmm, let me think about how to make it work...

Member Author:

I guess we can follow ResourceRequestHelper and use reflection to look up the method when using Hadoop 3.x, to avoid the compilation error.
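The reflection approach here (resolve the method at runtime and skip the call when it is absent, so the same binary works against older Hadoop) is analogous to this Python sketch; the container classes and helper name are hypothetical stand-ins for the Java/Scala reflection lookup:

```python
def set_tokens_conf_if_supported(am_container, tokens_conf_bytes):
    """Call setTokensConf only when the runtime object provides it,
    mirroring a reflection-based lookup for the Hadoop 2.9+/3.x API."""
    setter = getattr(am_container, "setTokensConf", None)
    if setter is None:
        return False  # older API (e.g. Hadoop 2.7): skip quietly
    setter(tokens_conf_bytes)
    return True

class OldContainer:        # no setTokensConf, like Hadoop 2.7
    pass

class NewContainer:        # has setTokensConf, like Hadoop 2.9+/3.x
    def setTokensConf(self, data):
        self.tokens_conf = data

print(set_tokens_conf_if_supported(OldContainer(), b"conf"))  # False
print(set_tokens_conf_if_supported(NewContainer(), b"conf"))  # True
```

The key property is that the lookup failure is handled gracefully instead of failing at compile or link time, which is exactly why reflection is used on the JVM side.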

dongjoon-hyun (Member) left a comment:
Thank you, @sunchao . It looks reasonable.

Sorry, but I need to ask if you think we can add a test coverage for this.

sunchao (Member, Author) commented Nov 18, 2021

Sorry, but I need to ask if you think we can add a test coverage for this.

I mentioned this a bit in the PR description. It's pretty hard to come up with an e2e test for this, especially with Kerberos involved. I checked a few related PRs such as #31761 and #23525 and they also didn't come with tests.

SparkQA commented Nov 18, 2021

Test build #145369 has finished for PR 34635 at commit 37c430c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49841/

SparkQA commented Nov 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49841/

dongjoon-hyun (Member) left a comment:
+1, LGTM. I agree with you about the test case. Thanks, @sunchao .

```
}
copy.write(dob);

// since this method was added in Hadoop 2.9 and 3.0, we use reflection here to avoid
```
Contributor:

Are we doing this only for 3.x? If not, relax the isHadoop3 condition?

Member Author:

Yes, since Spark only ships built-in profiles for Hadoop 2.7 and 3.3, we have the check here. Do you mean also supporting custom Hadoop versions 2.9+ via -Phadoop.version=2.9.x?

Contributor:
Exactly - both 2.9 and 2.10 for example.

Member Author:
Got it. I added the change.
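Relaxing the check from "Hadoop 3.x only" to "Hadoop 2.9+" boils down to a version comparison like this sketch (the function name is illustrative, not Spark's actual helper):

```python
def supports_set_tokens_conf(hadoop_version: str) -> bool:
    """setTokensConf was added in Hadoop 2.9 and 3.0, so accept any
    version >= 2.9 rather than only 3.x."""
    major, minor = (int(p) for p in hadoop_version.split(".")[:2])
    return (major, minor) >= (2, 9)

for v in ["2.7.4", "2.9.2", "2.10.1", "3.3.1"]:
    print(v, supports_set_tokens_conf(v))
```

Comparing (major, minor) tuples rather than raw strings matters here: a string comparison would wrongly rank "2.10" below "2.9".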

Member Author:
Gently ping @mridulm . Does the latest change look good to you?

sunchao changed the title [SPARK-37205][YARN] Introduce a new config 'spark.yarn.am.sendTokenConf' to support renewing delegation tokens in a multi-cluster environment → [SPARK-37205][YARN] Introduce a new config 'spark.yarn.am.tokenConfRegex' to support renewing delegation tokens in a multi-cluster environment, Nov 19, 2021
SparkQA commented Nov 19, 2021

Test build #145464 has finished for PR 34635 at commit 8ff6be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49936/

SparkQA commented Nov 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49936/

dongjoon-hyun (Member):

Could you review this once more please, @mridulm ?

SparkQA commented Nov 30, 2021

Test build #145752 has finished for PR 34635 at commit c69cb6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50224/

SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50223/

SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50223/

SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50224/

sunchao (Member, Author) commented Dec 8, 2021

@dongjoon-hyun could you help double-check the new changes and see if they look good to you? If so, I'm going to merge this soon. Thanks.

dongjoon-hyun (Member) left a comment:
+1, LGTM. New changes look good to me, @sunchao .

sunchao closed this in 77a8778 Dec 8, 2021
sunchao (Member, Author) commented Dec 8, 2021

Merged, thanks!

sunchao deleted the SPARK-37205 branch December 8, 2021 17:25
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…rn.am.tokenConfRegex' to support renewing delegation tokens in a multi-cluster environment (apache#1300)

This adds a new config `spark.yarn.am.tokenConfRegex` which is similar to `mapreduce.job.send-token-conf` introduced via [YARN-5910](https://issues.apache.org/jira/browse/YARN-5910). It is used for the YARN AM to pass Hadoop configs, such as `dfs.nameservices`, `dfs.ha.namenodes.*`, `dfs.namenode.rpc-address.*`, etc., to the RM for renewing delegation tokens.

[YARN-5910](https://issues.apache.org/jira/browse/YARN-5910) introduced a new config `mapreduce.job.send-token-conf`, which can be used to pass a job's local configuration to the RM, which then uses it when renewing delegation tokens. A typical use case is a YARN cluster that needs to talk to multiple HDFS clusters, where the RM may not have all the configs (e.g., `dfs.nameservices`, `dfs.ha.namenodes.<nameservice>.*`, `dfs.namenode.rpc-address.*`) needed to connect to those clusters when renewing delegation tokens. In this case, clients can use this feature to pass their local HDFS configs to the RM.

Yes, a new config `spark.yarn.am.tokenConfRegex` will be introduced to Spark users. By default it is disabled.

It seems difficult to come up with a unit test for this. I manually tested it against a YARN cluster with Hadoop version 3.x and it worked as expected.

```
$SPARK_HOME/bin/spark-shell --master yarn \
            --deploy-mode client \
            --conf spark.driver.extraClassPath="${HADOOP_CONF_DIR}" \
            --conf spark.executor.extraClassPath="${HADOOP_CONF_DIR}" \
            --conf spark.yarn.am.tokenConfRegex="^dfs.nameservices$|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$|^dfs.namenode.kerberos.principal|^dfs.namenode.kerberos.principal.pattern" \
            --conf spark.yarn.access.hadoopFileSystems="<HDFS_URI>"
```

Closes apache#34635 from sunchao/SPARK-37205.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Chao Sun <[email protected]>