Skip to content

Conversation

@manuzhang
Copy link
Member

@manuzhang manuzhang commented May 25, 2022

What changes were proposed in this pull request?

Fix precedence of configs of Hadoop Filesystems to access.

Before this PR

spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems

After this PR

spark.kerberos.access.hadoopFileSystems ->  spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes

Why are the changes needed?

Before #23698, the precedence of configuring Hadoop Filesystems to access is

spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes

Afterwards, it's

spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems

When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes are configured with different values, the PR will break backward compatibility and cause application failure.

Does this PR introduce any user-facing change?

Yes. Fix backward compatibility.

How was this patch tested?

Updated UT.

@github-actions github-actions bot added the CORE label May 25, 2022
@manuzhang
Copy link
Member Author

cc @gaborgsomogyi @vanzin

@gaborgsomogyi
Copy link
Contributor

Let's take a look at how the configs were evolving:

  • spark.yarn.access.namenodes introduced
  • spark.yarn.access.hadoopFileSystems introduced so spark.yarn.access.namenodes deprecated
  • spark.kerberos.access.hadoopFileSystems introduced so spark.yarn.access.hadoopFileSystems deprecated

So from my perspective the correct order is:

  • spark.yarn.access.namenodes must be overwritten by spark.yarn.access.hadoopFileSystems
  • spark.yarn.access.namenodes and spark.yarn.access.hadoopFileSystems must be overwritten by spark.kerberos.access.hadoopFileSystems

I understand that it was different previously but I wouldn't change it because of the following reasons:

@manuzhang
Copy link
Member Author

newer configs must take precedence

This is not true currently. Meanwhile, deprecated configs are not removed yet and as long as the configs are used they should not break users' applications.

conf.set("spark.yarn.access.namenodes", "testNode")
assert(conf.get(KERBEROS_FILESYSTEMS_TO_ACCESS) === Array("testNode"))

conf.set("spark.yarn.access.hadoopFileSystems", "testNode")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to point out this test is meaningless before this change.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 24, 2022
@github-actions github-actions bot closed this Sep 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants