
Conversation

@jerryshao
Contributor

What changes were proposed in this pull request?

For the details of the exception please see SPARK-24062.

The issue is:

Spark on YARN stores the SASL secret in the current UGI's credentials. These credentials are distributed to the AM and the executors, so that the executors and the driver share the same secret to communicate. But STS/Hive library code refreshes the current UGI via UGI's loginUserFromKeytab() after the Spark application has started, which creates a new UGI in the current driver's context with empty tokens and secret keys. The secret key is therefore lost from the current context's UGI, and that is why the Spark driver throws a "secret key not found" exception.
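
For illustration, a minimal Scala sketch of this failure mode, assuming a Kerberos-enabled Hadoop configuration (the lookup key name, principal, and keytab path are placeholders, and this is not Spark's actual code; only the Hadoop UserGroupInformation/Credentials calls are real APIs):

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

object SecretLossSketch {
  // Hypothetical lookup key name, for illustration only.
  private val SecretKey = new Text("sparkSaslSecret")

  private def lookup(): Array[Byte] =
    UserGroupInformation.getCurrentUser.getCredentials.getSecretKey(SecretKey)

  def main(args: Array[String]): Unit = {
    // Driver start-up: the SASL secret is attached to the current UGI's
    // credentials; these are what get shipped to the AM and executors.
    val creds = new Credentials()
    creds.addSecretKey(SecretKey, "app-secret".getBytes("UTF-8"))
    UserGroupInformation.getCurrentUser.addCredentials(creds)
    assert(lookup() != null)  // secret is visible through the current UGI

    // STS/Hive later re-logs in from a keytab (placeholder values). This
    // replaces the static login user with a brand-new UGI whose credentials
    // are empty; the stored secret key is not carried over.
    UserGroupInformation.loginUserFromKeytab(
      "hive/[email protected]", "/etc/security/keytabs/hive.keytab")

    // A lookup against the refreshed UGI now comes back empty: this is the
    // "secret key not found" exception the driver throws.
    assert(lookup() == null)
  }
}
```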

In the Spark 2.2 code, Spark also stores this secret key in a class variable of SecurityManager, so even if the UGI is refreshed, the secret still exists in that object, and STS with SASL still works in Spark 2.2. But in Spark 2.3 we always look the key up in the current UGI, which makes it fail to work in Spark 2.3.

To fix this issue, there are two possible solutions:

  1. Fix it in the STS/Hive library: when the UGI is refreshed, copy the secret key from the original UGI to the new one. The difficulty is that some of the code that refreshes the UGI lives in the Hive library, which makes it hard for us to change.
  2. Roll back the logic in SecurityManager to match Spark 2.2, so that this issue is fixed.

The 2nd solution seems the simpler one, so this PR takes the 2nd solution.
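
For illustration, a minimal Scala sketch of the difference between the two lookup strategies (the class, key name, and members below are simplified stand-ins for this discussion, not the actual SecurityManager code):

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Simplified sketch; names are illustrative, not Spark's real internals.
class SecretHolderSketch(lookupKey: Text) {

  // Spark 2.2-style: generate the secret once and keep it in a field, so it
  // survives even if STS/Hive later swaps in a fresh UGI. It is also placed
  // into the UGI credentials so that YARN distributes it to the executors.
  private val cachedSecret: Array[Byte] = {
    val secret = java.util.UUID.randomUUID().toString.getBytes("UTF-8")
    val creds = new Credentials()
    creds.addSecretKey(lookupKey, secret)
    UserGroupInformation.getCurrentUser.addCredentials(creds)
    secret
  }

  // 2.2-style lookup: the field is the source of truth, immune to UGI swaps.
  def getSecretKeyCached: Array[Byte] = cachedSecret

  // 2.3-style lookup: re-read from the *current* UGI on every call; this is
  // what breaks once loginUserFromKeytab() has replaced the login user.
  def getSecretKeyFromUgi: Array[Byte] =
    UserGroupInformation.getCurrentUser.getCredentials.getSecretKey(lookupKey)
}
```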

How was this patch tested?

Verified in local cluster.

CC @vanzin @tgravescs, please help review. Thanks!

@SparkQA

SparkQA commented Apr 24, 2018

Test build #89773 has finished for PR 21138 at commit 0077685.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Apr 24, 2018

IMO the fix would be to not do UGI.loginUserFromKeytab in HiveClientImpl, or rather, to do it only if it is absolutely necessary.
I will share the snippet with you @jerryshao; this is something I have fixed in the past (invoking UGI.login multiple times, not this specific bug, though it is a manifestation of the same problem).

@jerryshao
Contributor Author

Hi @mridulm, thanks a lot for your comments.

UGI.loginUserFromKeytab no longer exists in Spark 2.3+ (dc2714d#diff-6fd847124f8eae45ba2de1cf7d6296fe).

Actually it is the code in the Thrift server, and somewhere else in the Hive library, that calls UGI.loginUserFromKeytab; note that here the principal and keytab are Hive's, so I'm not sure we can remove this.

PS. I saw two "Login successful" messages from UGI in the Thrift server log, but I can only find one login in the Thrift server code, so I assume the other one is in the Hive library.

Yes, I agree with you: ideally we should not log in from keytab unnecessarily. But thinking of the Thrift server as a Spark application, it doesn't know the context of Spark's UGI and logs in to refresh the UGI in its own context; it seems we cannot prevent users from doing that at the user layer. So I think my fix can work around this issue, though it may not be the most elegant fix.

@mridulm
Contributor

mridulm commented Apr 25, 2018

@jerryshao As we discussed in person, HiveClientImpl was one place this was happening (now fixed; git pull delays, my bad).
The other is in the HadoopThriftAuthBridge we rely on: createServer always does a UGI.loginUserFromKeytab, which changes the static loginUser in UGI (even though the principal and keytab are the same).
This results in:

  • The current bug, where the secret tokens are now gone.
  • A long-running STS in a secure cluster failing (due to Hadoop IPC failures).
    • This is the reason why I found and had worked on this, though for the 1.6 branch.
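
For reference, a minimal Scala sketch of the kind of guard that would avoid the redundant re-login described above (the helper and its placement are hypothetical, not HadoopThriftAuthBridge's actual code; the UserGroupInformation calls are real Hadoop APIs, and Kerberos is assumed to be enabled):

```scala
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical guard: only re-login when the current login user does not
// already match the requested principal. Without such a guard, every
// createServer() call replaces the static loginUser in UGI and drops any
// credentials (such as the SASL secret) attached to the previous one.
object GuardedLogin {
  def loginIfNeeded(principal: String, keytab: String): UserGroupInformation = {
    val alreadyLoggedIn =
      UserGroupInformation.isLoginKeytabBased &&
        UserGroupInformation.getLoginUser.getUserName == principal
    if (!alreadyLoggedIn) {
      UserGroupInformation.loginUserFromKeytab(principal, keytab)
    }
    UserGroupInformation.getLoginUser
  }
}
```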

@jerryshao
Contributor Author

@mridulm I would treat the current fix as a workaround for the SASL issue, since it is a regression in 2.3.

For the UGI refreshing issue (which mainly causes long-running STS failures, and also leads to the SASL failure here), I think we can create a separate JIRA to fix it, since that one is not a regression.

What do you think?

@mridulm
Contributor

mridulm commented Apr 25, 2018

Sounds good to me; any thoughts, @vanzin (since you changed this last)?

@vanzin
Contributor

vanzin commented Apr 25, 2018

I'm fine with the fix. I'm not familiar enough with the internals of STS/Hive to suggest anything different.

@jerryshao
Contributor Author

Thanks for the review @mridulm @vanzin. Let me test again; I will merge once the tests pass.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 26, 2018

Test build #89864 has finished for PR 21138 at commit 0077685.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Merging to master and branch 2.3.

asfgit pushed a commit that referenced this pull request Apr 26, 2018
… in thrift server


Author: jerryshao <[email protected]>

Closes #21138 from jerryshao/SPARK-24062.

(cherry picked from commit ffaf0f9)
Signed-off-by: jerryshao <[email protected]>
asfgit closed this in ffaf0f9 Apr 26, 2018
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
… in thrift server


Author: jerryshao <[email protected]>

Closes apache#21138 from jerryshao/SPARK-24062.

(cherry picked from commit ffaf0f9)
Signed-off-by: jerryshao <[email protected]>

Change-Id: I7696fedd013e4c0981753bd4feeffaf4dd45dc7f
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
… in thrift server (apache#271)


Author: jerryshao <[email protected]>

Closes apache#21138 from jerryshao/SPARK-24062.

(cherry picked from commit ffaf0f9)
Signed-off-by: jerryshao <[email protected]>