- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping #25753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Test build #110468 has finished for PR 25753 at commit  
 | 
| } else { | ||
| val isSchedulerEventLoopThread = SparkContext.getActive | ||
| .map(_.dagScheduler.eventProcessLoop.eventThread) | ||
| .flatMap(sc => Option(sc.dagScheduler)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we use { sc => ... } for closures. Also later.
| Test build #110485 has finished for PR 25753 at commit  
 | 
| retest this, please | 
| Test build #110489 has finished for PR 25753 at commit  
 | 
| Merged to master. | 
| Thanks all for reviewing and merging! | 
| SQLConf.get | ||
| } | ||
| } { | ||
| oldSparkContext.orElse(Some(null)).foreach(SparkContext.setActiveContext) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, All.
Ur, I'm wondering if this PR's test case is safe in concurrent testing environments?
This might be the root cause of the current outage in Apache Spark Jenkins jobs.
After this PR, all Jenkins on master branch never succeeds.
Could you take a look a little bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is late Friday, just revert this if you thinks it's breaking things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I've been investigating this and the other commits. But, this is the usual suspect given the error message. Since master branch has been broken over 2 days and it seems that we cannot fix this during weekend, I'll revert this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding. I haven't noticed concurrent tests are executed, as there're many tests leveraging single object like SparkContext. I'll just exclude test code in new PR as I don't see other way to test this.
Btw, do we run different options between PR and master build? I'm curious about the reason - as we tend to guess PR build pass as "OK".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is probably Maven vs SBT testing differences. I personally would prefer to standardize on the Maven build for everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeartSaVioR . PRBuilder is the same. PRBuilder has been very flaky last two days. I hit the same failure on my independent PRs too many times. That was the reason I investigated this and reached here to ask help.
@srowen . According to the Jenkins status, All Jenkins job combinations failed. This is irrelevant to SBT and Maven difference.
| Hi, All. | 
| FYI, this is reverted via 13b77e5 . | 
| After reverting, Jenkins jobs are recovered. Please make another PR for this. | 
| Thanks all for taking care of this. | 
… stopping # What changes were proposed in this pull request? This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping. Note that it can't be encountered easily as `SparkContext.stop()` blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.) ### Why are the changes needed? The bug brings NPE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE. Closes apache#25753 from HeartSaVioR/SPARK-29046. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
… stopping ### What changes were proposed in this pull request? This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping. Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.) ### Why are the changes needed? The bug brings NPE. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Previous patch #25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch. Closes #25790 from HeartSaVioR/SPARK-29046-v2. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
… stopping ### What changes were proposed in this pull request? This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping. Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.) ### Why are the changes needed? The bug brings NPE. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Previous patch apache#25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch. Closes apache#25790 from HeartSaVioR/SPARK-29046-v2. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…xt is stopping ### What changes were proposed in this pull request? This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping. Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.) ### Why are the changes needed? The bug brings NPE. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Previous patch #25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch. Closes #25798 from HeartSaVioR/SPARK-29046-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.
Note that it can't be encountered easily as
SparkContext.stop()blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)Why are the changes needed?
The bug brings NPE.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE.