Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

@HeartSaVioR HeartSaVioR commented Sep 11, 2019

What changes were proposed in this pull request?

This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.

Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)

Why are the changes needed?

The bug brings NPE.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE.

@HeartSaVioR HeartSaVioR changed the title [SPARK-29046] Fix NPE in SQLConf.get when active SparkContext is stopping [SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping Sep 11, 2019
@SparkQA
Copy link

SparkQA commented Sep 11, 2019

Test build #110468 has finished for PR 25753 at commit 4266ee0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
val isSchedulerEventLoopThread = SparkContext.getActive
.map(_.dagScheduler.eventProcessLoop.eventThread)
.flatMap(sc => Option(sc.dagScheduler))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we use { sc => ... } for closures. Also later.

@SparkQA
Copy link

SparkQA commented Sep 11, 2019

Test build #110485 has finished for PR 25753 at commit a1472e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110489 has finished for PR 25753 at commit a1472e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@HeartSaVioR
Copy link
Contributor Author

Thanks all for reviewing and merging!

SQLConf.get
}
} {
oldSparkContext.orElse(Some(null)).foreach(SparkContext.setActiveContext)
Copy link
Member

@dongjoon-hyun dongjoon-hyun Sep 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, All.

Ur, I'm wondering if this PR's test case is safe in concurrent testing environments?
This might be the root cause of the current outage in Apache Spark Jenkins jobs.
After this PR, all Jenkins on master branch never succeeds.

Could you take a look a little bit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is late Friday, just revert this if you thinks it's breaking things.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Sep 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I've been investigating this and the other commits. But, this is the usual suspect given the error message. Since master branch has been broken over 2 days and it seems that we cannot fix this during weekend, I'll revert this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding. I haven't noticed concurrent tests are executed, as there're many tests leveraging single object like SparkContext. I'll just exclude test code in new PR as I don't see other way to test this.

Btw, do we run different options between PR and master build? I'm curious about the reason - as we tend to guess PR build pass as "OK".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is probably Maven vs SBT testing differences. I personally would prefer to standardize on the Maven build for everything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeartSaVioR . PRBuilder is the same. PRBuilder has been very flaky last two days. I hit the same failure on my independent PRs too many times. That was the reason I investigated this and reached here to ask help.

@srowen . According to the Jenkins status, All Jenkins job combinations failed. This is irrelevant to SBT and Maven difference.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 14, 2019

Hi, All.
Since Jenkins will be reset at midnight, I'll revert this after Jenkins reset around Tomorrow 00:15AM. Please let me know if you think you can recover Jenkins with another method.

@dongjoon-hyun
Copy link
Member

FYI, this is reverted via 13b77e5 .

@dongjoon-hyun
Copy link
Member

After reverting, Jenkins jobs are recovered. Please make another PR for this.

@HyukjinKwon
Copy link
Member

Thanks all for taking care of this.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
… stopping

# What changes were proposed in this pull request?

This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.

Note that it can't be encountered easily as `SparkContext.stop()` blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)

### Why are the changes needed?

The bug brings NPE.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE.

Closes apache#25753 from HeartSaVioR/SPARK-29046.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Sep 15, 2019
… stopping

### What changes were proposed in this pull request?

This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.

Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)

### Why are the changes needed?

The bug brings NPE.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Previous patch #25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch.

Closes #25790 from HeartSaVioR/SPARK-29046-v2.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HeartSaVioR added a commit to HeartSaVioR/spark that referenced this pull request Sep 16, 2019
… stopping

### What changes were proposed in this pull request?

This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.

Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)

### Why are the changes needed?

The bug brings NPE.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Previous patch apache#25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch.

Closes apache#25790 from HeartSaVioR/SPARK-29046-v2.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Sep 17, 2019
…xt is stopping

### What changes were proposed in this pull request?

This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.

Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)

### Why are the changes needed?

The bug brings NPE.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Previous patch #25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch.

Closes #25798 from HeartSaVioR/SPARK-29046-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants