[SPARK-18280][Core]Fix potential deadlock in `StandaloneSchedulerBackend.dead` #15775

zsxwing · 2016-11-04T22:15:05Z

What changes were proposed in this pull request?

"StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" blocks until all RPC threads exit, so calling it from inside an RPC thread makes that thread wait for itself and deadlocks.

This PR adds a thread-local flag inside RPC threads. SparkContext.stop uses it to decide whether to launch a new thread to stop the SparkContext.
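The idea can be illustrated with a minimal, self-contained sketch. This is not the patch itself: `StopDemo`, `isRpcThread`, `rpcPool`, `doStop` and `simulateDead` are hypothetical names, and a single-thread executor stands in for Spark's RPC threads. The sketch only shows how a thread-local flag lets a blocking `stop` move itself onto a fresh thread when it is invoked from a pool thread.

```scala
import java.util.concurrent.{CountDownLatch, Executors, ThreadFactory, TimeUnit}

// Hypothetical stand-in for the RPC layer plus SparkContext; all names are illustrative.
class StopDemo {
  // True only on threads created by the "RPC" pool below.
  private val isRpcThread: ThreadLocal[Boolean] = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }

  private val rpcPool = Executors.newSingleThreadExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(new Runnable {
        // Mark the thread before it starts running pool tasks.
        override def run(): Unit = { isRpcThread.set(true); r.run() }
      }, "rpc-thread")
      t.setDaemon(true)
      t
    }
  })

  // Released once the blocking shutdown has finished.
  val stopped = new CountDownLatch(1)

  // Waits for the pool's threads to exit, so it must not run on a pool thread.
  private def doStop(): Unit = {
    rpcPool.shutdown()
    rpcPool.awaitTermination(10, TimeUnit.SECONDS)
    stopped.countDown()
  }

  def stop(): Unit = {
    if (isRpcThread.get()) {
      // Called from an RPC thread: hand the blocking work to a fresh thread.
      val t = new Thread(new Runnable {
        override def run(): Unit = doStop()
      }, "stop-context")
      t.setDaemon(true)
      t.start()
    } else {
      doStop()
    }
  }

  // Simulates a callback such as `dead` being invoked on an RPC thread.
  def simulateDead(): Unit = rpcPool.execute(new Runnable {
    override def run(): Unit = stop()
  })
}
```

Without the flag check, `doStop` would run on the pool thread and `awaitTermination` would wait for the very thread that is executing it.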

How was this patch tested?

Jenkins

rxin · 2016-11-04T22:28:50Z

How can we prevent issues like this from happening again in the future? Is there an exception we could throw, or a linter rule we could add?

zsxwing · 2016-11-04T22:55:23Z

How about throwing an exception in sc.stop if the current thread is an RPC thread?
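A minimal sketch of that fail-fast idea, assuming a thread-local marker is available. `RpcThreadGuard`, `asRpcThread` and the exception message are hypothetical and are not Spark APIs; the blocking shutdown work is left out.

```scala
// Hypothetical guard; none of these names exist in Spark.
object RpcThreadGuard {
  private val onRpcThread: ThreadLocal[Boolean] = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }

  // Runs `body` with the current thread marked as an RPC thread.
  def asRpcThread[T](body: => T): T = {
    onRpcThread.set(true)
    try body finally onRpcThread.set(false)
  }

  // Fails fast instead of blocking forever when called from an RPC thread.
  def stop(): Unit = {
    if (onRpcThread.get()) {
      throw new IllegalStateException("stop() must not be called from an RPC thread")
    }
    // The blocking shutdown work would go here.
  }
}
```

The exception makes the misuse visible at the call site, but the caller still has to arrange for the stop to happen on another thread.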

SparkQA · 2016-11-05T00:57:20Z

Test build #68163 has finished for PR 15775 at commit 90e9090.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-11-05T01:16:41Z

Yea I think that's a good idea ... at least we will know.

rxin · 2016-11-05T01:17:05Z

core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala

      } finally {
        // Ensure the application terminates, as we can no longer run jobs.
-        sc.stop()
+        new Thread("stop-spark-context") {


also document why we need this.

zsxwing · 2016-11-07T18:24:31Z

I added a thread-local flag inside RPC threads. Instead of throwing an exception, SparkContext.stop will now automatically launch a new thread to stop the SparkContext.

I think this is better, since launching a new thread is how we would fix such an issue anyway.

SparkQA · 2016-11-07T20:01:51Z

Test build #68287 has finished for PR 15775 at commit d9c5626.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-11-07T20:04:39Z

retest this please

SparkQA · 2016-11-08T00:17:34Z

Test build #68297 has finished for PR 15775 at commit d9c5626.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-08T03:15:56Z

Test build #3418 has finished for PR 15775 at commit d9c5626.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-11-08T05:30:52Z

retest this please

SparkQA · 2016-11-08T07:50:09Z

Test build #68320 has finished for PR 15775 at commit d9c5626.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-11-08T21:14:30Z

Thanks! Merging to master, 2.1 and 2.0.

…kend.dead`

## What changes were proposed in this pull request?

"StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" blocks until all RPC threads exit, so calling it from inside an RPC thread deadlocks.

This PR adds a thread-local flag inside RPC threads. `SparkContext.stop` uses it to decide whether to launch a new thread to stop the SparkContext.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #15775 from zsxwing/SPARK-18280.

…Utils.tryOrStopSparkContext

## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #16178 from zsxwing/fix-stop-deadlock.

(cherry picked from commit 26432df)
Signed-off-by: Shixiong Zhu <[email protected]>
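The pattern behind a helper like `stopInNewThread` can be sketched as follows. This is an illustration under assumptions rather than Spark's code: `MiniContext` is a hypothetical stand-in for `SparkContext`, and its `stop` only releases a latch where the real method would block until other threads exit.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical stand-in for SparkContext; only the threading pattern matters here.
class MiniContext {
  val stopped = new CountDownLatch(1)

  // Stand-in for the blocking stop(); the real method waits for other threads to exit.
  def stop(): Unit = stopped.countDown()

  // Runs stop() on a fresh daemon thread so the calling thread never waits on itself.
  def stopInNewThread(): Unit = {
    val t = new Thread("stop-context") {
      override def run(): Unit = {
        // Qualify the call: a bare stop() here would resolve to Thread.stop().
        try MiniContext.this.stop()
        catch { case e: Throwable => System.err.println(s"stop failed: $e") }
      }
    }
    t.setDaemon(true)
    t.start()
  }
}
```

Callers that may run on a thread that `stop` waits for would call `stopInNewThread()` instead of `stop()`, which removes the need for a per-thread flag.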

Fix potential deadlock in StandaloneSchedulerBackend.dead

90e9090

rxin reviewed Nov 5, 2016

Add a thread local flag for RPC threads and run 'stop' in a new thread automatically

d9c5626

asfgit closed this in b6de0c9 Nov 8, 2016

zsxwing deleted the SPARK-18280 branch November 8, 2016 21:20

zsxwing mentioned this pull request Dec 6, 2016

[SPARK-18751][Core]Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext #16178

Closed
