Skip to content

Conversation

@zsxwing
Copy link
Member

@zsxwing zsxwing commented Nov 4, 2016

What changes were proposed in this pull request?

"StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock.

This PR add a thread local flag inside RPC threads. SparkContext.stop uses it to decide if launching a new thread to stop the SparkContext.

How was this patch tested?

Jenkins

@rxin
Copy link
Contributor

rxin commented Nov 4, 2016

How can we prevent issues like this from happening again in the future? Any exception we can throw better or linter to do?

@zsxwing
Copy link
Member Author

zsxwing commented Nov 4, 2016

How about throw an exception if the current thread is an RPC thread in sc.stop?

@SparkQA
Copy link

SparkQA commented Nov 5, 2016

Test build #68163 has finished for PR 15775 at commit 90e9090.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Nov 5, 2016

Yea I think that's a good idea ... at least we will know.

} finally {
// Ensure the application terminates, as we can no longer run jobs.
sc.stop()
new Thread("stop-spark-context") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also document why we need this.

@zsxwing
Copy link
Member Author

zsxwing commented Nov 7, 2016

I added a thread local flag inside RPC threads. Instead of throwing an exception, SparkContext.stop will always launch a new thread to stop SparkContext.

I think this is better since we always launch a new thread to fix such issue.

@SparkQA
Copy link

SparkQA commented Nov 7, 2016

Test build #68287 has finished for PR 15775 at commit d9c5626.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Nov 7, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68297 has finished for PR 15775 at commit d9c5626.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #3418 has finished for PR 15775 at commit d9c5626.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Nov 8, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68320 has finished for PR 15775 at commit d9c5626.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Nov 8, 2016

Thanks! Merging to master, 2.1 and 2.0.

@asfgit asfgit closed this in b6de0c9 Nov 8, 2016
asfgit pushed a commit that referenced this pull request Nov 8, 2016
…kend.dead`

## What changes were proposed in this pull request?

"StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock.

This PR add a thread local flag inside RPC threads. `SparkContext.stop` uses it to decide if launching a new thread to stop the SparkContext.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #15775 from zsxwing/SPARK-18280.
asfgit pushed a commit that referenced this pull request Nov 8, 2016
…kend.dead`

## What changes were proposed in this pull request?

"StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock.

This PR add a thread local flag inside RPC threads. `SparkContext.stop` uses it to decide if launching a new thread to stop the SparkContext.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #15775 from zsxwing/SPARK-18280.
@zsxwing zsxwing deleted the SPARK-18280 branch November 8, 2016 21:20
asfgit pushed a commit that referenced this pull request Dec 8, 2016
…Utils.tryOrStopSparkContext

## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #16178 from zsxwing/fix-stop-deadlock.

(cherry picked from commit 26432df)
Signed-off-by: Shixiong Zhu <[email protected]>
asfgit pushed a commit that referenced this pull request Dec 8, 2016
…Utils.tryOrStopSparkContext

## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #16178 from zsxwing/fix-stop-deadlock.
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
…Utils.tryOrStopSparkContext

## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in apache#15775 since they are not necessary now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes apache#16178 from zsxwing/fix-stop-deadlock.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…kend.dead`

## What changes were proposed in this pull request?

"StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock.

This PR add a thread local flag inside RPC threads. `SparkContext.stop` uses it to decide if launching a new thread to stop the SparkContext.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes apache#15775 from zsxwing/SPARK-18280.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…Utils.tryOrStopSparkContext

## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in apache#15775 since they are not necessary now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes apache#16178 from zsxwing/fix-stop-deadlock.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants