[SPARK-16925] Master should call schedule() after all executor exit events, not only failures #14510
Closed
Member
LGTM

Test build #63286 has finished for PR 14510 at commit

Contributor (Author)
I'm going to merge this to master, branch-2.0, and branch-1.6. I have a follow-up patch to add configuration options for controlling the "remove application that has experienced too many back-to-back executor failures" code path, which I'll submit tomorrow.
asfgit pushed a commit that referenced this pull request on Aug 7, 2016:
[SPARK-16925] Master should call schedule() after all executor exit events, not only failures
## What changes were proposed in this pull request?
This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes.
As an example of the bug, run
```
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
```
on a standalone cluster which has a single Spark application. This will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call.
This patch addresses this by modifying the handling of `ExecutorStateChanged` to always call `schedule()`, unconditionally. This change should be safe because calling `schedule()` is always safe; extra `schedule()` calls can only affect performance and should not introduce correctness bugs.
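As an illustration of the change, here is a simplified, self-contained Scala sketch of the scheduling decision before and after the fix. It is not the actual `Master.scala` code; `ExecutorExit` and the `reschedule` callback are hypothetical stand-ins for the Master's `ExecutorStateChanged` handling.
```
// Simplified sketch only -- not the actual Master.scala implementation.
// ExecutorExit and the reschedule callback are hypothetical stand-ins for
// the Master's ExecutorStateChanged handling.
object SchedulingSketch {
  final case class ExecutorExit(appId: String, execId: Int, exitStatus: Option[Int])

  // Before the fix: only an abnormal (non-zero) exit triggered rescheduling,
  // so executors killed by a clean System.exit(0) were never replaced.
  def handleExitBefore(exit: ExecutorExit, reschedule: () => Unit): Unit = {
    val normalExit = exit.exitStatus.contains(0)
    if (!normalExit) {
      reschedule()
    }
  }

  // After the fix: every executor exit event triggers a reschedule.
  // Extra reschedule calls cost only a little work and cannot hurt correctness.
  def handleExitAfter(exit: ExecutorExit, reschedule: () => Unit): Unit = {
    reschedule()
  }

  def main(args: Array[String]): Unit = {
    val cleanExit = ExecutorExit("app-20160807-0000", 0, exitStatus = Some(0))
    handleExitBefore(cleanExit, () => println("before: rescheduled")) // prints nothing
    handleExitAfter(cleanExit, () => println("after: rescheduled"))   // prints "after: rescheduled"
  }
}
```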
## How was this patch tested?
I added a regression test in `DistributedSuite`.
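For illustration only, a regression test in this spirit might look like the sketch below; it is not the actual test added to `DistributedSuite`. It assumes ScalaTest's `AnyFunSuite` and a `local-cluster` master URL so that the standalone Master's scheduling path is exercised.
```
// Hypothetical regression-test sketch (not the actual DistributedSuite test).
// Assumes a local-cluster master and ScalaTest's AnyFunSuite.
import org.apache.spark.{SparkContext, SparkException}
import org.scalatest.funsuite.AnyFunSuite

class CleanExecutorExitSuite extends AnyFunSuite {
  test("executors are replaced after exiting with code 0") {
    val sc = new SparkContext("local-cluster[2,1,1024]", "clean-exit-test")
    try {
      // Kill every executor JVM with a clean exit code; the job itself fails
      // once the task has exhausted its retries.
      intercept[SparkException] {
        sc.parallelize(1 to 2, 2).foreachPartition { _ => System.exit(0) }
      }
      // With the fix, the Master reschedules on those clean exits, so
      // replacement executors come up and a follow-up job still succeeds.
      assert(sc.parallelize(1 to 10, 2).count() === 10)
    } finally {
      sc.stop()
    }
  }
}
```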
Author: Josh Rosen <[email protected]>
Closes #14510 from JoshRosen/SPARK-16925.
(cherry picked from commit 4f5f9b6)
Signed-off-by: Josh Rosen <[email protected]>
asfgit pushed a commit that referenced this pull request on Aug 7, 2016: [SPARK-16925] Master should call schedule() after all executor exit events, not only failures (same commit message as above, cherry picked from commit 4f5f9b6).
zzcclp pushed a commit to zzcclp/spark that referenced this pull request on Aug 8, 2016: [SPARK-16925] Master should call schedule() after all executor exit events, not only failures (same commit message as above, cherry picked from commits 4f5f9b6 and c162886).