[SPARK-16230] [CORE] CoarseGrainedExecutorBackend to self kill if there is an exception while creating an Executor #14202
Conversation
Test build #62324 has finished for PR 14202 at commit
Hi @zsxwing, can you please review this diff?
@tejasapatil do you know if all log frameworks can handle a null throwable here? Perhaps it's better to handle it ourselves, like:

```scala
if (throwable != null) {
  logError(reason, throwable)
} else {
  logError(reason)
}
```
Ok. Did the change.
@tejasapatil Looks pretty good. Just one minor comment.
Force-pushed …ile creating an Executor from 0c71699 to 5499071
Test build #62381 has finished for PR 14202 at commit
@zsxwing: I am done with the changes you suggested.
LGTM. Thanks! Merging to master and 2.0.
…e is an exception while creating an Executor

## What changes were proposed in this pull request?

With the fix from SPARK-13112, I see that `LaunchTask` is always processed after `RegisteredExecutor` is done, so it gets a chance to do all retries to start up an executor. There is still a problem: if `Executor` creation itself fails with an exception, the failure goes unnoticed, and the executor is killed when it tries to process the `LaunchTask` because `executor` is null: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88 So if one looks at the logs, they do not show that there was a problem during `Executor` creation and that that is why it was killed.

This PR explicitly catches exceptions during `Executor` creation, logs a proper message, and then exits the JVM. I have also changed the `exitExecutor` method to accept a `reason` so that backends can use that reason, e.g. logging to a DB to get an aggregate of such exits at a cluster level.

## How was this patch tested?

I am relying on existing tests.

Author: Tejas Patil <[email protected]>

Closes #14202 from tejasapatil/exit_executor_failure.

(cherry picked from commit b2f24f9)
Signed-off-by: Shixiong Zhu <[email protected]>
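The flow the description outlines (catch the exception during `Executor` construction, log a clear reason, then terminate the JVM) can be sketched roughly as below. This is a simplified illustration of the pattern, not the actual Spark source: the `exitExecutor` signature and the surrounding message-handling scaffolding are assumptions for the sake of the example.

```scala
// Sketch of the pattern in CoarseGrainedExecutorBackend's receive handler
// (hypothetical simplification, not the real Spark code).
import scala.util.control.NonFatal

case RegisteredExecutor =>
  logInfo("Successfully registered with driver")
  try {
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
  } catch {
    case NonFatal(e) =>
      // Before this PR the exception went unnoticed and the executor only
      // died later, on LaunchTask, with a misleading "executor is null" error.
      // Now we exit immediately with an explicit reason and the throwable.
      exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
  }
```

Passing the `reason` (and the throwable) into `exitExecutor` is what lets a backend subclass do more than log, for example record the exit in an external store to aggregate failures across the cluster, as the description suggests.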