Skip to content

Conversation

@tejasapatil
Copy link
Contributor

What changes were proposed in this pull request?

With the fix from SPARK-13112, I see that LaunchTask is always processed after RegisteredExecutor is done and so it gets chance to do all retries to startup an executor. There is still a problem that if Executor creation itself fails and there is some exception, it gets unnoticed and the executor is killed when it tries to process the LaunchTask as executor is null : https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88 So if one looks at the logs, it does not tell that there was problem during Executor creation and thats why it was killed.

This PR explicitly catches exception in Executor creation, logs a proper message and then exits the JVM. Also, I have changed the exitExecutor method to accept reason so that backends can use that reason and do stuff like logging to a DB to get an aggregate of such exits at a cluster level

How was this patch tested?

I am relying on existing tests

@SparkQA
Copy link

SparkQA commented Jul 14, 2016

Test build #62324 has finished for PR 14202 at commit 0c71699.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Copy link
Contributor Author

Hi @zsxwing , can you please review this diff ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tejasapatil do you know if all log frameworks can handle null throwable here? Perhaps it's better to handle it by ourselves like:

if (throwable != null) {
   logError(reason, throwable)
} else {
  logError(reason)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Did the change.

@zsxwing
Copy link
Member

zsxwing commented Jul 14, 2016

@tejasapatil Looks pretty good. Just one minor comment.

@tejasapatil tejasapatil force-pushed the exit_executor_failure branch from 0c71699 to 5499071 Compare July 15, 2016 17:29
@SparkQA
Copy link

SparkQA commented Jul 15, 2016

Test build #62381 has finished for PR 14202 at commit a4f80f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Copy link
Contributor Author

@zsxwing : I am done with the change(s) you suggested.

@zsxwing
Copy link
Member

zsxwing commented Jul 15, 2016

LGTM. Thanks! Merging to master and 2.0.

asfgit pushed a commit that referenced this pull request Jul 15, 2016
…e is an exception while creating an Executor

## What changes were proposed in this pull request?

With the fix from SPARK-13112, I see that `LaunchTask` is always processed after `RegisteredExecutor` is done and so it gets chance to do all retries to startup an executor. There is still a problem that if `Executor` creation itself fails and there is some exception, it gets unnoticed and the executor is killed when it tries to process the `LaunchTask` as `executor` is null : https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88 So if one looks at the logs, it does not tell that there was problem during `Executor` creation and thats why it was killed.

This PR explicitly catches exception in `Executor` creation, logs a proper message and then exits the JVM. Also, I have changed the `exitExecutor` method to accept `reason` so that backends can use that reason and do stuff like logging to a DB to get an aggregate of such exits at a cluster level

## How was this patch tested?

I am relying on existing tests

Author: Tejas Patil <[email protected]>

Closes #14202 from tejasapatil/exit_executor_failure.

(cherry picked from commit b2f24f9)
Signed-off-by: Shixiong Zhu <[email protected]>
@asfgit asfgit closed this in b2f24f9 Jul 15, 2016
zzcclp added a commit to zzcclp/spark that referenced this pull request Jul 18, 2016
@tejasapatil tejasapatil deleted the exit_executor_failure branch August 9, 2016 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants