
Conversation

@zsxwing (Member) commented Nov 28, 2020:

What changes were proposed in this pull request?

Currently we kill the executor when hitting a fatal error. However, if the fatal error is wrapped by another exception, we still keep the executor running. This PR changes the executor to also detect fatal errors nested in the exception chain (up to a configurable depth) and kill itself when one is found.
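To make the wrapping concrete, here is a minimal illustration (mine, not from the PR) of why a top-level-only check misses a nested fatal error: `scala.util.control.NonFatal` classifies the wrapper as non-fatal even though its cause is an `OutOfMemoryError`.

```scala
import scala.util.control.NonFatal

// A fatal OutOfMemoryError hidden behind a non-fatal wrapper exception.
val nested = new RuntimeException("wrapper", new OutOfMemoryError("simulated"))

println(!NonFatal(nested))          // false: the wrapper alone looks non-fatal
println(!NonFatal(nested.getCause)) // true: the cause is fatal
```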

Why are the changes needed?

Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be left in a broken state after hitting one, and it's hard to predict the behavior of a broken component. Hence, it's better to detect nested fatal errors as well and kill the executor, so that we can rely on Spark's fault tolerance to recover.

Does this PR introduce any user-facing change?

Yes. There is a slight internal behavior change in when an executor is killed: the executor is now killed when a nested fatal error is detected in the exception chain. A new configuration, spark.executor.killOnFatalError.depth, is added to allow users to turn this behavior off if the change impacts them.
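For illustration, a hedged sketch of how a user might set the new config via `SparkConf`; the value 5 follows the default discussed in the review below, and the behavior of smaller values is an assumption, not confirmed by this page.

```scala
import org.apache.spark.SparkConf

// Search up to 5 levels of the exception chain for a fatal error.
// Assumption: smaller values limit the search (e.g. 1 would check only the
// top-level exception) and 0 would disable the check entirely.
val conf = new SparkConf()
  .set("spark.executor.killOnFatalError.depth", "5")
```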

How was this patch tested?

The new method Executor.isFatalError is covered by the new unit test "SPARK-33587: isFatalError".

@zsxwing zsxwing requested a review from jiangxb1987 November 28, 2020 18:26
@github-actions github-actions bot added the CORE label Nov 28, 2020
}

test("SPARK-33587: isFatalError") {
  def errorInThreadPool(e: => Throwable): Throwable = {
@zsxwing (Member Author) commented:
Trying to make this test cover the cases I mentioned in the description.
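A sketch of what such a helper could look like (assumed from the signature in the diff above, not the merged test code): run the throwing block in a thread pool so the caller observes the error wrapped in a `java.util.concurrent.ExecutionException`.

```scala
import java.util.concurrent.{Callable, ExecutionException, Executors}

// Assumed sketch: the returned Throwable is an ExecutionException whose
// cause is the original error `e`, i.e. a nested (wrapped) fatal error.
def errorInThreadPool(e: => Throwable): Throwable = {
  val pool = Executors.newSingleThreadExecutor()
  try {
    pool.submit(new Callable[Unit] {
      override def call(): Unit = throw e
    }).get()
    throw new AssertionError("the submitted task should have failed")
  } catch {
    case wrapped: ExecutionException => wrapped
  } finally {
    pool.shutdown()
  }
}
```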

 * This is to avoid `StackOverflowError` when hitting a cycle in the exception chain.
 */
def isFatalError(t: Throwable, shouldDetectNestedFatalError: Boolean, depth: Int = 0): Boolean = {
  if (depth <= 5) {
@zsxwing (Member Author) commented:

I picked 5, which should be enough to cover most cases.

A reviewer (Member) commented:

Maybe just create a config with that default value, instead of the boolean config spark.executor.killOnNestedFatalError and this magic number?

@zsxwing (Member Author) replied:

Good point!

def isFatalError(t: Throwable, shouldDetectNestedFatalError: Boolean, depth: Int = 0): Boolean = {
  if (depth <= 5) {
    t match {
      case _: SparkOutOfMemoryError => false
A reviewer (Member) commented:

Just in case: are we sure that an OOM cannot be caused by a fatal error, and cannot be present somewhere deeper in the chain?

@zsxwing (Member Author) replied:

This is existing behavior. #20014 added SparkOutOfMemoryError to avoid killing the executor when the OOM is not thrown by the JVM.
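Putting the pieces together, a hedged sketch of the detection logic, reconstructed from the excerpts above (not the exact merged code): walk the cause chain and suppressed exceptions up to a maximum depth, treating SparkOutOfMemoryError as non-fatal because it is thrown by Spark rather than the JVM.

```scala
import scala.util.control.NonFatal
import org.apache.spark.memory.SparkOutOfMemoryError

// Sketch only: the merged code derives the depth limit from
// spark.executor.killOnFatalError.depth rather than a hard-coded value.
def isFatalError(t: Throwable, depthToCheck: Int): Boolean = {
  if (depthToCheck <= 0 || t == null) {
    false // the depth limit also guards against cycles in the exception chain
  } else {
    t match {
      case _: SparkOutOfMemoryError =>
        false // thrown by Spark itself, so the executor is still healthy
      case e if !NonFatal(e) =>
        true // a genuinely fatal error at this level
      case e =>
        // Recurse into the cause and any suppressed exceptions.
        isFatalError(e.getCause, depthToCheck - 1) ||
          e.getSuppressed.exists(isFatalError(_, depthToCheck - 1))
    }
  }
}
```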

@SparkQA

SparkQA commented Nov 28, 2020

Test build #131910 has finished for PR 30528 at commit 4156c03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2020

Test build #131915 has finished for PR 30528 at commit 2720b60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2020

Test build #131916 has finished for PR 30528 at commit 1ec1c1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2020

Test build #131918 has finished for PR 30528 at commit 312f042.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131918/

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @zsxwing and @MaxGekk.
Merged to master for Apache Spark 3.1.0.

@zsxwing zsxwing deleted the SPARK-33587 branch November 29, 2020 22:21