
Conversation

@zsxwing (Member) commented Nov 28, 2020:

What changes were proposed in this pull request?

Currently we kill the executor when hitting a fatal error. However, if the fatal error is wrapped by another exception, we still keep the executor running. This PR changes the executor to also detect fatal errors nested in the exception chain (up to a configurable depth) and kill itself when one is found.
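To make the wrapping concrete, here is a minimal illustration (mine, not from the PR) of why a top-level-only check misses a nested fatal error: `scala.util.control.NonFatal` classifies the wrapper as non-fatal even though its cause is an `OutOfMemoryError`.

```scala
import scala.util.control.NonFatal

// A fatal OutOfMemoryError hidden behind a non-fatal wrapper exception.
val nested = new RuntimeException("wrapper", new OutOfMemoryError("simulated"))

println(!NonFatal(nested))          // false: the wrapper alone looks non-fatal
println(!NonFatal(nested.getCause)) // true: the cause is fatal
```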

Why are the changes needed?

Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be left in a broken state after hitting one, and it's hard to predict the behavior of a broken component. Hence, it's better to detect nested fatal errors as well and kill the executor, so that we can rely on Spark's fault tolerance to recover.

Does this PR introduce any user-facing change?

Yes. There is a slight internal behavior change in when an executor is killed: the executor is now killed when a nested fatal error is detected in the exception chain. A new configuration, spark.executor.killOnFatalError.depth, is added to allow users to turn this behavior off if the change impacts them.
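For illustration, a hedged sketch of how a user might set the new config via `SparkConf`; the value 5 follows the default discussed in the review below, and the behavior of smaller values is an assumption, not confirmed by this page.

```scala
import org.apache.spark.SparkConf

// Search up to 5 levels of the exception chain for a fatal error.
// Assumption: smaller values limit the search (e.g. 1 would check only the
// top-level exception) and 0 would disable the check entirely.
val conf = new SparkConf()
  .set("spark.executor.killOnFatalError.depth", "5")
```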

How was this patch tested?

The new method Executor.isFatalError is covered by the new unit test "SPARK-33587: isFatalError".

@zsxwing zsxwing requested a review from jiangxb1987 November 28, 2020 18:26
@github-actions github-actions bot added the CORE label Nov 28, 2020
}

test("SPARK-33587: isFatalError") {
  def errorInThreadPool(e: => Throwable): Throwable = {
@zsxwing (Member Author) commented:
Trying to make this test cover the cases I mentioned in the description.
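A sketch of what such a helper could look like (assumed from the signature in the diff above, not the merged test code): run the throwing block in a thread pool so the caller observes the error wrapped in a `java.util.concurrent.ExecutionException`.

```scala
import java.util.concurrent.{Callable, ExecutionException, Executors}

// Assumed sketch: the returned Throwable is an ExecutionException whose
// cause is the original error `e`, i.e. a nested (wrapped) fatal error.
def errorInThreadPool(e: => Throwable): Throwable = {
  val pool = Executors.newSingleThreadExecutor()
  try {
    pool.submit(new Callable[Unit] {
      override def call(): Unit = throw e
    }).get()
    throw new AssertionError("the submitted task should have failed")
  } catch {
    case wrapped: ExecutionException => wrapped
  } finally {
    pool.shutdown()
  }
}
```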

 * This is to avoid `StackOverflowError` when hitting a cycle in the exception chain.
 */
def isFatalError(t: Throwable, shouldDetectNestedFatalError: Boolean, depth: Int = 0): Boolean = {
  if (depth <= 5) {
@zsxwing (Member Author) commented:

I picked 5, which should be enough to cover most cases.

A reviewer (Member) commented:

Maybe just create a config with that default value, instead of the boolean config spark.executor.killOnNestedFatalError and this magic number?

@zsxwing (Member Author) replied:

Good point!

def isFatalError(t: Throwable, shouldDetectNestedFatalError: Boolean, depth: Int = 0): Boolean = {
  if (depth <= 5) {
    t match {
      case _: SparkOutOfMemoryError => false
A reviewer (Member) commented:

Just in case: are we sure that an OOM cannot be caused by a fatal error, and cannot be present somewhere deeper in the chain?

@zsxwing (Member Author) replied:

This is existing behavior. #20014 added SparkOutOfMemoryError to avoid killing the executor when the OOM is not thrown by the JVM.
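Putting the pieces together, a hedged sketch of the detection logic, reconstructed from the excerpts above (not the exact merged code): walk the cause chain and suppressed exceptions up to a maximum depth, treating SparkOutOfMemoryError as non-fatal because it is thrown by Spark rather than the JVM.

```scala
import scala.util.control.NonFatal
import org.apache.spark.memory.SparkOutOfMemoryError

// Sketch only: the merged code derives the depth limit from
// spark.executor.killOnFatalError.depth rather than a hard-coded value.
def isFatalError(t: Throwable, depthToCheck: Int): Boolean = {
  if (depthToCheck <= 0 || t == null) {
    false // the depth limit also guards against cycles in the exception chain
  } else {
    t match {
      case _: SparkOutOfMemoryError =>
        false // thrown by Spark itself, so the executor is still healthy
      case e if !NonFatal(e) =>
        true // a genuinely fatal error at this level
      case e =>
        // Recurse into the cause and any suppressed exceptions.
        isFatalError(e.getCause, depthToCheck - 1) ||
          e.getSuppressed.exists(isFatalError(_, depthToCheck - 1))
    }
  }
}
```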

@SparkQA

SparkQA commented Nov 28, 2020

Test build #131910 has finished for PR 30528 at commit 4156c03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2020

Test build #131915 has finished for PR 30528 at commit 2720b60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2020

Test build #131916 has finished for PR 30528 at commit 1ec1c1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2020

Test build #131918 has finished for PR 30528 at commit 312f042.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131918/

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @zsxwing and @MaxGekk.
Merged to master for Apache Spark 3.1.0.

@zsxwing zsxwing deleted the SPARK-33587 branch November 29, 2020 22:21