[SPARK-33587][Core] Kill the executor on nested fatal errors #30528
Conversation
Review comment on the diff:

    test("SPARK-33587: isFatalError") {
      def errorInThreadPool(e: => Throwable): Throwable = {
Trying to make this test cover the cases I mentioned in the description.
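For context, a minimal sketch of what such a helper could look like (the object and method names and the exact structure are assumptions, not the merged test code): a fatal error thrown inside a thread pool comes back from `Future.get` wrapped in an `ExecutionException`, so the fatal error only appears in the nested cause chain.

```scala
import java.util.concurrent.{Callable, ExecutionException, Executors}

object ErrorInThreadPoolSketch {
  // Assumed shape, not the exact test code: a fatal error thrown inside a thread
  // pool comes back from Future.get wrapped in an ExecutionException, so the
  // original error only shows up as the nested cause.
  def errorInThreadPool(e: => Throwable): Throwable = {
    val pool = Executors.newSingleThreadExecutor()
    try {
      pool.submit(new Callable[Unit] {
        override def call(): Unit = throw e
      }).get()
      throw new IllegalStateException("unreachable: the submitted task always throws")
    } catch {
      case ee: ExecutionException => ee // the original error is ee.getCause
    } finally {
      pool.shutdownNow()
    }
  }
}
```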
Review comment on the diff:

       * This is to avoid `StackOverflowError` when hitting a cycle in the exception chain.
       */
      def isFatalError(t: Throwable, shouldDetectNestedFatalError: Boolean, depth: Int = 0): Boolean = {
        if (depth <= 5) {
Picked 5, which should be enough to cover most cases.
Maybe just create a config with that default value, instead of the bool config `spark.executor.killOnNestedFatalError` plus this magic number?
Good point!
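A sketch of what the config-driven version could look like (illustrative only; the name `depthToCheck` and the exact matching order are assumptions, not necessarily the merged code). It counts down from a configured limit, such as the `spark.executor.killOnFatalError.depth` mentioned later in the PR description, instead of counting up to a hard-coded 5, and the limit also bounds the recursion so a cycle in the cause chain cannot trigger a `StackOverflowError`.

```scala
import scala.util.control.NonFatal

import org.apache.spark.memory.SparkOutOfMemoryError

object NestedFatalCheckSketch {
  // Illustrative only: walk the cause chain up to depthToCheck levels and report
  // whether any exception in it is fatal. Returning false once the budget is used
  // up also guards against cycles in the chain.
  def isFatalError(t: Throwable, depthToCheck: Int): Boolean = {
    if (depthToCheck <= 0) {
      false
    } else {
      t match {
        // Thrown by Spark's own memory management rather than the JVM, so it is
        // deliberately not treated as fatal (see the next comment thread).
        case _: SparkOutOfMemoryError => false
        case e if !NonFatal(e) => true
        case e if e.getCause != null => isFatalError(e.getCause, depthToCheck - 1)
        case _ => false
      }
    }
  }
}
```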
Review comment on the diff:

      def isFatalError(t: Throwable, shouldDetectNestedFatalError: Boolean, depth: Int = 0): Boolean = {
        if (depth <= 5) {
          t match {
            case _: SparkOutOfMemoryError => false
Just in case: are we sure that an OOM cannot be caused by a fatal error, and that it cannot be present somewhere in the chain?
This is existing behavior. #20014 added SparkOutOfMemoryError to avoid killing the executor when the OOM is not thrown by the JVM.
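To illustrate that distinction (a simplified stand-in, not the real `SparkOutOfMemoryError` API): Spark throws its own `OutOfMemoryError` subclass when it cannot acquire execution memory, which signals memory pressure inside Spark rather than a broken JVM, so it should not trigger an executor kill.

```scala
// Simplified stand-in for the idea behind SparkOutOfMemoryError (not the real
// class): an OutOfMemoryError subclass thrown deliberately by Spark's memory
// management instead of by the JVM.
class FakeSparkOutOfMemoryError(message: String) extends OutOfMemoryError(message)

object MemoryAcquisitionSketch {
  // Hypothetical helper: failing to acquire execution memory is an
  // application-level condition, not a JVM failure, so it should not be treated
  // as a fatal error that kills the executor.
  def acquireExecutionMemory(requested: Long, granted: Long): Long = {
    if (granted < requested) {
      throw new FakeSparkOutOfMemoryError(
        s"Unable to acquire $requested bytes of memory, got $granted")
    }
    granted
  }
}
```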
Test build #131910 has finished for PR 30528 at commit

Test build #131915 has finished for PR 30528 at commit

Test build #131916 has finished for PR 30528 at commit

Test build #131918 has finished for PR 30528 at commit

Build finished.
Refer to this link for build results (access rights to CI server needed):
dongjoon-hyun left a comment.
What changes were proposed in this pull request?
Currently we will kill the executor when hitting a fatal error. However, if the fatal error is wrapped by another exception, such as at:

- spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala, line 231 in cf98a76
- spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala, line 296 in cf98a76

we will still keep the executor running. Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be in a broken state when hitting a fatal error, and it's hard to predict the behavior of a broken component. Hence, it's better to detect the nested fatal error as well and kill the executor. Then we can rely on Spark's fault tolerance to recover.
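As a concrete illustration of the wrapping problem (a plain `RuntimeException` stands in for the wrapper used in `FileFormatWriter`; the wrapper type is not the point):

```scala
import scala.util.control.NonFatal

object WrappedFatalErrorDemo extends App {
  // The fatal error is only reachable via getCause, so checking the top-level
  // exception alone reports "non-fatal" and the executor keeps running.
  val fatal = new OutOfMemoryError("Java heap space")
  val wrapped = new RuntimeException("Task failed while writing rows.", fatal)

  println(!NonFatal(wrapped))          // false: top-level check misses the OOM
  println(!NonFatal(wrapped.getCause)) // true: the fatal error is one level down
}
```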
Why are the changes needed?
Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be in a broken state when hitting a fatal error, and it's hard to predict the behavior of a broken component. Hence, it's better to detect the nested fatal error as well and kill the executor. Then we can rely on Spark's fault tolerance to recover.
Does this PR introduce any user-facing change?
Yep. There is a slight internal behavior change on when to kill an executor. We will kill the executor when detecting a nested fatal error in the exception chain.
`spark.executor.killOnFatalError.depth` is added to allow users to turn off this change if the slight behavior change impacts them.

How was this patch tested?

The new method `Executor.isFatalError` is tested by the new unit test `SPARK-33587: isFatalError`.
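For reference, a hedged sketch of how a user might set the new config from application code (the value shown is only an example; the precise semantics of each depth value are whatever the config's documentation specifies, not asserted here):

```scala
import org.apache.spark.SparkConf

object KillOnFatalErrorConfigExample {
  // Example only: configure how deep the executor inspects the exception chain
  // when deciding whether a failure is fatal. The default and the meaning of each
  // value come from the config definition, not from this sketch.
  val conf = new SparkConf()
    .setAppName("nested-fatal-error-example")
    .set("spark.executor.killOnFatalError.depth", "1") // assumed example value
}
```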