[SPARK-29177] [Core] fix zombie tasks after stage abort #25850
Conversation
retest this please.

Test build #110991 has finished for PR 25850 at commit
Test build #110993 has finished for PR 25850 at commit
Test build #110999 has finished for PR 25850 at commit
Test build #111040 has finished for PR 25850 at commit
Test build #111042 has finished for PR 25850 at commit

@xuanyuanking Could you please help review this?
xuanyuanking left a comment:
Thanks for pinging me. I think it makes sense to handle a successful task as a killed task for resource cleanup; we did the same thing in TaskSetManager.handleSuccessfulTask for speculative tasks.
    val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
      case directResult: DirectTaskResult[_] =>
        if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
          scheduler.handleFailedTask(taskSetManager, tid, TaskState.KILLED, TaskKilled(
How about directly calling taskSetManager.handleFailedTask here?
If canFetchMoreResults returns false, taskSetManager.isZombie has already been set to true, so scheduler.handleFailedTask is effectively the same as taskSetManager.handleFailedTask, and the latter would make the UT easier to write.
Calling scheduler.handleFailedTask keeps this consistent with the other cases in this function.
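The behavior under discussion can be sketched as a small, self-contained model. The classes below are simplified stand-ins for Spark's real TaskSetManager, scheduler, and TaskKilled types, not the actual API; they only illustrate the idea that an oversized result now flows through the failed-task path (as KILLED) so the scheduler releases the task's resources instead of leaving a zombie:

```scala
// Minimal model of the SPARK-29177 fix. Class and method names mirror
// Spark's, but everything here is a simplified stand-in for illustration.
object ZombieTaskSketch {
  sealed trait TaskEndReason
  case class TaskKilled(reason: String) extends TaskEndReason

  class TaskSetManager(val maxResultSize: Long) {
    var isZombie = false
    private var totalResultSize = 0L

    // Returns false (and marks the task set as zombie) once the
    // accumulated size of fetched results exceeds the limit.
    def canFetchMoreResults(size: Long): Boolean = {
      totalResultSize += size
      if (totalResultSize > maxResultSize) { isZombie = true; false }
      else true
    }
  }

  class Scheduler {
    var cleanedUp: List[(Long, TaskEndReason)] = Nil
    // Stand-in for the failed-task path that frees the task's resources.
    def handleFailedTask(tsm: TaskSetManager, tid: Long, reason: TaskEndReason): Unit =
      cleanedUp ::= (tid -> reason)
  }

  // Stand-in for the result-fetching step in TaskResultGetter.
  def enqueueResult(scheduler: Scheduler, tsm: TaskSetManager,
                    tid: Long, resultSize: Long): Unit = {
    if (!tsm.canFetchMoreResults(resultSize)) {
      // The fix: report the oversized result as a killed task so the
      // scheduler cleans up, instead of dropping it and leaving a zombie.
      scheduler.handleFailedTask(tsm, tid,
        TaskKilled("Task result size exceeded maxResultSize"))
    }
    // otherwise the result would be deserialized and handled normally
  }

  def main(args: Array[String]): Unit = {
    val tsm = new TaskSetManager(maxResultSize = 100L)
    val scheduler = new Scheduler
    enqueueResult(scheduler, tsm, tid = 1L, resultSize = 500L)
    assert(tsm.isZombie)
    assert(scheduler.cleanedUp.nonEmpty)
    println(scheduler.cleanedUp)
  }
}
```

In this toy model, either call site would produce the same cleanup, which is why the thread treats the choice between scheduler.handleFailedTask and taskSetManager.handleFailedTask as a question of consistency and testability rather than behavior.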
core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala (outdated; resolved)
Test build #111119 has finished for PR 25850 at commit
xuanyuanking left a comment:
Just one nit.
LGTM, cc @jiangxb1987 @cloud-fan
    val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
      case directResult: DirectTaskResult[_] =>
        if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
          scheduler.handleFailedTask(taskSetManager, tid, TaskState.KILLED, TaskKilled(
Better to leave a comment here explaining why we handle the oversized task as a killed task.
Updated, thanks.
add comments
Test build #111204 has finished for PR 25850 at commit
What changes were proposed in this pull request?
Perform task handling even when the task result exceeds the configured maxResultSize. More details are in the JIRA description: https://issues.apache.org/jira/browse/SPARK-29177

Why are the changes needed?
Without this patch, zombie tasks prevent YARN from recycling the containers running those tasks, which affects other applications.

Does this PR introduce any user-facing change?
No

How was this patch tested?
Unit test, and a production test with a very large SELECT in Spark Thriftserver.

Closes #25850 from adrian-wang/zombie.
Authored-by: Daoyuan Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c08bc37)
Signed-off-by: Wenchen Fan <[email protected]>
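For context, the limit this PR deals with is Spark's spark.driver.maxResultSize setting (default 1g): when the total size of serialized task results fetched to the driver exceeds it, canFetchMoreResults fails, and with this patch the affected task is reported as killed rather than left as a zombie holding its container. The limit can be raised per job; the application class and jar below are placeholders:

```shell
# Raise the cap on total serialized result size fetched to the driver.
# com.example.MyApp and my-app.jar are placeholders for your application.
spark-submit \
  --conf spark.driver.maxResultSize=4g \
  --class com.example.MyApp \
  my-app.jar
```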
thanks, merging to master/2.4!

@cloud-fan @adrian-wang oops, looks like this doesn't compile in 2.4. Want to revert it or just hot-fix forward? It may be pretty easy.

@srowen thanks for catching! I've pushed a commit to fix it.