[SPARK-29177] [Core] fix zombie tasks after stage abort #25850
Conversation
retest this please.

Test build #110991 has finished for PR 25850 at commit
Test build #110993 has finished for PR 25850 at commit
Test build #110999 has finished for PR 25850 at commit
Test build #111040 has finished for PR 25850 at commit
Test build #111042 has finished for PR 25850 at commit

@xuanyuanking Could you please help review this?
xuanyuanking left a comment:
Thanks for pinging me. I think it makes sense to handle a successful task as a killed task for resource cleanup; we did the same thing in TaskSetManager.handleSuccessfulTask for speculative tasks.
    val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
      case directResult: DirectTaskResult[_] =>
        if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
          scheduler.handleFailedTask(taskSetManager, tid, TaskState.KILLED, TaskKilled(
How about directly calling taskSetManager.handleFailedTask here?
If canFetchMoreResults returns false, taskSetManager.isZombie has already been set to true, so scheduler.handleFailedTask is effectively the same as taskSetManager.handleFailedTask, and the latter would make the UT easier to write.
Calling scheduler.handleFailedTask keeps this consistent with the other cases in this function.
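The behavior under discussion can be sketched as a small, self-contained model. The classes below are simplified stand-ins for Spark's real TaskSetManager, scheduler, and TaskKilled types, not the actual API; they only illustrate the idea that an oversized result now flows through the failed-task path (as KILLED) so the scheduler releases the task's resources instead of leaving a zombie:

```scala
// Minimal model of the SPARK-29177 fix. Class and method names mirror
// Spark's, but everything here is a simplified stand-in for illustration.
object ZombieTaskSketch {
  sealed trait TaskEndReason
  case class TaskKilled(reason: String) extends TaskEndReason

  class TaskSetManager(val maxResultSize: Long) {
    var isZombie = false
    private var totalResultSize = 0L

    // Returns false (and marks the task set as zombie) once the
    // accumulated size of fetched results exceeds the limit.
    def canFetchMoreResults(size: Long): Boolean = {
      totalResultSize += size
      if (totalResultSize > maxResultSize) { isZombie = true; false }
      else true
    }
  }

  class Scheduler {
    var cleanedUp: List[(Long, TaskEndReason)] = Nil
    // Stand-in for the failed-task path that frees the task's resources.
    def handleFailedTask(tsm: TaskSetManager, tid: Long, reason: TaskEndReason): Unit =
      cleanedUp ::= (tid -> reason)
  }

  // Stand-in for the result-fetching step in TaskResultGetter.
  def enqueueResult(scheduler: Scheduler, tsm: TaskSetManager,
                    tid: Long, resultSize: Long): Unit = {
    if (!tsm.canFetchMoreResults(resultSize)) {
      // The fix: report the oversized result as a killed task so the
      // scheduler cleans up, instead of dropping it and leaving a zombie.
      scheduler.handleFailedTask(tsm, tid,
        TaskKilled("Task result size exceeded maxResultSize"))
    }
    // otherwise the result would be deserialized and handled normally
  }

  def main(args: Array[String]): Unit = {
    val tsm = new TaskSetManager(maxResultSize = 100L)
    val scheduler = new Scheduler
    enqueueResult(scheduler, tsm, tid = 1L, resultSize = 500L)
    assert(tsm.isZombie)
    assert(scheduler.cleanedUp.nonEmpty)
    println(scheduler.cleanedUp)
  }
}
```

In this toy model, either call site would produce the same cleanup, which is why the thread treats the choice between scheduler.handleFailedTask and taskSetManager.handleFailedTask as a question of consistency and testability rather than behavior.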
core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala (outdated; resolved)
Test build #111119 has finished for PR 25850 at commit
xuanyuanking left a comment:
Just one nit.
LGTM, cc @jiangxb1987 @cloud-fan
    val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
      case directResult: DirectTaskResult[_] =>
        if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
          scheduler.handleFailedTask(taskSetManager, tid, TaskState.KILLED, TaskKilled(
Better to leave a comment here explaining why we handle the oversized task as a killed task.
Updated, thanks.
add comments
Test build #111204 has finished for PR 25850 at commit
What changes were proposed in this pull request?
Perform task handling even when the task result exceeds the configured maxResultSize. More details are in the JIRA description: https://issues.apache.org/jira/browse/SPARK-29177

Why are the changes needed?
Without this patch, zombie tasks prevent YARN from recycling the containers running those tasks, which affects other applications.

Does this PR introduce any user-facing change?
No

How was this patch tested?
Unit test, and a production test with a very large SELECT in Spark Thriftserver.

Closes #25850 from adrian-wang/zombie.
Authored-by: Daoyuan Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c08bc37)
Signed-off-by: Wenchen Fan <[email protected]>
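For context, the limit this PR deals with is Spark's spark.driver.maxResultSize setting (default 1g): when the total size of serialized task results fetched to the driver exceeds it, canFetchMoreResults fails, and with this patch the affected task is reported as killed rather than left as a zombie holding its container. The limit can be raised per job; the application class and jar below are placeholders:

```shell
# Raise the cap on total serialized result size fetched to the driver.
# com.example.MyApp and my-app.jar are placeholders for your application.
spark-submit \
  --conf spark.driver.maxResultSize=4g \
  --class com.example.MyApp \
  my-app.jar
```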
thanks, merging to master/2.4!

@cloud-fan @adrian-wang oops, looks like this doesn't compile in 2.4. Want to revert it or just hot-fix forward? It may be pretty easy.

@srowen thanks for catching! I've pushed a commit to fix it.