[SPARK-16709][CORE] Kill the running task if stage failed #14557
Test build #63421 has finished for PR 14557 at commit
    }
  }

  def killTasks(tasks: HashSet[Long], taskInfo: HashMap[Long, TaskInfo]): Boolean = {
It is not suitable to add a public method here in SparkContext. SparkContext is a public entry point, so any method added to it should be considered carefully. In your case, it looks like only Spark internals will use this method, so why not change TaskSetManager directly?
Jerryshao, thanks for the suggestion. I will move the method to TaskSetManager.
Test build #63424 has finished for PR 14557 at commit
Test build #63434 has finished for PR 14557 at commit
    maybeFinishTaskSet()

    // kill running task if stage failed
    if(reason.isInstanceOf[FetchFailed]) {
Add a space between if and (.
LGTM
Test build #63768 has finished for PR 14557 at commit
There are multiple issues with this PR. Some are at a more stylistic level, but some include deeper issues -- e.g. see SPARK-17064. Most fundamentally, this PR is the wrong solution at least in the sense that it does not implement a minimal fix without other side effects. The problem is that TaskCommitDenied is not being handled properly when a duplicate Task tries to commit a result that has already been successfully committed by another attempt of this Task. The proper fix needs to be at that point of committing duplicate results, not by making the larger, unnecessary change in how we handle cancellation/interruption of other Tasks in a TaskSet when one of them produces a FetchFailed.
@shenh062326 I would like to propose closing this PR if there is no argument against the comment above.
## What changes were proposed in this pull request?

This PR proposes to close PRs ...

- inactive to the review comments more than a month
- WIP and inactive more than a month
- with Jenkins build failure but inactive more than a month
- suggested to be closed and no comment against that
- obviously looking inappropriate (e.g., Branch 0.5)

To make sure, I left a comment for each PR about a week ago and I could not have a response back from the author in these PRs below:

Closes apache#11129
Closes apache#12085
Closes apache#12162
Closes apache#12419
Closes apache#12420
Closes apache#12491
Closes apache#13762
Closes apache#13837
Closes apache#13851
Closes apache#13881
Closes apache#13891
Closes apache#13959
Closes apache#14091
Closes apache#14481
Closes apache#14547
Closes apache#14557
Closes apache#14686
Closes apache#15594
Closes apache#15652
Closes apache#15850
Closes apache#15914
Closes apache#15918
Closes apache#16285
Closes apache#16389
Closes apache#16652
Closes apache#16743
Closes apache#16893
Closes apache#16975
Closes apache#17001
Closes apache#17088
Closes apache#17119
Closes apache#17272
Closes apache#17971

Added:
Closes apache#17778
Closes apache#17303
Closes apache#17872

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18017 from HyukjinKwon/close-inactive-prs.
What changes were proposed in this pull request?
In SPARK-16709, when a stage fails while one of its tasks is still running, the retried stage reruns that task. This can cause a TaskCommitDeniedException, after which the task retries forever.
Here is the log:
1. task 1.0 in stage 1.0 starts.
2. stage 1.0 fails; stage 1.1 starts.
3. task 1.0 in stage 1.1 starts.
4. task 1.0 in stage 1.0 finishes.
5. task 1.0 in stage 1.1 fails with a TaskCommitDenied exception, then retries forever.
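The sequence above can be sketched with a toy model. This is only an illustration under stated assumptions: `Attempt`, `ToyCommitCoordinator`, and `canCommit` are hypothetical names invented here, not Spark's actual classes or signatures. The sketch assumes a "first committer wins" rule per partition, which is enough to show why every retry in stage 1.1 keeps getting denied once the attempt from stage 1.0 has committed.

```scala
import scala.collection.mutable

// Hypothetical toy model for illustration only; these are not Spark classes.
final case class Attempt(stageAttempt: Int, partition: Int, attemptNumber: Int)

final class ToyCommitCoordinator {
  // partition -> the attempt that was first authorized to commit
  private val winners = mutable.Map.empty[Int, Attempt]

  /** The first attempt to ask for a partition wins; any other attempt is denied. */
  def canCommit(attempt: Attempt): Boolean = synchronized {
    winners.get(attempt.partition) match {
      case Some(winner) => winner == attempt
      case None =>
        winners(attempt.partition) = attempt
        true
    }
  }
}

object CommitDeniedDemo {
  def main(args: Array[String]): Unit = {
    val coordinator = new ToyCommitCoordinator

    // Step 4: task 1.0 from the failed stage 1.0 is still running and commits.
    val original = Attempt(stageAttempt = 0, partition = 1, attemptNumber = 0)
    println(s"stage 1.0 attempt allowed to commit: ${coordinator.canCommit(original)}")

    // Step 5: each retry of the same partition in stage 1.1 is a new attempt,
    // and each one is denied because the partition already has a winner.
    (0 until 3).foreach { n =>
      val rerun = Attempt(stageAttempt = 1, partition = 1, attemptNumber = n)
      println(s"stage 1.1 attempt $n allowed to commit: ${coordinator.canCommit(rerun)}")
    }
  }
}
```

In this model the denial is permanent for the partition, so treating it as an ordinary failure and rescheduling the task can never succeed. That matches the reviewer's point that the handling of the denied duplicate commit is where the behavior goes wrong.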
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)