[SPARK-14915] [CORE] Don't re-queue a task if another attempt has already succeeded #12751
Conversation
Would appreciate your thoughts on this change (or from anybody else you'd recommend who has some experience with the task scheduler). As an aside, this probably calls into question the current behavior of allowing a TaskCommitDenied to be retried without limit; see the comment on countTowardsTaskFailures. I haven't tested changing that, but I'm finding that tasks which fail this way are now (with this change) only attempted once or twice before the other copy, the one holding the commit lock, registers as successful.
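For context on that aside: in the Spark source of this era, TaskCommitDenied is a TaskFailedReason that opts out of failure counting, which is why it can be retried without limit. A minimal sketch, assumed from memory of TaskEndReason.scala rather than copied verbatim:

```scala
// Minimal stand-in for Spark's TaskFailedReason trait (simplified sketch):
trait TaskFailedReason {
  def toErrorString: String
  // Defaults to true; failure reasons that should not trip
  // spark.task.maxFailures override this to false.
  def countTowardsTaskFailures: Boolean = true
}

// Sketch of TaskCommitDenied: because countTowardsTaskFailures is false,
// commit-denied failures never count against spark.task.maxFailures, so
// (before this PR) the task could be re-queued indefinitely.
case class TaskCommitDenied(
    jobID: Int,
    partitionID: Int,
    attemptNumber: Int) extends TaskFailedReason {
  override def toErrorString: String =
    s"TaskCommitDenied (Driver denied task commit) for job: $jobID, " +
      s"partition: $partitionID, attemptNumber: $attemptNumber"
  override def countTowardsTaskFailures: Boolean = false
}
```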
Although that makes some logical sense to me, I'd really like to hear an expert weigh in. Also paging @markhamstra @pwendell. It seems like Andrew conceptually agreed with this change.
Jenkins test this please
Test build #57534 has finished for PR 12751 at commit
Yeah, conceptually this LGTM. With speculation, if one attempt of a task has already succeeded, then the slower failed attempt should not be retried. @kayousterhout should sign off in case there's a corner case we're missing, though.
```scala
if (successful(index)) {
  logWarning(
    s"Task ${info.id} in stage ${taskSet.id} (TID $tid) will not be re-queued " +
```
This change mostly looks good, but can you make this log message a little clearer? I'd move it to logInfo (since it's not something the user should be concerned about, so doesn't seem severe enough for warning) and make it say something like "Task ${info.id} in stage ${taskSet.id} failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed."
Too easy, thanks for the review!
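For reference, a sketch of how the guard reads after that feedback, using the log message suggested above; the else branch re-queuing via addPendingTask is assumed from the surrounding TaskSetManager code, so treat this as an illustration rather than the exact merged diff:

```scala
// Inside TaskSetManager.handleFailedTask (sketch): `successful`, `info`,
// `taskSet`, `tid`, and `addPendingTask` are existing members of the class.
// Only re-queue the failed attempt if no other attempt has already succeeded.
if (successful(index)) {
  logInfo(s"Task ${info.id} in stage ${taskSet.id} failed, but another instance of " +
    "the task has already succeeded, so not re-queuing the task to be re-executed.")
} else {
  addPendingTask(index)
}
```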
With the logging fix, this LGTM! Thanks for fixing this!
Done! Thanks for the review.

Jenkins retest this please
Test build #57864 has finished for PR 12751 at commit
[SPARK-14915] [CORE] Don't re-queue a task if another attempt has already succeeded. Author: Jason Moore &lt;[email protected]&gt; Closes #12751 from jasonmoore2k/SPARK-14915. (cherry picked from commit 77361a4) Signed-off-by: Sean Owen &lt;[email protected]&gt;
Merged to master/2.0
It seems reasonable if it applies to the same code path cleanly in those branches. Any other opinions?
Great, thanks! Not too concerned with 1.5 (and after checking it out, it looks like the patch doesn't apply cleanly there), but I've been testing this patch on top of the 1.6 branch. If a 1.6.2 patch gets cut, I'd really like this to be included.
[SPARK-14915] [CORE] Don't re-queue a task if another attempt has already succeeded (apache#12751)
I'm ok with merging this into 1.6. Usually I argue against back-porting scheduler patches because they tend to be pretty risky and have high potential for serious regressions, but this particular change seems to be fixing a somewhat-bad bug, and is also very surgical. |
I back-ported to 1.6.
Ta!
@srowen, it looks like the new code block was added twice into branch-1.6.
@zzcclp Oops, weird, I must have somehow missed a step in resolving the merge conflict. I'll get that fixed ASAP.
Fixing the back-port in #12950 |
[SPARK-14915] [CORE] Don't re-queue a task if another attempt has already succeeded. Closes apache#12751 from jasonmoore2k/SPARK-14915. (cherry picked from commit 77361a4) (cherry picked from commit bf3c060)
## What changes were proposed in this pull request?

Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied from committing its result because another copy of the task has already succeeded.

## How was this patch tested?

I'm running a job with enough skew in per-task processing time for speculation to trigger over the last quarter of tasks (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were retried over and over until the stage eventually completed, wasting compute resources on superfluous attempts. With this change (applied to the 1.6 branch), they are no longer retried and the stage completes successfully without the extra task attempts.
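For anyone trying to reproduce this, a minimal sketch of the speculation settings involved (spark.speculation, spark.speculation.quantile, and spark.speculation.multiplier are real configuration keys; the 0.75 default quantile is what makes speculation kick in over the last quarter of tasks mentioned above; the app name is just a placeholder):

```scala
import org.apache.spark.SparkConf

// Enable speculative execution; the values shown are the defaults that the
// test description above relies on.
val conf = new SparkConf()
  .setAppName("speculation-commit-denied-repro")  // hypothetical app name
  .set("spark.speculation", "true")
  // Start speculating once 75% of tasks in a stage have finished:
  .set("spark.speculation.quantile", "0.75")
  // Speculate tasks running slower than 1.5x the median task time:
  .set("spark.speculation.multiplier", "1.5")
```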