
Conversation

@jasonmoore2k
Contributor

What changes were proposed in this pull request?

Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied permission to commit its result because another copy of the task has already succeeded.
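
In essence, the change guards the re-queue call in TaskSetManager.handleFailedTask. A simplified sketch follows (paraphrased, not the exact diff; the final wording of the log message is settled during review below):

if (successful(index)) {
  // Another attempt of this task already succeeded (e.g. the speculative copy lost the
  // commit race), so don't put it back on the pending queue.
  logInfo(s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
    "but another instance of the task has already succeeded, " +
    "so not re-queuing the task to be re-executed.")
} else {
  // Previous behaviour: always re-queue the failed attempt.
  addPendingTask(index)
}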

How was this patch tested?

I'm running a job with enough skew in processing time across tasks that speculation triggers for the last quarter of them (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were retried over and over until the stage eventually completed, wasting compute resources on the superfluous attempts. With this change (applied to the 1.6 branch), they are no longer retried and the stage completes successfully without the extra task attempts.
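
For reference, a minimal sketch of the speculation settings involved (values shown are the Spark defaults once speculation is enabled; the 0.75 quantile is why speculative copies start for the last quarter of tasks):

import org.apache.spark.SparkConf

// Illustrative only: speculation is off by default and must be enabled to hit this code path.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating (default)
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median a task must be (default)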

@jasonmoore2k
Contributor Author

jasonmoore2k commented Apr 28, 2016

@andrewor14 @kayousterhout

Would appreciate your thoughts on this change (or those of anybody else you'd recommend who has some experience with the task scheduler).

As an aside, the current behavior of allowing a TaskCommitDenied to be retried without limit should probably be called into question; see the comment on countTowardsTaskFailures. I haven't tested changing that, but I'm finding that the tasks that fail this way are now (with this change) only attempted once or twice before the other copy, which holds the commit lock, registers as successful.
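
For context, a self-contained paraphrase of the relevant pieces of org.apache.spark.TaskEndReason (this is not the actual Spark source; field names are approximate and vary slightly between versions):

// Simplified stand-in for the real trait in org.apache.spark.
trait TaskFailedReason {
  def toErrorString: String
  // Whether this failure counts towards spark.task.maxFailures.
  def countTowardsTaskFailures: Boolean = true
}

// A denied commit almost always means another attempt already committed the output, so it is
// treated as not the task's fault and does not count towards the failure limit, which is why,
// before this change, a denied speculative attempt could be retried without limit.
case class TaskCommitDenied(jobId: Int, partitionId: Int, attemptNumber: Int)
  extends TaskFailedReason {
  override def toErrorString: String =
    s"TaskCommitDenied (Driver denied task commit) for job=$jobId, partition=$partitionId, attempt=$attemptNumber"
  override def countTowardsTaskFailures: Boolean = false
}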

@srowen
Member

srowen commented May 2, 2016

Although that makes some logical sense to me, I'd really like to hear an expert weigh in. Also paging @markhamstra @pwendell. It seems like Andrew conceptually agreed with this change.

@srowen
Member

srowen commented May 2, 2016

Jenkins test this please

@SparkQA

SparkQA commented May 2, 2016

Test build #57534 has finished for PR 12751 at commit a3e69c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Yeah, conceptually this LGTM. With speculation, if a task has already succeeded then the slower, failed attempt should not be retried. @kayousterhout should sign off in case there's a corner case we're missing, though.


if (successful(index)) {
  logWarning(
    s"Task ${info.id} in stage ${taskSet.id} (TID $tid) will not be re-queued " +
Contributor

This change mostly looks good, but can you make this log message a little clearer? I'd move it to logInfo (it's not something the user should be concerned about, so it doesn't seem severe enough for a warning) and make it say something like "Task ${info.id} in stage ${taskSet.id} failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed."

Contributor Author

Too easy, thanks for the review!

@kayousterhout
Contributor

With the logging fix, this LGTM! Thanks for fixing this!

@jasonmoore2k
Contributor Author

Done! Thanks for the review.

@srowen
Member

srowen commented May 5, 2016

Jenkins retest this please

@SparkQA

SparkQA commented May 5, 2016

Test build #57864 has finished for PR 12751 at commit fa6068a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 5, 2016
…ady succeeded

## What changes were proposed in this pull request?

Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied permission to commit its result because another copy of the task has already succeeded.

## How was this patch tested?

I'm running a job with enough skew in processing time across tasks that speculation triggers for the last quarter of them (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were retried over and over until the stage eventually completed, wasting compute resources on the superfluous attempts. With this change (applied to the 1.6 branch), they are no longer retried and the stage completes successfully without the extra task attempts.

Author: Jason Moore <[email protected]>

Closes #12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented May 5, 2016

Merged to master/2.0

@asfgit asfgit closed this in 77361a4 May 5, 2016
@jasonmoore2k
Contributor Author

jasonmoore2k commented May 5, 2016

@srowen Any chance of getting this picked onto the 1.5 and 1.6 branches too?

(To make sure the issue revealed by #12228, which was also merged onto those branches, is dealt with.)

@srowen
Member

srowen commented May 5, 2016

It seems reasonable if it applies cleanly to the same code path in those branches. Any other opinions?
I doubt there will be another 1.5.x release after this, and I'm not sure about further 1.6.x releases.

@jasonmoore2k
Contributor Author

Great, thanks! I'm not too concerned about 1.5 (and after checking, it looks like the patch doesn't apply cleanly there), but I've been testing this patch on top of the 1.6 branch. If a 1.6.2 release gets cut, I'd really like this to be included.

zzcclp added a commit to zzcclp/spark that referenced this pull request May 5, 2016
@kayousterhout
Contributor

I'm ok with merging this into 1.6. Usually I argue against back-porting scheduler patches because they tend to be pretty risky and have high potential for serious regressions, but this particular change seems to be fixing a somewhat-bad bug, and is also very surgical.

asfgit pushed a commit that referenced this pull request May 5, 2016
…ady succeeded

Author: Jason Moore <[email protected]>

Closes #12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented May 5, 2016

I've back-ported it to 1.6.

@jasonmoore2k
Contributor Author

Ta!

@zzcclp
Contributor

zzcclp commented May 6, 2016

@srowen, the code below was added twice into branch-1.6:
if (successful(index)) {
  logInfo(
    s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
    "but another instance of the task has already succeeded, " +
    "so not re-queuing the task to be re-executed.")
} else {
  addPendingTask(index)
}

@srowen
Member

srowen commented May 6, 2016

@zzcclp Oops, weird; I must have somehow missed a step in resolving the merge conflict. I'll get that fixed ASAP.

@srowen
Member

srowen commented May 6, 2016

Fixing the back-port in #12950

zzcclp pushed a commit to zzcclp/spark that referenced this pull request May 9, 2016
…ady succeeded

Author: Jason Moore <[email protected]>

Closes apache#12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit bf3c060)