
Conversation

@jasonmoore2k
Contributor

What changes were proposed in this pull request?

Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied permission to commit its result because another copy of the task has already succeeded.
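
In essence, the change guards the re-queue call in TaskSetManager.handleFailedTask. A simplified sketch follows (paraphrased, not the exact diff; the final wording of the log message is settled during review below):

if (successful(index)) {
  // Another attempt of this task already succeeded (e.g. the speculative copy lost the
  // commit race), so don't put it back on the pending queue.
  logInfo(s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
    "but another instance of the task has already succeeded, " +
    "so not re-queuing the task to be re-executed.")
} else {
  // Previous behaviour: always re-queue the failed attempt.
  addPendingTask(index)
}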

How was this patch tested?

I'm running a job with enough skew in processing time across tasks that speculation triggers for the last quarter of them (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were retried over and over until the stage eventually completed, wasting compute resources on the superfluous attempts. With this change (applied to the 1.6 branch), they are no longer retried and the stage completes successfully without the extra task attempts.
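
For reference, a minimal sketch of the speculation settings involved (values shown are the Spark defaults once speculation is enabled; the 0.75 quantile is why speculative copies start for the last quarter of tasks):

import org.apache.spark.SparkConf

// Illustrative only: speculation is off by default and must be enabled to hit this code path.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating (default)
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median a task must be (default)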

@jasonmoore2k
Contributor Author

jasonmoore2k commented Apr 28, 2016

@andrewor14 @kayousterhout

Would appreciate your thoughts on this change (or those of anybody else you'd recommend who has some experience with the task scheduler).

As an aside, the current behavior of allowing a TaskCommitDenied to be retried without limit should probably be called into question; see the comment on countTowardsTaskFailures. I haven't tested changing that, but I'm finding that the tasks that fail this way are now (with this change) only attempted once or twice before the other copy, which holds the commit lock, registers as successful.
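
For context, a self-contained paraphrase of the relevant pieces of org.apache.spark.TaskEndReason (this is not the actual Spark source; field names are approximate and vary slightly between versions):

// Simplified stand-in for the real trait in org.apache.spark.
trait TaskFailedReason {
  def toErrorString: String
  // Whether this failure counts towards spark.task.maxFailures.
  def countTowardsTaskFailures: Boolean = true
}

// A denied commit almost always means another attempt already committed the output, so it is
// treated as not the task's fault and does not count towards the failure limit, which is why,
// before this change, a denied speculative attempt could be retried without limit.
case class TaskCommitDenied(jobId: Int, partitionId: Int, attemptNumber: Int)
  extends TaskFailedReason {
  override def toErrorString: String =
    s"TaskCommitDenied (Driver denied task commit) for job=$jobId, partition=$partitionId, attempt=$attemptNumber"
  override def countTowardsTaskFailures: Boolean = false
}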

@srowen
Member

srowen commented May 2, 2016

Although that makes some logical sense to me, I'd really like to hear an expert weigh in. Also paging @markhamstra @pwendell. It seems like Andrew conceptually agreed with this change.

@srowen
Member

srowen commented May 2, 2016

Jenkins test this please

@SparkQA

SparkQA commented May 2, 2016

Test build #57534 has finished for PR 12751 at commit a3e69c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Yeah, conceptually this LGTM. With speculation, if a task has already succeeded then the slower, failed attempt should not be retried. @kayousterhout should sign off in case there's a corner case we're missing, though.


if (successful(index)) {
  logWarning(
    s"Task ${info.id} in stage ${taskSet.id} (TID $tid) will not be re-queued " +
Contributor

This change mostly looks good, but can you make this log message a little clearer? I'd move it to logInfo (it's not something the user should be concerned about, so it doesn't seem severe enough for a warning) and make it say something like "Task ${info.id} in stage ${taskSet.id} failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed."

Contributor Author

Too easy, thanks for the review!

@kayousterhout
Contributor

With the logging fix, this LGTM! Thanks for fixing this!

@jasonmoore2k
Contributor Author

Done! Thanks for the review.

@srowen
Member

srowen commented May 5, 2016

Jenkins retest this please

@SparkQA

SparkQA commented May 5, 2016

Test build #57864 has finished for PR 12751 at commit fa6068a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 5, 2016
…ady succeeded

## What changes were proposed in this pull request?

Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied permission to commit its result because another copy of the task has already succeeded.

## How was this patch tested?

I'm running a job with enough skew in processing time across tasks that speculation triggers for the last quarter of them (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were retried over and over until the stage eventually completed, wasting compute resources on the superfluous attempts. With this change (applied to the 1.6 branch), they are no longer retried and the stage completes successfully without the extra task attempts.

Author: Jason Moore <[email protected]>

Closes #12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented May 5, 2016

Merged to master/2.0

@asfgit asfgit closed this in 77361a4 May 5, 2016
@jasonmoore2k
Contributor Author

jasonmoore2k commented May 5, 2016

@srowen Any chance of getting this picked onto the 1.5 and 1.6 branches too?

(To make sure the issue revealed by #12228, which was also merged onto those branches, is dealt with.)

@srowen
Member

srowen commented May 5, 2016

It seems reasonable if it applies cleanly to the same code path in those branches. Any other opinions?
I doubt there will be another 1.5.x release after this, and I'm not sure about further 1.6.x releases.

@jasonmoore2k
Contributor Author

Great, thanks! I'm not too concerned about 1.5 (and after checking, it looks like the patch doesn't apply cleanly there), but I've been testing this patch on top of the 1.6 branch. If a 1.6.2 release gets cut, I'd really like this to be included.

zzcclp added a commit to zzcclp/spark that referenced this pull request May 5, 2016
@kayousterhout
Contributor

I'm ok with merging this into 1.6. Usually I argue against back-porting scheduler patches because they tend to be pretty risky and have high potential for serious regressions, but this particular change seems to be fixing a somewhat-bad bug, and is also very surgical.

asfgit pushed a commit that referenced this pull request May 5, 2016
…ady succeeded

Author: Jason Moore <[email protected]>

Closes #12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented May 5, 2016

I've back-ported it to 1.6.

@jasonmoore2k
Contributor Author

Ta!

@zzcclp
Contributor

zzcclp commented May 6, 2016

@srowen, the code below was added twice into branch-1.6:
if (successful(index)) {
  logInfo(
    s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
    "but another instance of the task has already succeeded, " +
    "so not re-queuing the task to be re-executed.")
} else {
  addPendingTask(index)
}

@srowen
Member

srowen commented May 6, 2016

@zzcclp Oops, weird; I must have somehow missed a step in resolving the merge conflict. I'll get that fixed ASAP.

@srowen
Member

srowen commented May 6, 2016

Fixing the back-port in #12950

zzcclp pushed a commit to zzcclp/spark that referenced this pull request May 9, 2016
…ady succeeded

Author: Jason Moore <[email protected]>

Closes apache#12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit bf3c060)