[SPARK-7308] prevent concurrent attempts for one stage #5964
Conversation
…rtial fix, still have some concurrent attempts
ignored b/c it spawns 10 executors, takes about 2 mins on my laptop, and makes everything pretty sluggish -- I didn't want to swamp jenkins. I tried a variety of permutations and this consistently demonstrated the problem for me, but maybe we can pare this down some. (Or maybe we need another home for tests like this?)
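For reference, this is roughly the ScalaTest pattern being described (a minimal sketch; the suite and test names here are hypothetical, not the actual test in this PR):

import org.scalatest.FunSuite

class ConcurrentStageAttemptsSuite extends FunSuite {
  // ignore() keeps the suite compiling but skips the body in regular builds;
  // flip it back to test() to reproduce the concurrent-attempt behavior locally.
  ignore("stress test: concurrent attempts for one stage") {
    // spin up a local-cluster SparkContext with several executors,
    // run a multi-stage shuffle job, and simulate losing an executor mid-stage
  }
}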
Merged build triggered.
Merged build started.
Really, stage 3 should have 0 failures as well; I still need to solve that.
Test build #32073 has started for PR 5964 at commit
This addresses issue (1): if we get a fetch failure, but we've already failed the attempt of the stage that caused the fetch failure, then do not resubmit the stage again. (Lots of other small changes to add stageAttemptId to the task.)
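As a rough illustration of "add stageAttemptId to the task" (a simplified sketch, not the actual diff; everything here other than the stageAttemptId field is an assumption):

package org.apache.spark.scheduler

// Each task records which attempt of its stage launched it, so that when a
// late FetchFailed arrives the DAGScheduler can tell whether it came from the
// attempt it is currently tracking or from an older, already-failed attempt.
private[spark] abstract class Task[T](
    val stageId: Int,
    val stageAttemptId: Int,  // new: the stage attempt that launched this task
    val partitionId: Int)
  extends Serializable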
This block is a little awkward. Being a little more explicit would be good here.
Something like this:

// it is possible the failure has already been handled by the scheduler
val failureRequiresHandling = runningStages.contains(failedStage)
if (failureRequiresHandling) {
  val stageHasFailed = failedStage.attemptId - 1 > task.stageAttemptId
  if (stageHasFailed) {
    ...
  }
}
I'm confused about how this works. Doesn't the stage still get added to failed stages on line 1149, so it will still be resubmitted?
hmm, good point. I think it works in my existing test case because submitStage already checks whether the stage is running before submitting it. So now this makes the stage simultaneously running and failed :/ Most likely this would cause problems if my test case had an even longer pipeline of stages in one job: at some point a later attempt of this stage would succeed, so the stage would no longer be running and only be failed, and then it would get resubmitted for no reason. This is just off the top of my head, though ... I'll need to look more carefully and try some more cases to see what is going on here.
(btw, thanks for looking at it in this state, I do still plan on splitting this apart some, just keep getting sidetracked ...)
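To make the scenario above concrete, here is a toy model of the bookkeeping involved (a sketch with hypothetical stand-ins, not the actual DAGScheduler code):

import scala.collection.mutable

object StageStateSketch extends App {
  // stand-ins for the scheduler's bookkeeping sets
  val waitingStages = mutable.HashSet.empty[Int]
  val runningStages = mutable.HashSet.empty[Int]
  val failedStages = mutable.HashSet.empty[Int]

  // submitStage-style guard: a stage that is already waiting, running, or failed is skipped
  def wouldSubmit(stage: Int): Boolean =
    !waitingStages(stage) && !runningStages(stage) && !failedStages(stage)

  val stage = 2
  runningStages += stage  // an older attempt of stage 2 is still running
  failedStages += stage   // the fetch failure also marked the stage as failed

  println(wouldSubmit(stage))  // false: the failed entry is masked while the stage is running

  // once the running attempt finishes and the failed-stage resubmission pulls the
  // stage back out of failedStages, nothing masks it any more, so it gets
  // resubmitted even though no tasks actually need to be re-run
  runningStages -= stage
  failedStages -= stage
  println(wouldSubmit(stage))  // true: spurious resubmission
}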
Test build #32073 has finished for PR 5964 at commit
Merged build finished. Test FAILed.
Test FAILed.
Merged build triggered.
Merged build started.
Test build #32076 has started for PR 5964 at commit
Test build #32076 has finished for PR 5964 at commit
Merged build finished. Test PASSed.
Test PASSed.
…e actual data is in the middle of it
…ts for the same stage
Build triggered.
Build started.
Conflicts:
  core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
  core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
Merged build finished. Test PASSed.
Test PASSed.
Conflicts:
  core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Merged build triggered.
Merged build started.
Test build #33278 has started for PR 5964 at commit
Test build #33278 has finished for PR 5964 at commit
Merged build finished. Test FAILed.
Test FAILed.
Merged build triggered.
Merged build started.
Test build #33291 has started for PR 5964 at commit
Test build #33291 has finished for PR 5964 at commit
Merged build finished. Test PASSed.
Test PASSed.
Would love to get some feedback from the scheduler maintainers: @mateiz @markhamstra @kayousterhout @pwendell
this (and everything related to partitionComputeCount) is for issues (3) & (4)
I haven't yet looked at this closely, but is it possible to split this change into different pull requests, one for each issue?
@kayousterhout I debated doing that, but I kept them together b/c my test case produces all four, so I wouldn't have passing tests unless I addressed them all. (Though I could actually relax the criteria in the test so 3 & 4 were unnecessary.) To put it another way: I'm happy to restructure this for merging, but I feel that reviewers should consider all of the issues together to properly grasp what is going wrong in the existing implementation. Again, I'd like to stress running the reproduction and seeing what is wrong (ideally adding a loop and running it many times); that is far more important than the diff for reviewing this, IMO. Though maybe this also leads to another question for reviewers -- how do you feel about that test? It's unlike our other unit tests, in that it doesn't try to very narrowly recreate one issue. Instead it simulates a workload, with some randomization. The downside of that test is that it's slow and you can't easily tease apart the different issues into separate tests. But the upside is that you actually get better coverage -- e.g., I wouldn't have discovered issue (2) without this.
all that said, I do see the value in separating them out as well, so I'll do that in addition. But I'd still like reviewers to consider this holistically :)
Mostly I care about separating (1), since it seems like its handling is totally separate from the other issues; can you put the fix for that in its own pull request (or move the other changes into their own pull request)? For that issue, it seems possible to write a narrow unit test for the specific problem (I wrote two such tests here that you should be able to mostly re-use, if you like: kayousterhout@2b7d232; the first test passes but the second one fails with the current code). I find that preferable to the end-to-end test that you wrote, since it makes it easier to debug the issue when the test fails, and the test also runs much more quickly. I'm also a little confused about how this fixes (1), as I commented on in the code.
https://issues.apache.org/jira/browse/SPARK-7308
Reproduction of multiple concurrent stage attempts, and a fix. There is a more complete discussion in the doc on JIRA. This addresses four different issues, all of which happen when there are leftover tasks from one attempt for a stage that are still running when another attempt for the stage begins.
Note that problems (3) & (4) are only partially solved here. One attempt of a task may finish after all tasks from the other stage attempt have finished, in which case we may have already started the next stage. In that case the next stage will fail, but the retry behavior should take care of it. In fact, the same thing will happen if we do nothing for problems (3) & (4), so I am actually leaning towards ignoring problems (3) & (4). I am still submitting this with some code to partially handle those problems, since I already wrote it, just to see what reviewers think.
I'll highlight which problem is being solved in various parts of the change.
I'd recommend reviewers check out a branch that has just the failure reproduction, without the fix: https://github.com/squito/spark/tree/SPARK-7308_failure_reproduction. Run that test -- even just watch the logs with

tail -f core/target/unit-tests.log | grep DAGScheduler

and you will see some really weird behavior: Stage 2 has multiple concurrent attempts, which appear to stomp all over each other; Stage 3 gets submitted before Stage 2 ever finishes, and then it rapidly fires off a bunch of attempts which all quickly die (I've seen > 50 attempts); and lots of executors continue to get lost, though the test case only simulates one executor getting lost. And though the test is contrived, we've seen this exact same behavior from customers with large clusters and real workloads.

Another thing to figure out is what to do with the unit test -- it takes a while to run, and it's randomized. The randomization was intentional while developing, since it helped discover other corner cases, but perhaps we could bring back the unit / integration test split: https://issues.apache.org/jira/browse/SPARK-4746