Conversation

@markhamstra
Contributor

… attempts for a stage

https://issues.apache.org/jira/browse/SPARK-8103

cc kayousterhout (thanks for the extra test case)

Author: Imran Rashid <[email protected]>
Author: Kay Ousterhout <[email protected]>
Author: Imran Rashid <[email protected]>

Closes #6750 from squito/SPARK-8103 and squashes the following commits:

fb3acfc [Imran Rashid] fix log msg
e01b7aa [Imran Rashid] fix some comments, style
584acd4 [Imran Rashid] simplify going from taskId to taskSetMgr
e43ac25 [Imran Rashid] Merge branch 'master' into SPARK-8103
6bc23af [Imran Rashid] update log msg
4470fa1 [Imran Rashid] rename
c04707e [Imran Rashid] style
88b61cc [Imran Rashid] add tests to make sure that TaskSchedulerImpl schedules correctly with zombie attempts
d7f1ef2 [Imran Rashid] get rid of activeTaskSets
a21c8b5 [Imran Rashid] Merge branch 'master' into SPARK-8103
906d626 [Imran Rashid] fix merge
109900e [Imran Rashid] Merge branch 'master' into SPARK-8103
c0d4d90 [Imran Rashid] Revert "Index active task sets by stage Id rather than by task set id"
f025154 [Imran Rashid] Merge pull request #2 from kayousterhout/imran_SPARK-8103
baf46e1 [Kay Ousterhout] Index active task sets by stage Id rather than by task set id
19685bb [Imran Rashid] switch to using latestInfo.attemptId, and add comments
a5f7c8c [Imran Rashid] remove comment for reviewers
227b40d [Imran Rashid] style
517b6e5 [Imran Rashid] get rid of SparkIllegalStateException
b2faef5 [Imran Rashid] faster check for conflicting task sets
6542b42 [Imran Rashid] remove extra stageAttemptId
ada7726 [Imran Rashid] reviewer feedback
d8eb202 [Imran Rashid] Merge branch 'master' into SPARK-8103
46bc26a [Imran Rashid] more cleanup of debug garbage
cb245da [Imran Rashid] finally found the issue ... clean up debug stuff
8c29707 [Imran Rashid] Merge branch 'master' into SPARK-8103
89a59b6 [Imran Rashid] more printlns ...
9601b47 [Imran Rashid] more debug printlns
ecb4e7d [Imran Rashid] debugging printlns
b6bc248 [Imran Rashid] style
55f4a94 [Imran Rashid] get rid of more random test case since kays tests are clearer
7021d28 [Imran Rashid] update test since listenerBus.waitUntilEmpty now throws an exception instead of returning a boolean
883fe49 [Kay Ousterhout] Unit tests for concurrent stages issue
6e14683 [Imran Rashid] unit test just to make sure we fail fast on concurrent attempts
06a0af6 [Imran Rashid] ignore for jenkins
c443def [Imran Rashid] better fix and simpler test case
28d70aa [Imran Rashid] wip on getting a better test case ...
a9bf31f [Imran Rashid] wip
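
For context on what the commits above actually do: the heart of the fix (see baf46e1, 88b61cc, and b2faef5) is to index active task sets by stage id and fail fast if a second non-zombie attempt for the same stage is ever submitted. Below is a minimal sketch of that check using simplified stand-in types -- not the real Spark classes -- just to illustrate the idea:

```scala
import scala.collection.mutable.HashMap

// Hypothetical, simplified stand-ins for the real Spark types.
case class TaskSet(stageId: Int, stageAttemptId: Int)

class TaskSetManager(val taskSet: TaskSet) {
  // A zombie manager belongs to a superseded attempt: it may finish its
  // running tasks, but it is never offered new ones.
  var isZombie: Boolean = false
}

class SimpleTaskScheduler {
  // Index managers by stage id, then by attempt id, so all attempts
  // for a stage can be inspected together.
  private val taskSetsByStageIdAndAttempt =
    new HashMap[Int, HashMap[Int, TaskSetManager]]

  def submitTasks(taskSet: TaskSet): Unit = synchronized {
    val stageTaskSets = taskSetsByStageIdAndAttempt
      .getOrElseUpdate(taskSet.stageId, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = new TaskSetManager(taskSet)
    // Fail fast if another live (non-zombie) attempt for this stage is
    // still active: two live attempts would race for the same partitions.
    val conflicting = stageTaskSets.exists { case (_, tsm) =>
      tsm.taskSet != taskSet && !tsm.isZombie
    }
    if (conflicting) {
      throw new IllegalStateException(
        s"more than one active task set for stage ${taskSet.stageId}")
    }
  }
}
```

The zombie flag is what lets a superseded attempt drain its in-flight tasks without competing for new work, while the conflict check turns a silent scheduling race into an immediate, diagnosable failure.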

@markhamstra
Contributor Author

@squito @kayousterhout

This applies cleanly to branch-1.4, and I'm not seeing any reason why it shouldn't be cherry-picked to that branch. I haven't looked at applying the fix to any earlier branches.

@kayousterhout
Contributor

@markhamstra I'm pretty nervous about applying this to earlier branches, because it's a relatively invasive change to the scheduler. Are you suggesting this because the underlying bug is a big issue for folks on earlier branches?

@markhamstra
Contributor Author

@kayousterhout Yup, precisely -- it can be a big problem for 1.4 users.

@kayousterhout
Contributor

But this isn't a regression from earlier Spark versions, right?

@markhamstra
Contributor Author

Nope -- it can be a big problem with earlier versions, too :)

@kayousterhout
Contributor

I'm more hesitant to backport this if we're not fixing a regression though, since this is a somewhat high-risk fix that could introduce other issues. If this was a major issue, I'd have expected a JIRA to be filed before Imran filed this last month. Were there earlier reports of this that I'm missing?

@squito what's your opinion on backporting this?

@markhamstra
Contributor Author

I think the lack of an earlier JIRA is largely a matter of users not even being able to sort out the confusing behavior well enough to file a useful bug report. After Imran's heroic effort to unwind what was going on, it is pretty clear that the prior behavior was badly broken. I know from experience that the old behavior causes significant issues in production environments, and I know that I will be using this fix in a 1.4 context even if it is not merged into the 1.4 distribution -- but I'd rather not have to maintain divergent code.

Certainly if this were a regression we'd be looking more at "must fix", but a significant problem that can be resolved without violating our versioning guarantees falls pretty clearly into "should fix", imo.

@pwendell

@kayousterhout
Contributor

Good point re: users having no idea WTF is going on.

@pwendell
Contributor

Hi all,

IMO - this patch is far too invasive to be considered for a backport. Of course, it is always a judgement call, but here is how I think about it:

  1. The DAG scheduler is by far the most brittle component in Spark in terms of having unintended consequences when we fix something. Creating a new bug in a maintenance release is the absolute worst thing we can do in terms of eroding confidence in our release process, so I take a very risk-averse approach to backports. This is also a very central component; every user workload will be affected if there is an issue we didn't anticipate.
  2. This bug is super annoying/confusing, no argument there, but it's a long-standing bug, not a regression from earlier versions of Spark.

So I think the cost/benefit here just isn't in favor of doing it.

@markhamstra
Contributor Author

Ok, I'll leave this open for about 24 hours to collect any more comments. If Patrick isn't convinced by then, I'll close this PR.

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37974 has finished for PR 7572 at commit 49d9bbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Jul 22, 2015

Agree that (1) this is not a regression, and (2) the lack of bug reports stems more from confusion than from people not encountering it. I'm nearly certain that I have run into this (or maybe it was SPARK-8029 or SPARK-5259 ... plus a little SPARK-5945 mixed in) many times in the past but was simply too confused / busy to understand what was going on, and assumed it must have been something we were doing wrong. We also have bug reports from customers who are running into this. But even sophisticated customers, ones who have made other requests that we work on SPARK-XYZ, have never realized this was the issue they were hitting.

FWIW, I'm going to be thinking very carefully about backporting that whole set of scheduler changes to versions of CDH that are based on pre-1.5. I'm definitely nervous about it, but it might be necessary just because of how important these fixes are. Frankly, the unit tests are not nearly enough to gain confidence in the scheduler, so I'll need to do a lot more testing before I'm comfortable with that. Not that it needs to affect our decision on backporting inside Spark itself; just thought it might be a useful data point.

But I suppose if I try to look at this objectively, this isn't worth breaking our policy on what we backport. Perhaps we should consider changing that release model as we get more stability, but that is a separate discussion.
