[SPARK-10582][Yarn][Core] Fix AM failure situation for dynamic allocation #9963

jerryshao · 2015-11-25T06:17:14Z

After an AM failure, the target executor number held by the driver and by the AM can differ, which leads to unexpected behavior in dynamic allocation. So when the AM re-registers with the driver, the state in ExecutorAllocationManager and CoarseGrainedSchedulerBackend should be reset.
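
As a rough illustration of the mismatch described above, here is a minimal toy model in Scala. It is not Spark code: `ToyAllocationManager`, `ToyApplicationMaster`, and their members are hypothetical names, and it assumes that a restarted AM starts again from the initially configured target while the driver still holds whatever target dynamic allocation had reached.

```scala
// Toy model only; none of these classes exist in Spark.

// Driver-side bookkeeping of how many executors are wanted.
final class ToyAllocationManager(initialTarget: Int) {
  private var target: Int = initialTarget

  // Dynamic allocation raises the target as load grows.
  def requestExecutors(extra: Int): Unit = { target += extra }

  def currentTarget: Int = target

  // Forget accumulated state and return to the initial target.
  def reset(): Unit = { target = initialTarget }
}

// Assumption: a freshly started AM only knows the initial target.
final class ToyApplicationMaster(val target: Int)

object ResetSketch {
  def main(args: Array[String]): Unit = {
    val driver = new ToyAllocationManager(initialTarget = 2)
    driver.requestExecutors(3)

    // The old AM fails; its replacement starts from the initial target.
    val restartedAm = new ToyApplicationMaster(target = 2)
    println(s"before reset: driver=${driver.currentTarget}, am=${restartedAm.target}")

    // Resetting on re-registration brings the two sides back in line.
    driver.reset()
    println(s"after reset: driver=${driver.currentTarget}, am=${restartedAm.target}")
  }
}
```

In this sketch the two sides disagree until `reset()` runs, which is the situation the PR aims to avoid.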

This issue was originally addressed in #8737; this PR reopens it. Thanks a lot @KaiXinXiaoLei for finding this issue.

@andrewor14 and @vanzin, would you please help review this? Thanks a lot.

SparkQA · 2015-11-25T07:59:35Z

Test build #46674 has finished for PR 9963 at commit 1f92d27.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * case class ReregisterClusterManager(am: RpcEndpointRef) extends CoarseGrainedClusterMessage

vanzin · 2015-11-25T17:55:41Z

retest this please

SparkQA · 2015-11-25T20:08:24Z

Test build #46695 has finished for PR 9963 at commit 1f92d27.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * case class ReregisterClusterManager(am: RpcEndpointRef) extends CoarseGrainedClusterMessage

vanzin · 2015-12-01T23:39:50Z

core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala

"where registered again" -> "when the AM re-registers after a failure".

SparkQA · 2015-12-02T11:21:55Z

Test build #47051 has finished for PR 9963 at commit 4e413b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2015-12-02T19:09:18Z

core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala

"Stale", missing period.

SparkQA · 2015-12-04T08:47:16Z

Test build #47191 has finished for PR 9963 at commit b432968.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2015-12-05T00:17:17Z

yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala

The name of this variable is misleading; it sounds like it's tracking whether there is an AM registered, but it's tracking whether an AM has ever been registered, even if currently there's no AM.

So the name should be something like "shouldResetOnAmRegistration" or something less verbose... which raises the question, could you just reset also on the first registration? What ill side-effects could that cause?

Sure, I will fix it. My guess is that it should be OK to call reset on the first registration.

jerryshao · 2015-12-07T07:20:55Z

Hi @vanzin, from my tests and understanding so far, calling reset() on the first registration should be OK. Still, I have not changed it that way; I prefer to add an explicit flag indicating whether the SchedulerBackend should be reset. Please review and suggest, thanks a lot.
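
The explicit-flag approach discussed here could look roughly like the following toy sketch. It is not the actual YarnSchedulerBackend code: `ToyYarnBackend`, `onAmRegistered`, and `resetCount` are hypothetical, and the flag name simply borrows the "shouldResetOnAmRegistration" suggestion from the review.

```scala
// Toy model only; this is not the real YarnSchedulerBackend.
final class ToyYarnBackend {
  // False until the first AM registers; true from then on.
  private var shouldResetOnAmRegistration = false
  private var resets = 0

  def resetCount: Int = resets

  // Stand-in for clearing stale scheduler state.
  private def reset(): Unit = { resets += 1 }

  // Called each time an AM registers with the driver.
  def onAmRegistered(): Unit = {
    if (!shouldResetOnAmRegistration) {
      // First registration: nothing is stale yet, so only arm the flag.
      shouldResetOnAmRegistration = true
    } else {
      // An AM registered before, so this one is assumed to replace a failed AM.
      reset()
    }
  }
}

object FlagSketch {
  def main(args: Array[String]): Unit = {
    val backend = new ToyYarnBackend
    backend.onAmRegistered() // initial AM
    backend.onAmRegistered() // AM restarted after a failure
    println(s"resets performed: ${backend.resetCount}")
  }
}
```

Resetting unconditionally on the first registration would make the flag unnecessary; the thread suggests that is probably safe, but the author preferred the explicit flag.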

SparkQA · 2015-12-07T09:21:29Z

Test build #47256 has finished for PR 9963 at commit 0e1d796.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2015-12-09T17:48:48Z

yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala

nit: "this can only happen".

I'll fix during merge.

vanzin · 2015-12-09T17:49:11Z

Merging to master.

…estart situation

## What changes were proposed in this pull request?

This is a follow-up fix of #9963. In #9963 the stale state was cleaned up only when dynamic allocation is enabled. Here we should also clean the state in `CoarseGrainedSchedulerBackend` when dynamic allocation is disabled.

Please review, CC andrewor14 lianhuiwang, thanks a lot.

## How was this patch tested?

Ran the unit tests locally, and also ran integration tests manually.

Author: jerryshao <[email protected]>

Closes #11366 from jerryshao/SPARK-13447.

vanzin reviewed Dec 1, 2015

jerryshao force-pushed the SPARK-10582 branch from 1f92d27 to 33c235e Compare December 2, 2015 09:02

vanzin reviewed Dec 2, 2015

jerryshao added 5 commits December 4, 2015 14:43

Fix AM failure situation for dynamic allocation

1ab35c0

Remove unnecessary code

1bd306d

Address the comments

929cc70

Some words change

cc7cc5d

Style fix

b432968

jerryshao force-pushed the SPARK-10582 branch from 4e413b1 to b432968 Compare December 4, 2015 06:54

vanzin reviewed Dec 5, 2015

Address the comments to change the variable name

0e1d796

vanzin reviewed Dec 9, 2015

asfgit closed this in 6900f01 Dec 9, 2015

andrewor14 mentioned this pull request Feb 19, 2016

[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n… #10794

Closed

jerryshao mentioned this pull request Feb 25, 2016

[SPARK-13447][Yarn][Core] Clean the stale states for AM failure and restart situation #11366

Closed

viirya mentioned this pull request Oct 19, 2016

[SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset #15481

Closed

3 participants