Skip to content

Conversation

@jerryshao
Copy link
Contributor

Because of AM failure, the target executor number between driver and AM will be different, which will lead to unexpected behavior in dynamic allocation. So when AM is re-registered with driver, state in ExecutorAllocationManager and CoarseGrainedSchedulerBacked should be reset.

This issue is originally addressed in #8737 , here re-opened again. Thanks a lot @KaiXinXiaoLei for finding this issue.

@andrewor14 and @vanzin would you please help to review this, thanks a lot.

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #46674 has finished for PR 9963 at commit 1f92d27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class ReregisterClusterManager(am: RpcEndpointRef) extends CoarseGrainedClusterMessage\n

@vanzin
Copy link
Contributor

vanzin commented Nov 25, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #46695 has finished for PR 9963 at commit 1f92d27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class ReregisterClusterManager(am: RpcEndpointRef) extends CoarseGrainedClusterMessage\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"where registered again" -> "when the AM re-registers after a failure".

@SparkQA
Copy link

SparkQA commented Dec 2, 2015

Test build #47051 has finished for PR 9963 at commit 4e413b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Stale", missing period.

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47191 has finished for PR 9963 at commit b432968.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this variable is misleading; it sounds like it's tracking whether there is an AM registered, but it's tracking whether an AM has ever been registered, even if currently there's no AM.

So the name should be something like "shouldResetOnAmRegistration" or something less verbose... which raises the question, could you just reset also on the first registration? What ill side-effects could that cause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will fix it. From my guessing, I think it should be OK to call reset on the first registration.

@jerryshao
Copy link
Contributor Author

Hi @vanzin , from my test and understanding so far, I think calling reset() on the first registration should be OK. But here I still don't change to that way, for me I prefer to add a explicit flag to mention whether SchedulerBackend should be reset or not. Please review and suggest, thanks a lot.

@SparkQA
Copy link

SparkQA commented Dec 7, 2015

Test build #47256 has finished for PR 9963 at commit 0e1d796.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "this can only happen".

I'll fix during merge.

@vanzin
Copy link
Contributor

vanzin commented Dec 9, 2015

Merging to master.

@asfgit asfgit closed this in 6900f01 Dec 9, 2015
asfgit pushed a commit that referenced this pull request Mar 29, 2016
…estart situation

## What changes were proposed in this pull request?

This is a follow-up fix of #9963, in #9963 we handle this stale states clean-up work only for dynamic allocation enabled scenario. Here we should also clean the states in `CoarseGrainedSchedulerBackend` for dynamic allocation disabled scenario.

Please review, CC andrewor14 lianhuiwang , thanks a lot.

## How was this patch tested?

Run the unit test locally, also with integration test manually.

Author: jerryshao <[email protected]>

Closes #11366 from jerryshao/SPARK-13447.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants