[SPARK-10582][Yarn][Core] Fix AM failure situation for dynamic allocation #9963
Test build #46674 has finished for PR 9963 at commit
retest this please
Test build #46695 has finished for PR 9963 at commit
"where registered again" -> "when the AM re-registers after a failure".
1f92d27 to 33c235e
Test build #47051 has finished for PR 9963 at commit
"Stale", missing period.
4e413b1 to b432968
Test build #47191 has finished for PR 9963 at commit
The name of this variable is misleading; it sounds like it's tracking whether there is an AM registered, but it's tracking whether an AM has ever been registered, even if currently there's no AM.
So the name should be something like "shouldResetOnAmRegistration" or something less verbose... which raises the question, could you just reset also on the first registration? What ill side-effects could that cause?
Sure, I will fix it. My guess is that it should be OK to call reset on the first registration.
Hi @vanzin , from my test and understanding so far, I think calling
Test build #47256 has finished for PR 9963 at commit
nit: "this can only happen".
I'll fix during merge.
Merging to master.
…estart situation

## What changes were proposed in this pull request?

This is a follow-up fix to #9963. In #9963 the stale-state clean-up was handled only for the scenario where dynamic allocation is enabled. The state in `CoarseGrainedSchedulerBackend` should also be cleaned up when dynamic allocation is disabled. Please review, CC andrewor14 lianhuiwang, thanks a lot.

## How was this patch tested?

Ran the unit tests locally, and ran integration tests manually.

Author: jerryshao <[email protected]>

Closes #11366 from jerryshao/SPARK-13447.
Because of an AM failure, the target executor number held by the driver and by the AM can differ, which leads to unexpected behavior in dynamic allocation. So when the AM re-registers with the driver, the state in `ExecutorAllocationManager` and `CoarseGrainedSchedulerBackend` should be reset.
This issue was originally addressed in #8737 and is re-opened here. Thanks a lot @KaiXinXiaoLei for finding this issue.
@andrewor14 and @vanzin would you please help to review this, thanks a lot.
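To illustrate the idea in the description, here is a minimal, self-contained sketch of resetting driver-side state when an AM re-registers. It is not the code in this patch and does not use Spark's real classes: `DriverSideState`, `onAmRegistered`, `requestTotal`, and `markPendingRemoval` are hypothetical names, and the flag borrows the `shouldResetOnAmRegistration` name suggested in the review purely for illustration.

```scala
// Hypothetical, simplified model of the driver-side bookkeeping.
// None of these names are Spark APIs; they only illustrate the reset idea.
final class DriverSideState(initialTarget: Int) {
  private var targetExecutors: Int = initialTarget
  private var pendingToRemove: Set[String] = Set.empty
  // False until the first AM has registered; true afterwards.
  private var shouldResetOnAmRegistration: Boolean = false

  def requestTotal(n: Int): Unit = { targetExecutors = n }
  def markPendingRemoval(id: String): Unit = { pendingToRemove += id }
  def target: Int = targetExecutors
  def pending: Set[String] = pendingToRemove

  /** Called every time an AM registers with the driver. */
  def onAmRegistered(): Unit = {
    if (shouldResetOnAmRegistration) {
      // A restarted AM knows nothing about earlier requests, so the
      // driver drops its stale view and starts from the initial target.
      targetExecutors = initialTarget
      pendingToRemove = Set.empty
    } else {
      // First registration: nothing stale yet, just arm the flag.
      shouldResetOnAmRegistration = true
    }
  }
}

object AmReRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val state = new DriverSideState(initialTarget = 2)
    state.onAmRegistered()            // first AM registers
    state.requestTotal(10)            // driver asks for more executors
    state.markPendingRemoval("exec-1")
    state.onAmRegistered()            // AM failed and a new one registers
    println(s"target=${state.target}, pending=${state.pending.size}")
    // prints "target=2, pending=0"
  }
}
```

In this sketch the first registration only arms the flag and later registrations clear the state. Resetting unconditionally, as raised in the review, would remove the flag at the cost of also resetting on the very first registration.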