[SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN #7243
Conversation
Test build #36599 has finished for PR 7243 at commit
Some stupid merge issues in the history of this PR, but hopefully it should be fine now.
@pwendell weird, the tests are still running but there is already a FAILED note?
Test build #36609 has finished for PR 7243 at commit
jenkins test this please
Test build #36633 has finished for PR 7243 at commit
The failures (in YarnClusterSuite) are unrelated to this change and happen with a clean checkout as well.
@tgravescs please take a look, thanks.
How does this work? This method is called before the allocator field is initialized. From runDriver, for example:
// This a bit hacky, but we need to wait until the spark.driver.port property has
// been set by the Thread executing the user class.
val sc = waitForSparkContextInitialized()
// If there is no SparkContext at this point, just fail the app.
if (sc == null) {
  ...
} else {
  ...
  registerAM(rpcEnv, sc.ui.map(_.appUIAddress).getOrElse(""), securityMgr) /* this is where allocator is initialized */
  ...
}
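To make the ordering concern concrete, here is a minimal, self-contained Scala sketch (the class and method names are hypothetical stand-ins, not Spark's actual ApplicationMaster API): any callback that can fire before registerAM() has assigned the allocator field must tolerate it still being null.

// Hypothetical types for illustration only; not Spark's ApplicationMaster.
class AllocatorStub {
  def handleNodeFailure(host: String): Unit =
    println(s"notifying scheduler about failed node $host")
}

class AppMasterSketch {
  // Assigned later, by the thread that eventually runs registerAM().
  @volatile private var allocator: AllocatorStub = _

  // A callback that may be invoked before registerAM() has run.
  def onNodeFailure(host: String): Unit = {
    val a = allocator            // read the volatile field once
    if (a == null) {
      // Nothing is initialized yet, so there is nobody to notify.
      println(s"Ignoring failure of $host: allocator not initialized")
    } else {
      a.handleNodeFailure(host)
    }
  }
}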
Hmm, good point - I think there is some merge issue here, since it is working for us in cluster mode (hope it is not specific to the client case?).
Let me revisit the PR - apologies if this is a merge mess-up!
You may be lucky, because the notify above wakes up the thread that initializes the allocator. But relying on luck doesn't sound like a good solution. :-)
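If the current behavior only works because of that incidental wake-up, one way to make the handoff explicit (purely a sketch under that assumption, not code from this PR) is to gate access to the allocator on a latch that is counted down only when the allocator is actually published:

import java.util.concurrent.{CountDownLatch, TimeUnit}

// Sketch only: an explicit "allocator is ready" handoff instead of relying on
// an unrelated notify having happened first.
class AllocatorHolder[A] {
  private val ready = new CountDownLatch(1)
  @volatile private var value: Option[A] = None

  // Called once, from the thread that creates the allocator (e.g. inside registerAM()).
  def publish(a: A): Unit = {
    value = Some(a)
    ready.countDown()
  }

  // Returns None if the allocator was not published within the timeout.
  def awaitAllocator(timeoutMs: Long): Option[A] =
    if (ready.await(timeoutMs, TimeUnit.MILLISECONDS)) value else None
}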
Hopefully this fixes it? Do let me know in case I am missing something - it has been more than five releases since I last looked at the yarn module!
Nice catch, mridul. The logic looks fine other than the note from @vanzin.
What about ... You could just send a message to the driver and have ...
@vanzin The endpoint is for messages from core to yarn, if I am not wrong? Or am I missing something?
If you look at ...
Hmm, I think I will punt on fixing the client and defend against NPEs with null checks - and leave it to someone more familiar with yarn-client mode to patch the support in. Fixing the merge issue (our changes are on 1.3 and the port to master is causing the issues :-) ) and submitting momentarily.
The thing is, if you use the endpoint route, the same code would work for both client and cluster. You don't need to special case anything. You'd just send a message to the driver saying "removeExecutor" and the driver would handle it, regardless of where it is. No plumbing of the "backend" variable anywhere. Your current code is explicitly coupled to cluster mode and cannot, ever, work in client mode. So fixing this for client mode would mean reverting your changes. So why not do the final change right now?
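For what it's worth, here is a rough sketch of the message-based route being described (the message and endpoint types below are illustrative stand-ins, not Spark's actual RPC classes): the allocator only holds a reference to a driver endpoint and sends it a "remove executor" message, so the same code works whether the driver runs in the same JVM (cluster mode) or remotely (client mode).

// Illustrative only: hypothetical message and endpoint types, not Spark's API.
case class RemoveExecutor(executorId: String, reason: String)

trait DriverEndpointRef {
  def send(message: Any): Unit
}

class AllocatorSketch(driver: DriverEndpointRef) {
  // Called when YARN reports that a node (and the executors on it) was lost.
  def onContainersLost(executorIds: Seq[String], host: String): Unit =
    executorIds.foreach { id =>
      driver.send(RemoveExecutor(id, s"node $host failed"))
    }
}

In this sketch the driver-side endpoint would react to RemoveExecutor by telling the scheduler backend to forget the executor, which is why no "backend" reference needs to be plumbed into the YARN allocator at all.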
The bug was filed a month back and is a fairly critical issue, but was seeing no progress/resolution. I am perfectly fine if anyone wants to use this as a template and fix it in a more principled manner aligned with master, in which case I can close the PR.
It's ok if you don't want to work on the full fix, I'm just wary of pushing a change to master that only fixes part of the problem, when the change to fix everything would be even smaller (and, in my view, cleaner). But maybe others feel differently.
@vanzin I do not disagree with you - it would indeed be cleaner, and I do agree with your point of view. So currently this change works only for yarn cluster mode - while still of value, it would require some rework when client mode needs to be fixed. Unfortunately I am on 1.3 in cluster mode and I cannot move to master or test against that, which is why I am perfectly fine with closing this PR in case anyone wants to fix it in a more principled manner. I just don't want the issue to go unresolved before the next release (we missed the current one already).
If you give me a few days I can take over your patch and update it (just don't delete your branch). But at this moment I'm swamped with other things here... :-/
Test build #36726 has finished for PR 7243 at commit
@vanzin Oh great, that is a very generous offer!