[SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN #7431
Conversation
This is an updated version of #7243, which also works in yarn-client mode. /cc @mridulm @tgravescs. I ran a bunch of tests in yarn-client and yarn-cluster modes, but I don't have a test that specifically exercises the bug, so take that with a grain of salt.
Looks good to me! If Tom does not come back with objections, I am +1 on this. Thanks for patching this in, @vanzin!
Test build #37412 has finished for PR 7431 at commit
The changes look good. To manually test this, can't you just remove/kill a NodeManager that has an executor on it? Perhaps there is a race between when YARN tells us and when we recognize it? If you don't have a cluster, I can try it out later today.
I can try it out (I'll also add some logging in my internal build to make sure the code is actually being exercised).
So now that I've tried the new code path (which works), I'm a little skeptical that sending a message back to the driver is really needed. The driver already removes the executor when the RPC connection is reset, so the new message ends up being a no-op. So I'm a little confused about how this change is fixing anything. The bug talks about "repeated re-execution of stages"; isn't that the correct way of handling executor failures? You retry tasks or stages depending on what the failure is. Perhaps the real issue you ran into is something like #6750 instead?
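For reference, a minimal sketch of the disconnect-based path described above, loosely modeled on Spark's CoarseGrainedSchedulerBackend; the names and signatures here are assumptions for illustration, not the actual code:

```scala
// Hedged sketch of the driver-side removal path: when an executor's RPC
// connection drops, the driver endpoint removes the executor on its own,
// which is why a later "remove executor" message can arrive as a no-op.
// RpcAddress is a stand-in for Spark's internal type; all names here are
// assumptions for illustration.
case class RpcAddress(host: String, port: Int)

class DriverEndpointSketch {
  // RPC address -> executor id, maintained as executors register
  private val addressToExecutorId =
    scala.collection.mutable.HashMap[RpcAddress, String]()

  def registerExecutor(address: RpcAddress, executorId: String): Unit =
    addressToExecutorId(address) = executorId

  // Invoked by the RPC layer when a remote connection is reset.
  def onDisconnected(remoteAddress: RpcAddress): Unit =
    addressToExecutorId.remove(remoteAddress).foreach { executorId =>
      removeExecutor(executorId, "Remote RPC client disassociated")
    }

  private def removeExecutor(executorId: String, reason: String): Unit = {
    // In Spark this frees resources and tells the task scheduler the
    // executor is lost; here it just records the event.
    println(s"Removing executor $executorId: $reason")
  }
}
```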
(BTW, to test this, killing the NM is not enough, since that doesn't seem to cause child containers to be killed.)
Aren't we already inside an `if (!alreadyReleased)` up there? If I'm not wrong, we don't need to check this again here.
No, it comes out of the first `if (!alreadyReleased)` at line 423 of the diff (418 of the original) if you expand. The diff here just makes it look like it is nested.
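For readers without the diff open, a hedged sketch of the control-flow shape being discussed; only the block structure reflects the point made here, and all names and bodies are assumptions:

```scala
// Simplified sketch of the control flow under discussion in YarnAllocator's
// completed-container handling. The first `if (!alreadyReleased)` block
// closes before the later one starts, so the second check is not nested,
// even though a collapsed diff can suggest it is.
def processCompletedContainer(containerId: String,
                              releasedContainers: Set[String]): Unit = {
  val alreadyReleased = releasedContainers.contains(containerId)

  if (!alreadyReleased) {
    // First block: classify the exit status and log why the container died.
    println(s"Container $containerId completed unexpectedly")
  } // control flow "comes out of" the first block here

  // ... bookkeeping that runs for released and unreleased containers alike ...

  if (!alreadyReleased) {
    // Second block, at the same level: only unreleased (unexpected) exits
    // should trigger further cleanup.
    println(s"Cleaning up after container $containerId")
  }
}
```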
Yeah, it seems like a race condition between the two detection code paths. If they are both guaranteed to be called, then we're always going to end up with that error message.
The linked issue has more information. (The heartbeat timeouts are increased due to GC issues we see in Spark.) I'm guessing there might be situations where it doesn't recognize the disconnect right away. Either way, I don't think this change hurts anything, and it covers any weird cases that might happen. The error message, I assume, only happens in the few cases when you hit the exact race; otherwise I would expect one side or the other to win. Generally I would expect the Spark scheduler to recognize the disconnect first, causing the executor to be removed from our list, so this code wouldn't even be called. I didn't verify that in the code, though. @vanzin, do you happen to know if that is true? Otherwise I'll check tomorrow.
That's the only case in which this code might help; but does that really happen? The executor process is already gone (otherwise YARN wouldn't notify the AM). The disconnection message has to go through the Akka queues to get to the destination, but so do the messages being sent here. So with this code, unless the disconnect message can somehow be lost, you'll always end up with that error message.
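To make the race concrete: both detection paths funnel into one driver-side removal routine, so whichever notification arrives second finds the executor already gone. A hedged sketch, with names assumed for illustration:

```scala
// Hedged sketch of the race discussed above: the RPC-disconnect path and the
// AM's "container completed" message both end up calling the same removal
// routine, so the second caller sees an already-removed executor. Names and
// structure are assumptions for illustration.
class ExecutorTrackerSketch {
  private val executors = scala.collection.mutable.HashSet[String]()

  def addExecutor(executorId: String): Unit = executors += executorId

  def removeExecutor(executorId: String, reason: String): Unit = {
    if (executors.remove(executorId)) {
      println(s"Removed executor $executorId: $reason")
    } else {
      // Whichever notification loses the race lands here; this is the log
      // line the thread suggests demoting, since hitting it is expected.
      println(s"Asked to remove non-existent executor $executorId")
    }
  }
}
```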
That's not necessarily true. If an NM goes away for long enough, the RM will assume all containers on it are gone too, so you would get the notification saying they are gone even if they are still running. If that executor was somehow in a weird state, we would want it removed.
Feels to me like Spark's own heartbeat would take care of that failure mode. The NM being in a bad state does not mean the executor is in a bad state. I'm just trying to understand why the change is needed; I haven't seen anything that really requires it yet. It seems like other safeguards in Spark already take care of the failure modes that have been identified so far. I'm not against adding it either, but if we do, perhaps we should demote that log message to be less scary (ERROR is too strong when we expect it to happen).
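As a reference point for the heartbeat mechanism mentioned here, a minimal sketch assuming a standard SparkConf setup; the two config keys are real Spark settings, but the values shown are examples only, not recommendations:

```scala
import org.apache.spark.SparkConf

// Heartbeat-related settings referenced above. The config keys exist in
// Spark; the values are illustrative examples, not recommendations.
val conf = new SparkConf()
  // How often each executor heartbeats to the driver.
  .set("spark.executor.heartbeatInterval", "10s")
  // Default timeout after which an unresponsive executor is considered
  // lost; clusters with long GC pauses (as mentioned above) may raise it.
  .set("spark.network.timeout", "240s")
```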
It doesn't necessarily mean the executor is in a bad state, but it could be. I've seen very weird things happen on machines before, so personally I don't think it hurts to put this in, and it covers those weird cases that could come up. I'm fine with changing the log message to be less scary.
Test build #37851 has finished for PR 7431 at commit
Jenkins retest this please.
Test build #37967 has finished for PR 7431 at commit
Test build #44 has finished for PR 7431 at commit
Jenkins retest this please.
Test build #38233 has finished for PR 7431 at commit
Test build #81 has finished for PR 7431 at commit
The last change was just a log-level change, so this LGTM.
Conflicts: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
Jenkins retest this please.
Test build #139 has finished for PR 7431 at commit
Test build #38747 has finished for PR 7431 at commit
Jenkins retest this please.
Test build #38772 has finished for PR 7431 at commit
Test build #142 has finished for PR 7431 at commit
Test build #38794 has finished for PR 7431 at commit
Test build #38900 has finished for PR 7431 at commit
Merged to master.
[SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN

This change adds code to notify the scheduler backend when a container dies in YARN.

Author: Mridul Muralidharan <[email protected]>
Author: Marcelo Vanzin <[email protected]>

Closes #7431 from vanzin/SPARK-8297 and squashes the following commits:

471e4a0 [Marcelo Vanzin] Fix unit test after merge.
d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
537da6f [Marcelo Vanzin] Make an expected log less scary.
04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request.
8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297
687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master
9218fcc [Mridul Muralidharan] Fix failing testcase
362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
9ee1307 [Mridul Muralidharan] Fix SPARK-8297
a3a0f01 [Mridul Muralidharan] Fix SPARK-8297
This change adds code to notify the scheduler backend when a container dies in YARN.
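In outline, the mechanism this change introduces looks roughly like the sketch below. It is a simplification, not the actual patch: the RemoveExecutor message name is modeled on Spark's internal scheduler messages, and everything else here is assumed for illustration.

```scala
// Hedged sketch of the AM -> driver notification this PR adds: when the RM
// reports a completed container, the allocator tells the driver's scheduler
// backend directly instead of relying only on the driver noticing the lost
// RPC connection. All names and shapes are assumptions for illustration.
case class RemoveExecutor(executorId: String, reason: String)

trait DriverRef {
  def send(message: Any): Unit
}

class YarnAllocatorSketch(driverRef: DriverRef) {
  // container id -> executor id, maintained as containers are launched
  private val containerToExecutor =
    scala.collection.mutable.HashMap[String, String]()

  def registerContainer(containerId: String, executorId: String): Unit =
    containerToExecutor(containerId) = executorId

  // Called with each batch of completed containers from the RM heartbeat.
  def onContainersCompleted(completed: Seq[(String, Int)]): Unit = {
    for ((containerId, exitStatus) <- completed) {
      containerToExecutor.remove(containerId).foreach { executorId =>
        // Notify the driver so the scheduler backend cleans up even when the
        // RPC disconnect is never observed (e.g. a node that silently died).
        driverRef.send(RemoveExecutor(
          executorId, s"Container $containerId exited with status $exitStatus"))
      }
    }
  }
}
```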