[SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN #7243
Conversation
Test build #36599 has finished for PR 7243 at commit
Some stupid merge issues in the history of this PR, but hopefully it should be fine now.
@pwendell weird, the tests are still running but there is already a FAILED note?
Test build #36609 has finished for PR 7243 at commit
jenkins test this please
Test build #36633 has finished for PR 7243 at commit
The failures (in YarnClusterSuite) are unrelated to this change and happen with a clean checkout as well.
@tgravescs please take a look, thanks.
How does this work? This method is called before the allocator field is initialized. From runDriver, for example:
// This a bit hacky, but we need to wait until the spark.driver.port property has
// been set by the Thread executing the user class.
val sc = waitForSparkContextInitialized()
// If there is no SparkContext at this point, just fail the app.
if (sc == null) {
  ...
} else {
  ...
  registerAM(rpcEnv, sc.ui.map(_.appUIAddress).getOrElse(""), securityMgr) /* this is where allocator is initialized */
  ...
}
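To make the ordering concern concrete, here is a minimal, self-contained Scala sketch (the class and method names are hypothetical stand-ins, not Spark's actual ApplicationMaster API): any callback that can fire before registerAM() has assigned the allocator field must tolerate it still being null.

// Hypothetical types for illustration only; not Spark's ApplicationMaster.
class AllocatorStub {
  def handleNodeFailure(host: String): Unit =
    println(s"notifying scheduler about failed node $host")
}

class AppMasterSketch {
  // Assigned later, by the thread that eventually runs registerAM().
  @volatile private var allocator: AllocatorStub = _

  // A callback that may be invoked before registerAM() has run.
  def onNodeFailure(host: String): Unit = {
    val a = allocator            // read the volatile field once
    if (a == null) {
      // Nothing is initialized yet, so there is nobody to notify.
      println(s"Ignoring failure of $host: allocator not initialized")
    } else {
      a.handleNodeFailure(host)
    }
  }
}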
Hmm, good point - I think there is some merge issue here, since it is working for us in cluster mode (hope it is not specific to the client case?).
Let me revisit the PR - apologies if this is a merge mess-up!
You may be lucky, because the notify above wakes up the thread that initializes the allocator. But relying on luck doesn't sound like a good solution. :-)
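If the current behavior only works because of that incidental wake-up, one way to make the handoff explicit (purely a sketch under that assumption, not code from this PR) is to gate access to the allocator on a latch that is counted down only when the allocator is actually published:

import java.util.concurrent.{CountDownLatch, TimeUnit}

// Sketch only: an explicit "allocator is ready" handoff instead of relying on
// an unrelated notify having happened first.
class AllocatorHolder[A] {
  private val ready = new CountDownLatch(1)
  @volatile private var value: Option[A] = None

  // Called once, from the thread that creates the allocator (e.g. inside registerAM()).
  def publish(a: A): Unit = {
    value = Some(a)
    ready.countDown()
  }

  // Returns None if the allocator was not published within the timeout.
  def awaitAllocator(timeoutMs: Long): Option[A] =
    if (ready.await(timeoutMs, TimeUnit.MILLISECONDS)) value else None
}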
Hopefully this fixes it? Do let me know in case I am missing something - it has been more than five releases since I last looked at the yarn module!
Nice catch, mridul. The logic looks fine other than the note from @vanzin.
What about ... You could just send a message to the driver and have ...
@vanzin The endpoint is for messages from core to yarn, if I am not wrong? Or am I missing something?
If you look at ...
Hmm, I think I will punt on fixing the client and defend against NPEs with null checks - and leave it to someone more familiar with yarn-client mode to patch the support in. Fixing the merge issue (our changes are on 1.3 and the port to master is causing the issues :-) ) and submitting momentarily.
The thing is, if you use the endpoint route, the same code would work for both client and cluster. You don't need to special case anything. You'd just send a message to the driver saying "removeExecutor" and the driver would handle it, regardless of where it is. No plumbing of the "backend" variable anywhere. Your current code is explicitly coupled to cluster mode and cannot, ever, work in client mode. So fixing this for client mode would mean reverting your changes. So why not do the final change right now?
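For what it's worth, here is a rough sketch of the message-based route being described (the message and endpoint types below are illustrative stand-ins, not Spark's actual RPC classes): the allocator only holds a reference to a driver endpoint and sends it a "remove executor" message, so the same code works whether the driver runs in the same JVM (cluster mode) or remotely (client mode).

// Illustrative only: hypothetical message and endpoint types, not Spark's API.
case class RemoveExecutor(executorId: String, reason: String)

trait DriverEndpointRef {
  def send(message: Any): Unit
}

class AllocatorSketch(driver: DriverEndpointRef) {
  // Called when YARN reports that a node (and the executors on it) was lost.
  def onContainersLost(executorIds: Seq[String], host: String): Unit =
    executorIds.foreach { id =>
      driver.send(RemoveExecutor(id, s"node $host failed"))
    }
}

In this sketch the driver-side endpoint would react to RemoveExecutor by telling the scheduler backend to forget the executor, which is why no "backend" reference needs to be plumbed into the YARN allocator at all.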
The bug was filed a month back and is a fairly critical issue, but was seeing no progress/resolution. I am perfectly fine if anyone wants to use this as a template and fix it in a more principled manner aligned with master, in which case I can close the PR.
It's ok if you don't want to work on the full fix, I'm just wary of pushing a change to master that only fixes part of the problem, when the change to fix everything would be even smaller (and, in my view, cleaner). But maybe others feel differently.
@vanzin I do not disagree with you - it would indeed be cleaner, and I do agree with your point of view. So currently this change works only for yarn cluster mode - while still of value, it would require some rework when client mode needs to be fixed. Unfortunately I am on 1.3 in cluster mode and I cannot move to master or test against that, which is why I am perfectly fine with closing this PR in case anyone wants to fix it in a more principled manner. I just don't want the issue to go unresolved before the next release (we missed the current one already).
If you give me a few days I can take over your patch and update it (just don't delete your branch). But at this moment I'm swamped with other things here... :-/
Test build #36726 has finished for PR 7243 at commit
@vanzin Oh great, that is a very generous offer!