
Conversation

@vanzin (Contributor) commented Jul 15, 2015

This change adds code to notify the scheduler backend when a container dies in YARN.
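
For readers skimming the thread, a rough, self-contained sketch of the general idea follows (the names AllocatorSketch, DriverRef, and RemoveExecutor are illustrative assumptions, not the classes the patch actually touches): when YARN reports a completed container, the AM-side allocator looks up the executor that was running in it and proactively sends a remove request to the driver's scheduler backend.

```scala
// Illustrative sketch only; hypothetical names, not the real Spark classes.
case class RemoveExecutor(executorId: String, reason: String)

trait DriverRef {
  def send(msg: Any): Unit
}

class AllocatorSketch(driver: DriverRef) {
  // Maps YARN container ids to the executor ids started in them.
  private val containerToExecutor = scala.collection.mutable.Map[String, String]()

  def registerContainer(containerId: String, executorId: String): Unit =
    containerToExecutor(containerId) = executorId

  // Called when the ResourceManager reports completed (dead) containers.
  def onContainersCompleted(completedContainerIds: Seq[String]): Unit = {
    completedContainerIds.foreach { cid =>
      containerToExecutor.remove(cid).foreach { execId =>
        // Proactively tell the driver's scheduler backend that the executor is
        // gone, rather than waiting for the RPC disconnect to be noticed.
        driver.send(RemoveExecutor(execId, s"Container $cid exited"))
      }
    }
  }
}
```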

@vanzin (Contributor, Author) commented Jul 15, 2015

This is an updated version of #7243, which also works in yarn-client mode.

/cc @mridulm @tgravescs

I ran a bunch of tests on yarn client and cluster modes, but I don't have a test that specifically exercises the bug, so take that with a grain of salt.

@mridulm (Contributor) commented Jul 15, 2015

Looks good to me! If Tom does not come back with objections, I am +1 on this.

Thanks for patching this in, @vanzin!

@SparkQA commented Jul 15, 2015

Test build #37412 has finished for PR 7431 at commit 04dc112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Pmod(left: Expression, right: Expression) extends BinaryArithmetic
    • final class SpecificRow extends $

@tgravescs (Contributor)

The changes look good. To manually test this, can't you just remove/kill a NodeManager that has an executor on it? Perhaps there is a race between when YARN tells us and when we recognize the failure ourselves?

If you don't have a cluster I can try it out later today.

@vanzin (Contributor, Author) commented Jul 16, 2015

I can try it out (I'll also add some logging in my internal build to make sure the code is actually being exercised).

@vanzin (Contributor, Author) commented Jul 16, 2015

So now that I tried the new code path (which works), I'm a little skeptical that sending a message back to the driver is really needed. The driver already removes the executor when the RPC connection is reset:

15/07/16 12:30:15 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 4, vanzin-st1-3.vpc.cloudera.com): ExecutorLostFailure (executor 3 lost)
15/07/16 12:30:15 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:36279] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
15/07/16 12:30:15 INFO DAGScheduler: Executor lost: 3 (epoch 0)
15/07/16 12:30:15 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
15/07/16 12:30:15 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, vanzin-st1-3.vpc.cloudera.com, 37469)
15/07/16 12:30:15 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor

The new message ends up being a no-op:

15/07/16 12:30:18 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 3

See CoarseGrainedSchedulerBackend::DriverEndpoint::removeExecutor.
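
For context, here is a minimal sketch of why that second removal ends up a no-op, assuming the backend tracks live executors in a simple map (this illustrates the pattern only, not the actual DriverEndpoint implementation): once the disconnect handler has removed the entry, a later remove request for the same id falls into the "non-existent executor" branch.

```scala
// Sketch of the pattern only; not the real CoarseGrainedSchedulerBackend code.
class DriverEndpointSketch {
  private val executors = scala.collection.mutable.Map[String, String]() // id -> host

  def registerExecutor(id: String, host: String): Unit = executors(id) = host

  def removeExecutor(id: String, reason: String): Unit = executors.remove(id) match {
    case Some(host) =>
      println(s"Removed executor $id on $host ($reason)")
    case None =>
      // This is the branch behind the ERROR above: the disconnect handler has
      // already removed the entry, so the message sent from the YARN allocator
      // finds nothing left to clean up.
      println(s"Asked to remove non-existent executor $id")
  }
}
```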

So I'm a little confused about how this change is fixing anything. The bug talks about "repeated re-execution of stages" - isn't that the correct way of handling executor failures? You retry tasks or stages depending on what the failure is.

Perhaps the real issue you ran into is something like #6750 instead?

@vanzin (Contributor, Author) commented Jul 16, 2015

(BTW to test this, killing the NM is not enough, since that doesn't seem to cause child containers to be killed.)

Contributor (review comment on the diff)

Aren't we already inside an if (!alreadyReleased) block up there? If I'm not wrong, we don't need to check this here again.

Contributor (review comment on the diff)

No, it comes out of the first if (!alreadyReleased) at line 423 of the diff (418 of the original) if you expand the context. The diff here just makes it look like the check is still inside that block.
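
To illustrate what this exchange is getting at, here is a hypothetical sketch of the shape being discussed (made-up code, not the actual YarnAllocator diff): the new notification sits after the first if (!alreadyReleased) block has already closed, so the flag has to be consulted again, and only the diff's narrow context made it look nested.

```scala
// Hypothetical shape of the code under discussion; not the real YarnAllocator.
object AlreadyReleasedSketch {
  def processCompletedContainer(alreadyReleased: Boolean, executorId: String): Unit = {
    if (!alreadyReleased) {
      // ... bookkeeping for containers Spark did not release itself ...
    }

    // The first block has closed by this point, so the flag is checked again
    // before notifying the driver.
    if (!alreadyReleased) {
      notifyDriver(executorId)
    }
  }

  // Placeholder so the sketch is self-contained.
  private def notifyDriver(executorId: String): Unit = ()
}
```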

@andrewor14 (Contributor)

Yeah, it seems like a race condition between the two detection code paths. If they are both guaranteed to be called, then we're always going to end up with the ERROR: Asked to kill non-existent executor. Is there a case where the disconnection detection fails?

@tgravescs (Contributor)

https://issues.apache.org/jira/browse/SPARK-8297?focusedCommentId=14583560&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14583560

has more information. (The heartbeat timeout is increased due to GC issues we see in Spark.)

I'm guessing there might be situations where it doesn't recognize the disconnect right away. Either way, I don't think this change hurts anything, and it covers any weird cases that might happen. The error message, I assume, only happens in the few cases where you hit the exact race; otherwise I would expect one side or the other to win. Generally I would expect the Spark scheduler to recognize the disconnect first, causing the executor to be removed from our list so we don't even make the call. I didn't verify that in the code, though. @vanzin, do you happen to know if that is true? Otherwise I'll check tomorrow.

@vanzin (Contributor, Author) commented Jul 17, 2015

I'm guessing there might be situations where it doesn't recognize the disconnect right away.

That's the only case in which this code might help; but does that really happen? The executor process is already gone (otherwise YARN wouldn't notify the AM). The disconnection message has to go through the Akka queues to get to the destination, but so do the messages being sent here.

So with this code, unless the disconnect message can somehow be lost, you'll always end up with that ERROR log, because the two code paths are sort of redundant.

@tgravescs (Contributor)

That's not necessarily true. If an NM goes away for long enough, the RM will assume all containers on it are gone too, so you would get the notification saying they are gone even if they are still running. If that executor was somehow in a weird state, we would want it removed.

@vanzin (Contributor, Author) commented Jul 17, 2015

If that executor was somehow in a weird state, we would want it removed.

Feels to me like Spark's own heartbeat would take care of that failure mode. The NM being in a bad state does not mean the executor is in a bad state.

I'm just trying to understand why the change is needed. I haven't seen anything that really requires it yet; it seems like other safeguards in Spark already take care of the failure modes that have been identified so far. I'm also not against adding it, but if we do, perhaps we should demote that log message to be less scary (ERROR is too strong when we expect it to happen).
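
For what it's worth, the demotion being suggested is essentially a one-line change; a rough sketch of the idea (using java.util.logging only to keep the snippet self-contained; the real code would use Spark's own Logging trait, and the level actually chosen in the patch may differ):

```scala
import java.util.logging.{Level, Logger}

// Sketch of "make the expected log less scary"; illustrative only.
object RemoveExecutorLoggingSketch {
  private val log = Logger.getLogger("scheduler")

  def onRemoveRequest(executorId: String, known: Boolean): Unit = {
    if (known) {
      log.info(s"Removing executor $executorId")
    } else {
      // Previously logged as an ERROR; since we expect to hit this whenever the
      // disconnect path wins the race, a lower level reads less alarming.
      log.log(Level.INFO, s"Asked to remove non-existent executor $executorId")
    }
  }
}
```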

@tgravescs (Contributor)

It doesn't necessarily mean the executor is in a bad state, but it could be. I've seen very weird things happen on machines before, so personally I don't think it hurts to put this in, and it covers those weird cases that could come up. I'm fine with changing the log message to be less scary.

@SparkQA commented Jul 20, 2015

Test build #37851 has finished for PR 7431 at commit 537da6f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 21, 2015

Jenkins retest this please.

@SparkQA commented Jul 21, 2015

Test build #37967 has finished for PR 7431 at commit 537da6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 21, 2015

Test build #44 has finished for PR 7431 at commit 537da6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 23, 2015

Jenkins retest this please.

@SparkQA commented Jul 23, 2015

Test build #38233 has finished for PR 7431 at commit 537da6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2015

Test build #81 has finished for PR 7431 at commit 537da6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

The last change was just a log level change, so this LGTM.

Merge commit noting conflicts in:
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@vanzin (Contributor, Author) commented Jul 28, 2015

Jenkins retest this please.

@SparkQA commented Jul 28, 2015

Test build #139 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Abs(child: Expression) extends UnaryExpression with ExpectsInputTypes

@SparkQA commented Jul 28, 2015

Test build #38747 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 28, 2015

Jenkins retest this please.

@SparkQA commented Jul 29, 2015

Test build #38772 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2015

Test build #142 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public static class StructWriter
    • case class Abs(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class CreateStructUnsafe(children: Seq[Expression]) extends Expression
    • case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression
    • case class TungstenProject(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode

@SparkQA commented Jul 29, 2015

Test build #38794 has finished for PR 7431 at commit d4adf4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2015

Test build #38900 has finished for PR 7431 at commit 471e4a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 30, 2015

Merged to master.

@vanzin closed this Jul 30, 2015
@vanzin deleted the SPARK-8297 branch on July 30, 2015 at 17:39
asfgit pushed a commit that referenced this pull request Jul 30, 2015
…ils in YARN

This change adds code to notify the scheduler backend when a container dies in YARN.

Author: Mridul Muralidharan <[email protected]>
Author: Marcelo Vanzin <[email protected]>

Closes #7431 from vanzin/SPARK-8297 and squashes the following commits:

471e4a0 [Marcelo Vanzin] Fix unit test after merge.
d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
537da6f [Marcelo Vanzin] Make an expected log less scary.
04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request.
8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297
687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master
9218fcc [Mridul Muralidharan] Fix failing testcase
362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
9ee1307 [Mridul Muralidharan] Fix SPARK-8297
a3a0f01 [Mridul Muralidharan] Fix SPARK-8297