
Conversation

@vanzin (Contributor) commented Jul 15, 2015

This change adds code to notify the scheduler backend when a container dies in YARN.
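
For readers skimming the thread, a rough, self-contained sketch of the general idea follows (the names AllocatorSketch, DriverRef, and RemoveExecutor are illustrative assumptions, not the classes the patch actually touches): when YARN reports a completed container, the AM-side allocator looks up the executor that was running in it and proactively sends a remove request to the driver's scheduler backend.

```scala
// Illustrative sketch only; hypothetical names, not the real Spark classes.
case class RemoveExecutor(executorId: String, reason: String)

trait DriverRef {
  def send(msg: Any): Unit
}

class AllocatorSketch(driver: DriverRef) {
  // Maps YARN container ids to the executor ids started in them.
  private val containerToExecutor = scala.collection.mutable.Map[String, String]()

  def registerContainer(containerId: String, executorId: String): Unit =
    containerToExecutor(containerId) = executorId

  // Called when the ResourceManager reports completed (dead) containers.
  def onContainersCompleted(completedContainerIds: Seq[String]): Unit = {
    completedContainerIds.foreach { cid =>
      containerToExecutor.remove(cid).foreach { execId =>
        // Proactively tell the driver's scheduler backend that the executor is
        // gone, rather than waiting for the RPC disconnect to be noticed.
        driver.send(RemoveExecutor(execId, s"Container $cid exited"))
      }
    }
  }
}
```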

@vanzin (Contributor, Author) commented Jul 15, 2015

This is an updated version of #7243, which also works in yarn-client mode.

/cc @mridulm @tgravescs

I ran a bunch of tests on yarn client and cluster modes, but I don't have a test that specifically exercises the bug, so take that with a grain of salt.

@mridulm (Contributor) commented Jul 15, 2015

Looks good to me! If Tom does not come back with objections, I am +1 on this.

Thanks for patching this in, @vanzin!

@SparkQA commented Jul 15, 2015

Test build #37412 has finished for PR 7431 at commit 04dc112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Pmod(left: Expression, right: Expression) extends BinaryArithmetic
    • final class SpecificRow extends $

@tgravescs (Contributor)

The changes look good. To manually test this, can't you just remove/kill a NodeManager that has an executor on it? Perhaps there is a race between when YARN tells us and when we recognize the failure ourselves?

If you don't have a cluster I can try it out later today.

@vanzin (Contributor, Author) commented Jul 16, 2015

I can try it out (I'll also add some logging in my internal build to make sure the code is actually being exercised).

@vanzin (Contributor, Author) commented Jul 16, 2015

So now that I tried the new code path (which works), I'm a little skeptical that sending a message back to the driver is really needed. The driver already removes the executor when the RPC connection is reset:

15/07/16 12:30:15 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 4, vanzin-st1-3.vpc.cloudera.com): ExecutorLostFailure (executor 3 lost)
15/07/16 12:30:15 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:36279] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
15/07/16 12:30:15 INFO DAGScheduler: Executor lost: 3 (epoch 0)
15/07/16 12:30:15 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
15/07/16 12:30:15 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, vanzin-st1-3.vpc.cloudera.com, 37469)
15/07/16 12:30:15 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor

The new message ends up being a no-op:

15/07/16 12:30:18 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 3

See CoarseGrainedSchedulerBackend::DriverEndpoint::removeExecutor.
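
For context, here is a minimal sketch of why that second removal ends up a no-op, assuming the backend tracks live executors in a simple map (this illustrates the pattern only, not the actual DriverEndpoint implementation): once the disconnect handler has removed the entry, a later remove request for the same id falls into the "non-existent executor" branch.

```scala
// Sketch of the pattern only; not the real CoarseGrainedSchedulerBackend code.
class DriverEndpointSketch {
  private val executors = scala.collection.mutable.Map[String, String]() // id -> host

  def registerExecutor(id: String, host: String): Unit = executors(id) = host

  def removeExecutor(id: String, reason: String): Unit = executors.remove(id) match {
    case Some(host) =>
      println(s"Removed executor $id on $host ($reason)")
    case None =>
      // This is the branch behind the ERROR above: the disconnect handler has
      // already removed the entry, so the message sent from the YARN allocator
      // finds nothing left to clean up.
      println(s"Asked to remove non-existent executor $id")
  }
}
```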

So I'm a little confused about how this change is fixing anything. The bug talks about "repeated re-execution of stages" - isn't that the correct way of handling executor failures? You retry tasks or stages depending on what the failure is.

Perhaps the real issue you ran into is something like #6750 instead?

@vanzin (Contributor, Author) commented Jul 16, 2015

(BTW to test this, killing the NM is not enough, since that doesn't seem to cause child containers to be killed.)

Contributor (review comment on the diff)

Aren't we already inside an if (!alreadyReleased) block up there? If I'm not wrong, we don't need to check this here again.

Contributor (review comment on the diff)

No, it comes out of the first if (!alreadyReleased) at line 423 of the diff (418 of the original) if you expand the context. The diff here just makes it look like the check is still inside that block.
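
To illustrate what this exchange is getting at, here is a hypothetical sketch of the shape being discussed (made-up code, not the actual YarnAllocator diff): the new notification sits after the first if (!alreadyReleased) block has already closed, so the flag has to be consulted again, and only the diff's narrow context made it look nested.

```scala
// Hypothetical shape of the code under discussion; not the real YarnAllocator.
object AlreadyReleasedSketch {
  def processCompletedContainer(alreadyReleased: Boolean, executorId: String): Unit = {
    if (!alreadyReleased) {
      // ... bookkeeping for containers Spark did not release itself ...
    }

    // The first block has closed by this point, so the flag is checked again
    // before notifying the driver.
    if (!alreadyReleased) {
      notifyDriver(executorId)
    }
  }

  // Placeholder so the sketch is self-contained.
  private def notifyDriver(executorId: String): Unit = ()
}
```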

@andrewor14 (Contributor)

Yeah, it seems like a race condition between the two detection code paths. If they are both guaranteed to be called, then we're always going to end up with the ERROR: Asked to kill non-existent executor. Is there a case where the disconnection detection fails?

@tgravescs (Contributor)

https://issues.apache.org/jira/browse/SPARK-8297?focusedCommentId=14583560&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14583560

has more information. (The heartbeat timeout is increased due to GC issues we see in Spark.)

I'm guessing there might be situations where it doesn't recognize the disconnect right away. Either way, I don't think this change hurts anything, and it covers any weird cases that might happen. The error message, I assume, only happens in the few cases where you hit the exact race; otherwise I would expect one side or the other to win. Generally I would expect the Spark scheduler to recognize the disconnect first, causing the executor to be removed from our list so we don't even make the call. I didn't verify that in the code, though. @vanzin, do you happen to know if that is true? Otherwise I'll check tomorrow.

@vanzin (Contributor, Author) commented Jul 17, 2015

I'm guessing there might be situations where it doesn't recognize the disconnect right away.

That's the only case in which this code might help; but does that really happen? The executor process is already gone (otherwise YARN wouldn't notify the AM). The disconnection message has to go through the Akka queues to get to the destination, but so do the messages being sent here.

So with this code, unless the disconnect message can somehow be lost, you'll always end up with that ERROR log, because the two code paths are sort of redundant.

@tgravescs (Contributor)

That's not necessarily true. If an NM goes away for long enough, the RM will assume all containers on it are gone too, so you would get the notification saying they are gone even if they are still running. If that executor was somehow in a weird state, we would want it removed.

@vanzin (Contributor, Author) commented Jul 17, 2015

If that executor was somehow in a weird state, we would want it removed.

Feels to me like Spark's own heartbeat would take care of that failure mode. The NM being in a bad state does not mean the executor is in a bad state.

I'm just trying to understand why the change is needed. I haven't seen anything that really requires it yet; it seems like other safeguards in Spark already take care of the failure modes that have been identified so far. I'm also not against adding it, but if we do, perhaps we should demote that log message to be less scary (ERROR is too strong when we expect it to happen).
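
For what it's worth, the demotion being suggested is essentially a one-line change; a rough sketch of the idea (using java.util.logging only to keep the snippet self-contained; the real code would use Spark's own Logging trait, and the level actually chosen in the patch may differ):

```scala
import java.util.logging.{Level, Logger}

// Sketch of "make the expected log less scary"; illustrative only.
object RemoveExecutorLoggingSketch {
  private val log = Logger.getLogger("scheduler")

  def onRemoveRequest(executorId: String, known: Boolean): Unit = {
    if (known) {
      log.info(s"Removing executor $executorId")
    } else {
      // Previously logged as an ERROR; since we expect to hit this whenever the
      // disconnect path wins the race, a lower level reads less alarming.
      log.log(Level.INFO, s"Asked to remove non-existent executor $executorId")
    }
  }
}
```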

@tgravescs (Contributor)

It doesn't necessarily mean the executor is in a bad state, but it could be. I've seen very weird things happen on machines before, so personally I don't think it hurts to put this in, and it covers those weird cases that could come up. I'm fine with changing the log message to be less scary.

@SparkQA commented Jul 20, 2015

Test build #37851 has finished for PR 7431 at commit 537da6f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 21, 2015

Jenkins retest this please.

@SparkQA commented Jul 21, 2015

Test build #37967 has finished for PR 7431 at commit 537da6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 21, 2015

Test build #44 has finished for PR 7431 at commit 537da6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 23, 2015

Jenkins retest this please.

@SparkQA commented Jul 23, 2015

Test build #38233 has finished for PR 7431 at commit 537da6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2015

Test build #81 has finished for PR 7431 at commit 537da6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

The last change was just a log level change, so this LGTM.

Merge commit noting conflicts in:
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@vanzin (Contributor, Author) commented Jul 28, 2015

Jenkins retest this please.

@SparkQA commented Jul 28, 2015

Test build #139 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Abs(child: Expression) extends UnaryExpression with ExpectsInputTypes

@SparkQA commented Jul 28, 2015

Test build #38747 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 28, 2015

Jenkins retest this please.

@SparkQA commented Jul 29, 2015

Test build #38772 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2015

Test build #142 has finished for PR 7431 at commit 3b262e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public static class StructWriter
    • case class Abs(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class CreateStructUnsafe(children: Seq[Expression]) extends Expression
    • case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression
    • case class TungstenProject(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode

@SparkQA commented Jul 29, 2015

Test build #38794 has finished for PR 7431 at commit d4adf4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2015

Test build #38900 has finished for PR 7431 at commit 471e4a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jul 30, 2015

Merged to master.

@vanzin closed this Jul 30, 2015
@vanzin deleted the SPARK-8297 branch on July 30, 2015 at 17:39
asfgit pushed a commit that referenced this pull request Jul 30, 2015
…ils in YARN

This change adds code to notify the scheduler backend when a container dies in YARN.

Author: Mridul Muralidharan <[email protected]>
Author: Marcelo Vanzin <[email protected]>

Closes #7431 from vanzin/SPARK-8297 and squashes the following commits:

471e4a0 [Marcelo Vanzin] Fix unit test after merge.
d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
537da6f [Marcelo Vanzin] Make an expected log less scary.
04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request.
8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297
687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master
9218fcc [Mridul Muralidharan] Fix failing testcase
362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
9ee1307 [Mridul Muralidharan] Fix SPARK-8297
a3a0f01 [Mridul Muralidharan] Fix SPARK-8297