
Conversation

@jsoltren
Owner

What changes were proposed in this pull request?

Implement Automatic Killing of Blacklisted Executors - work in progress

How was this patch tested?

testOnly org.apache.spark.scheduler.BlacklistTrackerSuite org.apache.spark.scheduler.TaskSetManagerSuite


@squito squito left a comment


This approach looks solid. The one tricky part that isn't covered yet is the race between blacklisting a node and then having an executor register on it afterwards.

CoarseGrainedSchedulerBackend can get that via the nodeBlacklist method in TaskScheduler, so hopefully it doesn't require making things too tangled.

@DeveloperApi
def killExecutor(executorId: String): Boolean = killExecutors(Seq(executorId))

/**

typo: Api

Owner Author


Whoops. Fixed.

* :: DeveloperApo ::
* Request that the cluster manager kill all executors on the specified host.
*
* Note: This is an indication to the cluster manager that the application wishes to adjust

why adjust downwards? I would have expected kill and replace. That is what we want for blacklisting, anyway.

Owner Author


Yes, and this means that killExecutorsOnHost has to be updated as well.

@volatile protected var currentExecutorIdCounter = 0

// The set of executors we have on each host.
protected val hostToExecutors = new HashMap[String, HashSet[String]]


CoarseGrainedSchedulerBackend already has a TaskSchedulerImpl, which has this functionality in getExecutorsAliveOnHost. I think you should just call that.


Doesn't executorDataMap contain this information already?

Owner Author


Using executorDataMap requires walking through the list of all the executors and picking out the ones that happen to be on a particular host. This is unfortunate on a large cluster.


What are you calling "a large cluster"?

Let's say you have 5000 executors. How expensive is this walk? Compared to sending messages over the network to kill the matching executors? Taking into account how often you actually have to call this method?
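For reference, the walk in question is just a linear scan of executorDataMap; a rough sketch (assuming each ExecutorData entry records the host it runs on, here called executorHost):

// Sketch of the O(n) walk under discussion: pick out the executor ids whose registered
// host matches the blacklisted host. Even with ~5000 executors this is a few thousand
// string comparisons, which is cheap next to the RPCs needed to actually kill them.
val execsOnHost: Seq[String] = executorDataMap.collect {
  case (execId, data) if data.executorHost == host => execId
}.toSeq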

val host = e._1
hostToExecutors -= host
}
})

if you switch to using TaskSchedulerImpl.getExecutorsAliveOnHost you won't need this at all, but some comments anyhow on this version just for the thought experiment:

  1. this should be in removeExecutor, so it happens in every case where an executor is removed
  2. you can use Scala's PartialFunction syntax to unravel the pair a little more cleanly:
hostToExecutors.foreach { case (host, execs) =>
 ...
}

Owner Author


I'll remember this for next time.

private[scheduler] class BlacklistTracker (
private val listenerBus: LiveListenerBus,
conf: SparkConf,
sc: Option[SparkContext],

Can this take an ExecutorAllocationClient instead, to avoid exposing all of SparkContext? It also doesn't seem like it should be an Option; if the only reason is for tests, you can use a mock in those tests.


+1. I'd rather avoid making changes to SparkContext at all, especially since this particular API doesn't seem particularly useful for an app developer.
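A sketch of the constructor shape being suggested above (illustrative only; the point is to depend on the narrower ExecutorAllocationClient rather than on SparkContext, and to hand tests a mock client instead of using an Option):

private[scheduler] class BlacklistTracker (
    private val listenerBus: LiveListenerBus,
    conf: SparkConf,
    allocationClient: ExecutorAllocationClient) {
  // ...
}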

Owner Author


It is done.

}
}

test("kill all executors on localhost") {

this doesn't really have anything to do with Dynamic Allocation, so this is a strange suite to put this test in. Though it does seem to have a useful framework -- perhaps there is a base class to pull out?

Owner Author


I'll refactor these tests when I start adding more tests for configuration parameters and the like.

*
* @return whether the request is received.
*/
@DeveloperApi

you don't need to add this method anymore

*/
final override def killExecutorsOnHost(host: String): Seq[String] = {
logInfo(s"Requesting to kill any and all executors on host ${host}")
killExecutors(scheduler.getExecutorsAliveOnHost(host).get.toSeq, replace = true, force = true)

it might be possible that scheduler.getExecutorsAliveOnHost returns None -- the executor happens to get removed between when it gets blacklisted and when you end up here.

We might prevent that by the locks we have & the way messages are processed, but it would be better to be safe on this anyhow:

scheduler.getExecutorsAliveOnHost(host).foreach { execs => killExecutors(execs.toSeq, ...) }

I am also now wondering if we should kill all the executors immediately, or if we should push a request into one of the internal queues in the driver. Killing itself is async, so maybe it's fine like this. I'm not really sure yet, something to keep thinking about.

@jsoltren
Owner Author

jsoltren commented Dec 2, 2016

@squito and I chatted some offline about the race he mentioned. To quote:

There is also one race you will need to watch out for:

  1. the driver requests a new executor
  2. the driver then blacklists a node
  3. the cluster manager responds to the earlier request for an executor by giving an executor on the node that is now blacklisted
  4. the driver blacklists the new executor, but never kills it.

We noted that CoarseGrainedSchedulerBackend is itself a singleton class and a driver endpoint for RPC messages. Thus, the easiest way to avoid this race is to perform the killing of blacklisted executors in CoarseGrainedSchedulerBackend.
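Concretely, the shape that falls out of that discussion is visible in the later snippets: the public method only sends a message, and the actual kill happens while the DriverEndpoint processes it, so it cannot interleave with a RegisterExecutor for the same host. A rough sketch:

// Public entry point: just enqueue the request on the driver endpoint.
final override def killExecutorsOnHost(host: String): Unit = {
  driverEndpoint.send(KillExecutorsOnHost(host))
}

// In DriverEndpoint.receive, which handles one message at a time:
case KillExecutorsOnHost(host) =>
  // No executor can register on `host` while this message is being processed, so the
  // set returned here is complete at the moment we kill.
  scheduler.getExecutorsAliveOnHost(host).foreach { execs =>
    killExecutors(execs.toSeq, replace = true, force = true)
  }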


@squito squito left a comment


Don't you still need to add something in CoarseGrainedSchedulerBackend.receiveAndReply / RegisterExecutor, so that you reject executors if they are already blacklisted?

Other than that, just minor comments.

if (newTotal >= MAX_FAILURES_PER_EXEC && !executorIdToBlacklistStatus.contains(exec)) {
logInfo(s"Blacklisting executor id: $exec because it has $newTotal" +
s" task failures in successful task sets")
conf.get(config.BLACKLIST_ENABLED) match {

BlacklistTracker.isBlacklistEnabled

if (blacklistedExecsOnNode.size >= MAX_FAILED_EXEC_PER_NODE) {
logInfo(s"Blacklisting node $node because it has ${blacklistedExecsOnNode.size} " +
s"executors blacklisted: ${blacklistedExecsOnNode}")
conf.get(config.BLACKLIST_ENABLED) match {

BlacklistTracker.isBlacklistEnabled
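In both spots the suggestion is the same: go through the helper on the BlacklistTracker companion object rather than matching on the raw config value, e.g. (sketch):

// Replaces the explicit match on conf.get(config.BLACKLIST_ENABLED); the helper already
// encapsulates how the flag (including any legacy fallback) is interpreted.
if (BlacklistTracker.isBlacklistEnabled(conf)) {
  // ... blacklist the executor / node and, if configured, request the kill ...
}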

Some(new BlacklistTracker(sc, scheduler))
val executorAllocClient: Option[ExecutorAllocationClient] = sc.schedulerBackend match {
case b: ExecutorAllocationClient => Some(b.asInstanceOf[ExecutorAllocationClient])
case _ => None

Maybe it's best to just fail fast right here if blacklist.kill is enabled but you don't have an ExecutorAllocationClient?
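A sketch of that fail-fast check (the config entry name BLACKLIST_KILL_ENABLED is assumed here; substitute whatever key gates killing of blacklisted executors in this change):

val executorAllocClient: Option[ExecutorAllocationClient] = sc.schedulerBackend match {
  case b: ExecutorAllocationClient => Some(b)
  case _ => None
}
// If the user asked for blacklisted executors to be killed but the scheduler backend
// cannot kill executors at all, surface that at startup instead of silently ignoring it.
if (conf.get(config.BLACKLIST_KILL_ENABLED) && executorAllocClient.isEmpty) {
  throw new SparkException("Killing of blacklisted executors is enabled, but the " +
    "scheduler backend does not support killing executors.")
}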

final override def killExecutorsOnHost(host: String): Unit = {
logInfo(s"Requesting to kill any and all executors on host ${host}")
killExecutors(scheduler.getExecutorsAliveOnHost(host).get.toSeq, replace = true, force = true)
driverEndpoint.send(KillExecutorsOnHost(host))

I'd include a comment here on why you delegate this to the driver endpoint rather than doing it immediately here, something along the lines of:

We have to be careful that there isn't a race between killing all executors on the bad host and another executor getting registered on the same host. We avoid that by doing the kill within the DriverEndpoint, which is guaranteed to handle one message at a time since it's a ThreadSafeRpcEndpoint.

* Request that the cluster manager try harder to kill the specified executors,
* and maybe replace them.
* @return whether the request is acknowledged by the cluster manager.
*/

I don't think you need this in the API, do you?

@jsoltren
Owner Author

> Don't you still need to add something in CoarseGrainedSchedulerBackend.receiveAndReply / RegisterExecutor, so that you reject executors if they are already blacklisted?

We had discussed this earlier. I had thought through this. I'm not sure if it makes sense since we're going through an RPC method that should atomically be killing executors on a node. As far as the RegisterExecutor response is concerned, either we'll have done the killing already, or doing so is pending.

I guess BlacklistTracker could try to update CoarseGrainedSchedulerBackend ASAP, separately from the RPC mechanism. But again, I thought that going through RPC removed the need for modifying RegisterExecutor.

@squito

squito commented Dec 14, 2016

> Don't you still need to add something in CoarseGrainedSchedulerBackend.receiveAndReply / RegisterExecutor, so that you reject executors if they are already blacklisted?
>
> We had discussed this earlier. I had thought through this. I'm not sure if it makes sense since we're going through an RPC method that should atomically be killing executors on a node.
>
> I guess BlacklistTracker could try to update CoarseGrainedSchedulerBackend ASAP, separately from the RPC mechanism. But again, I thought that going through RPC removed the need for modifying RegisterExecutor.

I don't think that is enough. There are multiple sources for the race. First, there are multiple threads in the driver touching shared memory. We need to make sure there isn't a race within those threads -- one thread registers a new executor on the bad host while, at the same time, another thread thinks it's killing all executors on that host but doesn't yet know about the new executor. By grabbing the list of executors within DriverEndpoint, we avoid that race.

But another race is that the actual creation of executors happens in a different process, likely on a different host. We don't want to make that distributed process serial. So you could have:

  1. Driver requests more executors from cluster manager
  2. Driver decides host X is bad
  3. Driver kills all executors on host X
  4. Cluster manager gives driver a new executor on host X

We can't force (4) to happen before any of the other steps; that event might come in long after the driver is done killing all executors, but the driver still needs to reject the new executor.
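So the registration path itself has to check the node blacklist; roughly (a sketch of the RegisterExecutor handling in CoarseGrainedSchedulerBackend.receiveAndReply, assuming the nodeBlacklist method on TaskScheduler mentioned at the top of this review):

case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
  if (scheduler.nodeBlacklist().contains(hostname)) {
    // The cluster manager gave us an executor on a host that was blacklisted after we
    // requested it (step 4 above): reject it instead of registering and then killing it.
    context.reply(RegisterExecutorFailed(
      s"Executor is blacklisted due to failures on host $hostname"))
  } else {
    // ... existing registration logic ...
  }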
