[SPARK-22148][SPARK-15815][Scheduler] Acquire new executors to avoid hang because of blacklisting #22288
Conversation
Test build #95485 has finished for PR 22288 at commit

@squito @tgravescs Can you review this PR? Thanks.
    hostToExecutors.valuesIterator.foreach(executors => executors.foreach({
      executor =>
        logDebug("Killing executor because of task unschedulability: " + executor)
        blacklistTrackerOpt.foreach(blt => blt.killBlacklistedExecutor(executor))
Seriously? You killed all executors? What if other taskSets' tasks are running on them?
BTW, if you want to refresh executors, you also have to enable spark.blacklist.killBlacklistedExecutors.
- To refresh executors, you need to enable spark.blacklist.killBlacklistedExecutors.
- I was thinking about it; killing all the executors is a little too harsh. Killing only a single executor would help mitigate this, although it would also fail the tasks running on that executor.
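For readers following along, here is a minimal sketch (not part of this PR) of the configuration the discussion assumes: blacklisting must be on, killing blacklisted executors must be enabled explicitly, and dynamic allocation is what lets the cluster manager hand back replacements for killed executors.

```scala
import org.apache.spark.SparkConf

// Sketch only: the settings assumed by the discussion above.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")                  // enable task/executor blacklisting
  .set("spark.blacklist.killBlacklistedExecutors", "true") // allow the tracker to kill executors
  .set("spark.dynamicAllocation.enabled", "true")          // replacements come via dynamic allocation
  .set("spark.shuffle.service.enabled", "true")            // required when dynamic allocation is on
```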
        }, UNSCHEDULABLE_TASKSET_TIMEOUT_MS)
      }
    } else {
      // TODO: try acquiring new executors for static allocation before aborting.
How? By waiting for other tasks to finish and release resources?
Test build #95729 has finished for PR 22288 at commit
squito left a comment
Thanks for looking at this @dhruve, sorry for the delay. Just some first thoughts from me, I need to go read my thoughts on the related jiras a bit.
I think you can add tests for this in BlacklistIntegrationSuite, but you'd need to extend it to allow for executors to get added and removed.
    // unable to schedule any task from the taskSet.
    // Note: We keep a track of schedulability on a per taskSet basis rather than on a
    // per task basis.
    val executor = hostToExecutors.valuesIterator.next().iterator.next()
hostToExecutors.head._2.head
just thinking "aloud" -- I guess taking an arbitrary executor here is OK, as we know there is some task that can't run on any executor. But I wonder if we could have some priority here -- e.g. I'd much rather kill an executor which has been blacklisted for an entire stage or the whole app, rather than one that was blacklisted for just some task. Need to look into whether there is an efficient way to keep that priority list, though.
That's a nice suggestion.
There was a case where you could have only a few executors running, let's say just 3 of them, all blacklisted but with some tasks running on them. To handle this, I had started modifying the change to take down the executor with the fewest tasks running on it. I'll check some more on this.
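A minimal sketch of the selection order being discussed here, using a hypothetical model rather than the scheduler's real data structures: prefer executors blacklisted for the whole application or stage, and break ties by the fewest running tasks.

```scala
// Hypothetical model (not the PR's code): choose which blacklisted executor to kill.
case class ExecInfo(id: String, blacklistedForApp: Boolean,
    blacklistedForStage: Boolean, runningTasks: Int)

def chooseExecutorToKill(execs: Seq[ExecInfo]): Option[String] =
  // Booleans sort false-first, so negating puts app/stage-blacklisted executors first;
  // ties are broken by the number of running tasks.
  execs.sortBy(e => (!e.blacklistedForApp, !e.blacklistedForStage, e.runningTasks))
    .headOption.map(_.id)

// The app-blacklisted executor wins even though it has more running tasks.
chooseExecutorToKill(Seq(
  ExecInfo("exec-1", blacklistedForApp = false, blacklistedForStage = true, runningTasks = 0),
  ExecInfo("exec-2", blacklistedForApp = true, blacklistedForStage = false, runningTasks = 2)
)) // => Some("exec-2")
```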
I'm wondering whether it is worth killing an executor that has tasks running on it? After all, a task blacklisted on all currently allocated executors is not guaranteed to run on a newly allocated executor.
    // Note: We keep a track of schedulability on a per taskSet basis rather than on a
    // per task basis.
    val executor = hostToExecutors.valuesIterator.next().iterator.next()
    logDebug("Killing executor because of task unschedulability: " + executor)
I think this should probably be logInfo (unless there is something else similar at INFO level elsewhere)
noted.
    blacklistTrackerOpt.foreach(blt => blt.killBlacklistedExecutor(executor))

    if (!unschedulableTaskSetToExpiryTime.contains(taskSet)) {
      unschedulableTaskSetToExpiryTime(taskSet) = clock.getTimeMillis()
I'd include a logInfo here that Spark can't schedule anything because of blacklisting, but it's going to try to kill blacklisted executors and acquire new ones. Also mention how long it will wait before giving up and the associated conf.
noted.
    } else {
      // If a task was scheduled, we clear the expiry time for the taskSet. The abort timer
      // checks this entry to decide if we want to abort the taskSet.
      if (unschedulableTaskSetToExpiryTime.contains(taskSet)) {
you can move this up to the else so it's an else if. Or you could also just call remove without checking contains; that avoids probing twice.
just calling the remove sounds like a good idea.
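A tiny illustration of the suggestion, with a String key standing in here for the TaskSetManager: mutable.HashMap.remove is a no-op when the key is absent, so the contains probe can be dropped.

```scala
import scala.collection.mutable

val unschedulableTaskSetToExpiryTime = mutable.HashMap[String, Long]("taskSet1" -> 100L)

// Before: two hash lookups.
if (unschedulableTaskSetToExpiryTime.contains("taskSet1")) {
  unschedulableTaskSetToExpiryTime.remove("taskSet1")
}

// After: one lookup, same effect (remove simply returns None when the key is missing).
unschedulableTaskSetToExpiryTime.remove("taskSet1")
```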
   * will hang.
   * spark.task.maxFailures. We need to detect this so we can avoid the job from being hung.
   * If dynamic allocation is enabled we try to acquire new executor/s by killing the existing one.
   * In case of static allocation we abort the taskSet immediately to fail the job.
why do you want something different with static allocation? If you kill an executor, static allocation will also request a replacement.
Yes. The change of removing a single executor takes care of static allocation as well. I will update the comments.
Ok I looked at the jiras, and it looks like this also covers SPARK-15815, right? You could add that to the summary too. You mention some future improvements:
I mentioned this on an inline comment too, but now that I'm thinking about it, it seems like this will be fine with static allocation as well. It just seems like the problem is worst with DA, as you can end up with one executor left for the straggler task, and then that executor gets blacklisted. But with static allocation, maybe you only requested a small number of executors on a large cluster, and by chance you get them all on a host with bad disks, so then everything starts failing. You could still just kill those executors and request new ones to keep things going. Anything I'm missing?
what's the concern here -- that if you're on a small cluster, there is very little chance of getting a good replacement, so you should go back to failing fast? I guess that would be nice, but much less important in my opinion.
I don't understand this part -- do you mean for locality preferences?
dhruve left a comment
Yes. It covers SPARK-15815 as well - but you provided a fix for the hang by aborting immediately.
1 - This satisfies the condition with Static Allocation as well. I will remove the comment from the code.
2 - Failing fast is the intent here. It's a nice-to-have, so I kept it as a TODO in case we really want it.
3 - If it takes more time to acquire a new executor after killing a blacklisted one and the abort timer is up, we end up aborting the TaskSet. This was to see if we want to account for the elapsed time without including the time it took to obtain a new executor. Or we could just set the abort timer's expiration interval to a higher default value, which should cover most of the cases.
As I mentioned at #22288 (comment), I'm quite worried about this killing behaviour. I think we should kill an executor only if it is idle. Looking through the discussion above, here are my thoughts:
Maybe, we can add
Set
yeah I'm not sure you can do much better. What if it takes forever to get a new executor? There's no guarantee you will get anything else. I don't see much value in adding another timer for that case, but happy to hear about an alternative.
yes, you have a good point. So the two extremes we need to consider are:
In (2), we should think about other jobs running concurrently. (You could have concurrent jobs with (1), but if there is another job you probably have other executors up, so it's less likely.) It would be bad for us to kill things in scenario (2), where one bad taskset leads to us killing executors for other jobs. But if we wait indefinitely for an idle executor to kill, then that taskset may wait indefinitely, which is also bad.
maybe this would help, lemme think about it ... I'd rather avoid adding this to the Listener api just for this as it should be an entirely internal detail, but maybe that is all we can do. I guess this would let you bump up the request, as long as you're lower than the max executors, so it would solve the case when there is only one executor. But in case 2, you'd probably end up requesting a whole bunch more executors very briefly, until there are enough failures on one specific task. Or maybe we can ensure that even if there are a huge number of unschedulable tasks, we only ever request one extra executor?
sorry I don't think I understand this part. Is this the same as the current PR, but just killing only if idle?
cc @jiangxb1987 @attilapiros also for thoughts
(I'm on an outside trip these days, so I have to use my mobile phone to type this. Sorry for the format.)
Yes, similar. This avoids a TaskSet waiting to be scheduled indefinitely. So, in case 2, if we do not find an idle executor before the timeout, the TaskSet would abort rather than hang.
I'm not sure I have understood this part totally. But I realized that, by now, our DA strategy is basically based on tasks' status, e.g. pending, speculative. However, whether an executor is blacklisted depends on a successful TaskSet's status (IIRC). So this fact may introduce a level mismatch when we want to involve DA in TaskSchedulerImpl. (hope I understood your main thought)
Test build #96443 has finished for PR 22288 at commit

Test build #96432 has finished for PR 22288 at commit

retest this please

the failures seem to be unrelated. I wasn't able to reproduce them.

Test build #96767 has finished for PR 22288 at commit
  // blacklisting.
  private[spark] val UNSCHEDULABLE_TASKSET_TIMEOUT =
    ConfigBuilder("spark.scheduler.unschedulableTaskSetTimeout")
      .doc("The timeout in seconds to wait before aborting a TaskSet to acquire a new executor " +
reword to be timeout in seconds to wait to try to acquire a new executor and schedule a task before aborting....
add blacklist to the name of the config, since it really only applies to blacklisted executors
also need to document in the .md file
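To make the three suggestions above concrete, here is a sketch of how the renamed entry might look inside Spark's internal config package. The exact key and the 120s default shown here are assumptions for illustration, not quotes from the merged change.

```scala
package org.apache.spark.internal.config

import java.util.concurrent.TimeUnit

// Sketch only: the renamed entry as the review comments above suggest.
private[spark] object BlacklistTimeoutSketch {
  val UNSCHEDULABLE_TASKSET_TIMEOUT =
    ConfigBuilder("spark.scheduler.blacklist.unschedulableTaskSetTimeout")
      .doc("The timeout in seconds to wait to acquire a new executor and schedule a task " +
        "before aborting a TaskSet which is unschedulable because of being completely " +
        "blacklisted.")
      .timeConf(TimeUnit.SECONDS)
      .checkValue(v => v >= 0, "The timeout value should be non-negative")
      .createWithDefault(120L)
}
```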
  protected val executorIdToHost = new HashMap[String, String]

  private val abortTimer = new Timer(true)
remove unneeded newline
  private val abortTimer = new Timer(true)

  private val clock = new SystemClock
remove newline
      // If the taskSet is unschedulable we try to find an existing idle blacklisted
      // executor. If we cannot find one, we abort immediately. Else we kill the idle
      // executor and kick off an abortTimer which after waiting will abort the taskSet if
which if it doesn't schedule a task within the timeout will abort the taskset
      // executor. If we cannot find one, we abort immediately. Else we kill the idle
      // executor and kick off an abortTimer which after waiting will abort the taskSet if
      // we were unable to schedule any task from the taskSet.
      // Note 1: We keep a track of schedulability on a per taskSet basis rather than on a
we keep track
        }
      case _ => // Abort Immediately
        logInfo("Cannot schedule any task because of complete blacklisting. No idle" +
          s" executors could be found. Aborting $taskSet." )
can be found to kill
| s" executors could be found. Aborting $taskSet." ) | ||
| taskSet.abortSinceCompletelyBlacklisted(taskIndex.get) | ||
| } | ||
| case _ => // Do nothing. |
perhaps expand to say do nothing if no tasks completely blacklisted. It looks like the indentation is off here too but it might just be because of the diff and comments
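Pulling the pieces of the quoted diff together, here is a self-contained toy model of the flow being reviewed, with simplified names and types rather than the PR's actual TaskSchedulerImpl code: when nothing was launched and some task is completely blacklisted, kill an idle blacklisted executor once and arm an abort timer; with no idle executor, abort immediately; when anything is launched, clear the pending expiry entries.

```scala
import java.util.{Timer, TimerTask}
import scala.collection.mutable

// Toy model of the reviewed flow; names and types are simplified stand-ins.
object UnschedulableFlowSketch {
  private val unschedulableTaskSetToExpiryTime = mutable.HashMap[String, Long]()
  private val abortTimer = new Timer(true)
  private val timeoutMs = 120 * 1000L

  def onResourceOffers(
      taskSet: String,
      launchedAnyTask: Boolean,
      completelyBlacklistedTask: Option[Int],
      idleBlacklistedExecutor: Option[String],
      killExecutor: String => Unit,
      abortTaskSet: Int => Unit): Unit = {
    if (!launchedAnyTask) {
      completelyBlacklistedTask.foreach { taskIndex =>
        idleBlacklistedExecutor match {
          case Some(exec) =>
            // Only kill an executor and arm the timer once per unschedulable task set.
            if (!unschedulableTaskSetToExpiryTime.contains(taskSet)) {
              killExecutor(exec)
              unschedulableTaskSetToExpiryTime(taskSet) = System.currentTimeMillis()
              abortTimer.schedule(new TimerTask {
                override def run(): Unit = {
                  // Abort only if the task set is still marked unschedulable.
                  if (unschedulableTaskSetToExpiryTime.contains(taskSet)) abortTaskSet(taskIndex)
                }
              }, timeoutMs)
            }
          case None =>
            // No idle blacklisted executor to kill: abort immediately.
            abortTaskSet(taskIndex)
        }
      }
    } else {
      // Progress was made somewhere, so clear every pending expiry entry.
      unschedulableTaskSetToExpiryTime.clear()
    }
  }
}
```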
    )
    // Wait for the failed task to propagate.
    Thread.sleep(500)
    // taskScheduler.handleFailedTask(tsm, failedTask.taskId, TaskState.FAILED, TaskResultLost)
remove commented out code
  }

  test("SPARK-22148 try to acquire a new executor when task is unschedulable with 1 executor") {
remove extra line
Test build #97165 has finished for PR 22288 at commit

It applies to both DA and SA. I have updated the description.
squito left a comment
I had another thought as I was looking through this again -- do we have any test cases for when you've got some task locality, and you blacklist the executors with preferred locality, but other executors are available and you just haven't crossed the locality delay time yet? I think everything will be OK, but it would be nice to have a test case for it.
      case Some(taskIndex) => // Returns the taskIndex which was unschedulable

        // If the taskSet is unschedulable we try to find an existing idle blacklisted
        // executor. If we cannot find one, we abort immediately. Else we kill the idle
I don't think this is true -- if there is no idle executor here, you abort the taskset immediately and never start a timer, via this case lower down: case _ => // Abort Immediately.
I think to do what you described, you would instead need to do something different in that case, like start the same abortTimer, and also set a flag needToKillIdleExecutor and then on every call to resourceOffer, check that flag and potentially find an executor to kill. (However I haven't totally thought through that, not sure if it would really work. again, I'm not saying this has to be addressed now, just thinking this through)
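For illustration only, here is a hypothetical sketch of that alternative; the name needToKillIdleExecutor and the surrounding structure are made up here, and this is not what the PR does. The idea is to keep the abort-timer bookkeeping, remember that a kill is pending, and retry on later offers.

```scala
import scala.collection.mutable

// Hypothetical alternative sketched in the comment above -- not part of this PR.
class DeferredKillSketch(killExecutor: String => Unit) {
  private var needToKillIdleExecutor = false
  private val unschedulableTaskSetToExpiryTime = mutable.HashMap[String, Long]()

  def onCompletelyBlacklisted(taskSet: String, idleExecutor: Option[String]): Unit = {
    // Arm the abort-timer bookkeeping either way.
    unschedulableTaskSetToExpiryTime.getOrElseUpdate(taskSet, System.currentTimeMillis())
    idleExecutor match {
      case Some(exec) => killExecutor(exec)
      case None => needToKillIdleExecutor = true // defer instead of aborting right away
    }
  }

  // Checked on every resourceOffer: if a kill is pending and an executor has become
  // idle in the meantime, kill it and clear the flag.
  def maybeKillOnOffer(idleExecutor: Option[String]): Unit = {
    if (needToKillIdleExecutor) {
      idleExecutor.foreach { exec =>
        killExecutor(exec)
        needToKillIdleExecutor = false
      }
    }
  }
}
```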
Test build #97879 has finished for PR 22288 at commit

@squito for the locality wait, it would be the same as the condition where it is not completely blacklisted. I have added a test for this. If we want to ensure the sequence of the timeout expiring and the task being scheduled, we will have to add some more delay. Let me know if we want to do that, or if the test seems to suffice.

Test build #97943 has finished for PR 22288 at commit
squito left a comment
mostly minor stuff, but I did have one concern about jobs waiting indefinitely. (Perhaps I don't understand things properly yet)
| s" executors can be found to kill. Aborting $taskSet." ) | ||
| taskSet.abortSinceCompletelyBlacklisted(taskIndex) | ||
| } | ||
| case _ => // Do nothing if no tasks completely blacklisted. |
you can remove this case if instead above you do
taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors).foreach { taskIndex =>
I have seen this style earlier in the code base. Is this a norm (just curious)? I read a few scenarios where this would be better. However, personally, every time I read a foreach, it's instinctive to think of the entity it's being invoked on as an iterable rather than an Option, so it feels a bit odd.
doesn't matter a ton, I think it's just a scala-ism that takes a while to get used to. My rough guideline is: use pattern matching if you're doing something distinct in both the Some and None cases, or if you can make use of more complex patterns to avoid more nesting (e.g. case Some(x) if x.isFoo() =>). If you're only doing something in the Some branch, then generally prefer map, foreach, filter, etc.
My reason for wanting it here is that when I look at this code, I needed to scroll back to figure out what you were even matching on here and make sure you weren't ignoring something important. When I see the match up above, I assume something is going to happen in both branches. OTOH if there was a foreach, when I see the foreach I know right away you're ignoring None.
again this is really minor, I don't actually care that much, just explaining my thinking.
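A tiny, generic illustration of that guideline; the Option value here is made up, not from the PR:

```scala
val maybeIndex: Option[Int] = Some(3)

// Only the Some branch does any work: foreach reads as "if present, do this".
maybeIndex.foreach { idx => println(s"handling unschedulable task $idx") }

// Both branches do distinct work: pattern matching is clearer.
maybeIndex match {
  case Some(idx) => println(s"handling unschedulable task $idx")
  case None => println("nothing to do")
}
```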
          abortTimer.schedule(
            createUnschedulableTaskSetAbortTimer(taskSet, taskIndex), timeout)
        }
      case _ => // Abort Immediately
really minor, I think it's a bit more clear if you say case None here (otherwise it takes me just a second to figure out what other patterns will fall under this catch-all)
Makes sense. Will update it.
      // We want to defer killing any taskSets as long as we have a non blacklisted executor
      // which can be used to schedule a task from any active taskSets. This ensures that the
      // job can make progress and if we encounter a flawed taskSet it will eventually either
      // fail or abort due to being completely blacklisted.
I think you should say here that you may have a job wait indefinitely if it has effectively blacklisted the entire cluster, but other jobs keep coming in and keeping resources occupied so the cluster stays busy. So it's not really accurate to say that it will be aborted eventually; we are actually not guaranteeing that (if I understood things correctly).
Since it's folded now, lemme reference the prior discussion on this: #22288 (comment)
Want to make sure I understand this part, and why you aren't only clearing the timer for the taskset you just scheduled a task for. If you have multiple tasksets running simultaneously, one is making progress but the other is totally blacklisted, I guess you do not want to kill anything, because that would mess with the taskset that is working correctly? Instead you'll just let the taskset which is totally blacklisted eventually fail from the timeout? I guess that makes sense, because if one taskset is progressing, the failing taskset is probably flawed, not the executors.
If that's right, it would be good to include something along those lines in the comment (personally I don't find a comment about how it's related to the timer that useful; that's obvious from the code).
dhruve:
That is correct. It also covers another scenario that @tgravescs originally pointed out. Let's say you have multiple taskSets running which are completely blacklisted. If you were able to get an executor, you would just clear the timer for that specific taskSet. Now, due to resource constraints, if you weren't able to obtain another executor within the timeout for the other taskSet, you would abort the other taskSet when you could actually wait for it to be scheduled on the newly obtained executor.
So clearing the timer for all the taskSets ensures that we aren't currently in a completely blacklisted state and should try to run to completion. However, if the taskSet itself is flawed, we would eventually fail. This could result in wasted effort, but we don't have a way to determine that yet, so this should be okay.
Your understanding is correct. I will update the comment.
@tgravescs since we've been back and forth on the discussion of the cases here, just want to make sure you're aware of the possibility for waiting indefinitely here.
Thanks for pointing this out, but if I'm reading the discussion properly, I don't think you will actually wait indefinitely. Eventually you will either abort immediately or you should fail due to the max number of task failures. Let me know if I'm missing something from the scenario.
Let's say you have taskset1 that is blacklisted on all nodes (say we have 3). 3 cases can happen at this point:
- taskset 2 hasn't started, so it tries to kill an executor and starts the timer.
- taskset 2 has started; if it's running on all nodes then we abort immediately because there are no executors to kill
- taskset 2 has started but it's not running on all blacklisted nodes, then we will kill an executor
At this point let's say we didn't abort, so we killed an executor. Taskset 1 will get a chance to run on the new executor and either work or have a task failure. If it has a task failure and it gets blacklisted, we go back into the case above. But the # of task failures gets one closer.
So it seems like eventually you would either abort immediately if there aren't any executors to kill, or you would eventually fail with the max number of task attempts.
Here's the scenario I'm worried about:
- taskset1 and taskset2 are both running currently. taskset1 has enough failures to get blacklisted everywhere.
- there is an idle executor, even though taskset2 is running (eg. the executor that is available doesn't meet the locality preferences of taskset2). So abort timer is started.
- the idle executor is killed, and you get a new one.
- just by luck, taskset2 gets a hold of the new idle executor (eg. the executor is on a node blacklisted by taskset1, or taskset2 just has a higher priority). abort timer is cleared
- taskset2 finishes, but meanwhile taskset3 has been launched, and it uses the idle executor; etc. for taskSetN. So tasks keep launching and the abort timer keeps getting cleared, but nothing ever gets scheduled from taskset1.
admittedly this would not be the normal scenario -- you'd need more tasksets to keep coming, and you need tight enough resource constraints that taskset1 never gets hold of anything, even the new executor.
ok, yeah it seems like it would have to be very timing-dependent for taskset1 to never get a chance at that executor; really that would just be a normal indefinite-postponement problem in the scheduler regardless of blacklisting. I don't think it's a problem with FIFO, as the first taskset should always be first. With the Fair scheduler perhaps it could be, but that probably depends on a much more specific scenario.
I guess I'm ok with this if you are.
so it's a bit worse than regular starvation from having competing tasksets, as in this case you might actually have resources available on your cluster, but you never ask for them, because the executor allocation manager thinks you have enough based on the number of pending tasks.
In any case, I agree this is a stretch, and overall it's an improvement, so I'm OK with it.
| WorkerOffer("executor0", "host0", 1) | ||
| )).flatten.size === 0) | ||
| // Wait for the abort timer to kick in. Without sleep the test exits before the timer is | ||
| // triggered. |
comment is out of date, there is no sleep anymore. But it is still worth explaining that even though there is no configured delay, we still have to wait for the abort timer to run in a separate thread.
Comment still out of date
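One way the updated comment could be backed by the test itself, sketched here under the assumption that the suite can use ScalaTest's Eventually and that tsm is the TaskSetManager created earlier in the test: even with a zero configured timeout, the abort runs on the Timer's own thread, so the assertion has to poll rather than check immediately.

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Sketch only: poll until the abort timer thread has marked the manager as aborted.
// `tsm` is assumed to be the TaskSetManager under test.
eventually(timeout(10.seconds)) {
  assert(tsm.isZombie)
}
```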
core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
    assert(taskScheduler.unschedulableTaskSetToExpiryTime.size == 0)

    val tsm2 = stageToMockTaskSetManager(1)
    val failedTask2 = secondTaskAttempts.find(_.executorId == "executor0").get
minor, you've only got one task attempt here, you could just do
val failedTask2 = secondTaskAttempts.head

    )).flatten

    // Fail the running task
    val failedTask = taskAttempts.find(_.executorId == "executor0").get
same here
Test build #98107 has finished for PR 22288 at commit

@dhruve is this ready to review again?
@tgravescs I have fixed a nit and it's good to be reviewed. @squito I have updated the comment, let me know if it's okay. Thanks for the reviews.
You mentioned in the description that you did some manual testing -- since this has been through some changes since the initial versions, can you do that again? Please be sure to run some manual tests with (a) flawed jobs on a small cluster, so it really should abort, and (b) OK jobs but with a failed straggler when only one executor is still active, which should kill the executor and get a new one. If you've already done that on a relatively recent revision, that's fine too.
Test build #98325 has finished for PR 22288 at commit
@squito I have tested it again with both scenarios and I was able to verify the expected behavior. For the cases that are not covered in the PR, I will mention them in the JIRA.
tgravescs left a comment
+1
squito left a comment
lgtm other than a minor comment on a test
    // make an offer but we won't schedule anything yet as scheduler locality is still PROCESS_LOCAL
    assert(taskScheduler.resourceOffers(IndexedSeq(
      WorkerOffer("executor1", "host0", 1)
    )).flatten.isEmpty)
this is dependent on the system clock not advancing past the locality timeout. I've seen pauses on jenkins over 5 seconds in flaky tests -- either put in a manual clock or just increase the locality timeout in this test to avoid flakiness here
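As a sketch of the second option, assuming the suite's existing setupScheduler helper (its exact signature here is an assumption): pin the locality wait far above any plausible CI pause so this offer can never jump locality levels mid-test.

```scala
// Sketch only: build the scheduler for this test with a locality wait well above
// any realistic Jenkins pause. `setupScheduler` is assumed to be the suite's helper
// that accepts extra conf key/value pairs.
val taskScheduler = setupScheduler("spark.locality.wait" -> "30s")
```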
Test build #98481 has finished for PR 22288 at commit

+1
…hang because of blacklisting
## What changes were proposed in this pull request?
Every time a task is unschedulable because of the condition where no. of task failures < no. of executors available, we currently abort the taskSet - failing the job. This change tries to acquire new executors so that we can complete the job successfully. We try to acquire a new executor only when we can kill an existing idle executor. We fallback to the older implementation where we abort the job if we cannot find an idle executor.
## How was this patch tested?
I performed some manual tests to check and validate the behavior.
```scala
val rdd = sc.parallelize(Seq(1 to 10), 3)
import org.apache.spark.TaskContext
val mapped = rdd.mapPartitionsWithIndex ( (index, iterator) => { if (index == 2) { Thread.sleep(30 * 1000); val attemptNum = TaskContext.get.attemptNumber; if (attemptNum < 3) throw new Exception("Fail for blacklisting")}; iterator.toList.map (x => x + " -> " + index).iterator } )
mapped.collect
```
Closes #22288 from dhruve/bug/SPARK-22148.
Lead-authored-by: Dhruve Ashar <[email protected]>
Co-authored-by: Dhruve Ashar <[email protected]>
Co-authored-by: Tom Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
(cherry picked from commit fdd3bac)
Signed-off-by: Thomas Graves <[email protected]>
merged to master and 2.4 branch, thanks @dhruve
Thanks for the reviews and feedback @tgravescs , @squito ! |