[SPARK-22148][SPARK-15815][Scheduler] Acquire new executors to avoid hang because of blacklisting #22288
Changes from all commits: 5253b31, 87c4e57, ffbc9c3, 4c88168, 640825a, aae7e87, 43e0af2, 2ac135b, a12a3fb, 4ce7610, c361693, 2c5a753, 69c156b, d2af73d, b2d0d40, ec38029, 4a5ea82, 9b2aeaf, aac1e9e, 676be55, a30276f
```diff
@@ -35,7 +35,7 @@ import org.apache.spark.rpc.RpcEndpoint
 import org.apache.spark.scheduler.SchedulingMode.SchedulingMode
 import org.apache.spark.scheduler.TaskLocality.TaskLocality
 import org.apache.spark.storage.BlockManagerId
-import org.apache.spark.util.{AccumulatorV2, ThreadUtils, Utils}
+import org.apache.spark.util.{AccumulatorV2, SystemClock, ThreadUtils, Utils}

 /**
  * Schedules tasks for multiple types of clusters by acting through a SchedulerBackend.
```
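The only change in this hunk is the extra `SystemClock` import, which backs the `clock` field added below and is used to read the current time when computing abort deadlines. For readers unfamiliar with the pattern, here is a small, self-contained sketch of why schedulers tend to go through a clock object rather than calling `System.currentTimeMillis()` directly; the `MyClock` names are invented for this sketch and are not Spark's package-private `Clock` classes. A pluggable clock lets tests drive time forward deterministically.

```scala
// Minimal sketch of the clock-abstraction pattern; MyClock/MySystemClock/MyManualClock
// are hypothetical names, not the classes in org.apache.spark.util.
trait MyClock {
  def getTimeMillis(): Long
}

class MySystemClock extends MyClock {
  override def getTimeMillis(): Long = System.currentTimeMillis()
}

// A test-only clock whose time is advanced explicitly, so timeout logic can be
// exercised deterministically instead of sleeping in tests.
class MyManualClock(private var now: Long = 0L) extends MyClock {
  override def getTimeMillis(): Long = now
  def advance(ms: Long): Unit = { now += ms }
}

object ClockSketch {
  def main(args: Array[String]): Unit = {
    val clock = new MyManualClock()
    val deadline = clock.getTimeMillis() + 1000L
    clock.advance(1500L)
    println(s"expired = ${clock.getTimeMillis() >= deadline}") // expired = true
  }
}
```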
```diff
@@ -117,6 +117,11 @@ private[spark] class TaskSchedulerImpl(

   protected val executorIdToHost = new HashMap[String, String]

+  private val abortTimer = new Timer(true)
+  private val clock = new SystemClock
+  // Exposed for testing
+  val unschedulableTaskSetToExpiryTime = new HashMap[TaskSetManager, Long]
+
   // Listener object to pass upcalls into
   var dagScheduler: DAGScheduler = null
```
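The `abortTimer` added here is a `java.util.Timer` created with `isDaemon = true`, so its background thread will not keep the JVM alive, and scheduled `TimerTask`s fire once after a fixed delay unless cancelled first. A minimal, Spark-free sketch of just those primitives:

```scala
import java.util.{Timer, TimerTask}

object DaemonTimerSketch {
  def main(args: Array[String]): Unit = {
    // `true` makes the timer thread a daemon, mirroring `new Timer(true)` in the patch.
    val abortTimer = new Timer(true)

    abortTimer.schedule(new TimerTask {
      override def run(): Unit = println("timeout fired, would abort the task set here")
    }, 1000L) // delay in milliseconds

    Thread.sleep(1500L) // give the scheduled task a chance to fire in this demo
    abortTimer.cancel() // releases the timer thread, mirroring the cancel() added later in the diff
  }
}
```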
```diff
@@ -415,9 +420,53 @@ private[spark] class TaskSchedulerImpl(
         launchedAnyTask |= launchedTaskAtCurrentMaxLocality
       } while (launchedTaskAtCurrentMaxLocality)
     }

     if (!launchedAnyTask) {
-      taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
+      taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors).foreach { taskIndex =>
+        // If the taskSet is unschedulable we try to find an existing idle blacklisted
+        // executor. If we cannot find one, we abort immediately. Else we kill the idle
```
**Contributor:** I'm a little worried that the idle condition will be too strict in some scenarios: if there is a large backlog of tasks from another taskset, or if, whatever the error is, the tasks take a while to fail (e.g. you've really got a bad executor, but it's not apparent until after network timeouts or something). That could happen if you're doing a big join and, while preparing the input on the map side, one side has just one straggler left but the other side still has a big backlog of tasks. Or in a jobserver-style situation, where there are always other tasksets coming in. That said, I don't have any better ideas at the moment, and I still think this is an improvement.

**Author:** By clearing the abort timer as soon as a task is launched we are relaxing this situation.

**Contributor:** I don't think this is true -- if there is no idle executor here, you abort the taskset immediately; you're not starting any timer (that's the case lower down). I think to do what you described, you would instead need to do something different in that case, like start the same abortTimer, and also set a flag.
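To make the reviewer's last suggestion a bit more concrete, here is a hedged sketch of one possible reading of it. This is not code from the PR; `onNoIdleBlacklistedExecutor`, `waitingForReplacement`, and the plain `String` task-set ids are all invented for illustration. The idea is: instead of aborting immediately when no idle blacklisted executor can be found, record a flag and arm the same abort timer, so the task set is only aborted if nothing becomes schedulable before the timeout.

```scala
import java.util.{Timer, TimerTask}
import scala.collection.mutable

// Hypothetical illustration of the reviewer's suggestion, not the PR's actual behaviour.
object DeferredAbortSketch {
  private val abortTimer = new Timer(true)
  private val expiryTime = mutable.HashMap[String, Long]()   // taskSetId -> deadline (ms)
  private val waitingForReplacement = mutable.Set[String]()  // sets with no idle executor yet

  def onNoIdleBlacklistedExecutor(taskSetId: String, timeoutMs: Long): Unit = {
    // Suggested alternative to aborting right away: mark the task set and arm the timer.
    waitingForReplacement += taskSetId
    expiryTime(taskSetId) = System.currentTimeMillis() + timeoutMs
    abortTimer.schedule(new TimerTask {
      override def run(): Unit = {
        if (waitingForReplacement.contains(taskSetId) &&
            expiryTime.get(taskSetId).exists(_ <= System.currentTimeMillis())) {
          println(s"Aborting $taskSetId: no schedulable executor appeared within the timeout")
        } else {
          this.cancel() // something was scheduled in the meantime; stand down
        }
      }
    }, timeoutMs)
  }

  def main(args: Array[String]): Unit = {
    onNoIdleBlacklistedExecutor("TaskSet_0.0", timeoutMs = 200L)
    Thread.sleep(500L) // let the demo timer fire
  }
}
```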
```diff
+        // executor and kick off an abortTimer which if it doesn't schedule a task within the
+        // the timeout will abort the taskSet if we were unable to schedule any task from the
+        // taskSet.
+        // Note 1: We keep track of schedulability on a per taskSet basis rather than on a per
+        // task basis.
+        // Note 2: The taskSet can still be aborted when there are more than one idle
+        // blacklisted executors and dynamic allocation is on. This can happen when a killed
+        // idle executor isn't replaced in time by ExecutorAllocationManager as it relies on
+        // pending tasks and doesn't kill executors on idle timeouts, resulting in the abort
+        // timer to expire and abort the taskSet.
+        executorIdToRunningTaskIds.find(x => !isExecutorBusy(x._1)) match {
```
**Author:** I was preferring the code to be more readable. As this isn't a frequently running scenario, maybe we could keep it. Thoughts?

**Contributor:** Sure, I thought the name …
```diff
+          case Some ((executorId, _)) =>
+            if (!unschedulableTaskSetToExpiryTime.contains(taskSet)) {
+              blacklistTrackerOpt.foreach(blt => blt.killBlacklistedIdleExecutor(executorId))
+
+              val timeout = conf.get(config.UNSCHEDULABLE_TASKSET_TIMEOUT) * 1000
+              unschedulableTaskSetToExpiryTime(taskSet) = clock.getTimeMillis() + timeout
+              logInfo(s"Waiting for $timeout ms for completely "
+                + s"blacklisted task to be schedulable again before aborting $taskSet.")
+              abortTimer.schedule(
+                createUnschedulableTaskSetAbortTimer(taskSet, taskIndex), timeout)
+            }
+          case None => // Abort Immediately
+            logInfo("Cannot schedule any task because of complete blacklisting. No idle" +
+              s" executors can be found to kill. Aborting $taskSet." )
+            taskSet.abortSinceCompletelyBlacklisted(taskIndex)
+        }
+      }
+    } else {
+      // We want to defer killing any taskSets as long as we have a non blacklisted executor
+      // which can be used to schedule a task from any active taskSets. This ensures that the
+      // job can make progress.
+      // Note: It is theoretically possible that a taskSet never gets scheduled on a
+      // non-blacklisted executor and the abort timer doesn't kick in because of a constant
+      // submission of new TaskSets. See the PR for more details.
+      if (unschedulableTaskSetToExpiryTime.nonEmpty) {
+        logInfo("Clearing the expiry times for all unschedulable taskSets as a task was " +
+          "recently scheduled.")
+        unschedulableTaskSetToExpiryTime.clear()
+      }
+    }

     if (launchedAnyTask && taskSet.isBarrier) {
       // Check whether the barrier tasks are partially launched.
       // TODO SPARK-24818 handle the assert failure case (that can happen when some locality
```
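The wait before aborting comes from `conf.get(config.UNSCHEDULABLE_TASKSET_TIMEOUT) * 1000`, i.e. a timeout configured in seconds and converted to milliseconds. As a hedged usage sketch, a user wanting a longer grace period might set the corresponding property when building the `SparkConf`; the key name `spark.scheduler.blacklist.unschedulableTaskSetTimeout` below is an assumption inferred from the constant's name, not something this diff confirms.

```scala
import org.apache.spark.SparkConf

object UnschedulableTimeoutConfExample {
  def main(args: Array[String]): Unit = {
    // The timeout key name is an assumption for illustration; verify against the
    // UNSCHEDULABLE_TASKSET_TIMEOUT constant in your Spark version.
    val conf = new SparkConf()
      .setAppName("blacklist-timeout-example")
      .set("spark.blacklist.enabled", "true")
      .set("spark.scheduler.blacklist.unschedulableTaskSetTimeout", "300s")
    println(conf.get("spark.scheduler.blacklist.unschedulableTaskSetTimeout"))
  }
}
```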
```diff
@@ -453,6 +502,23 @@ private[spark] class TaskSchedulerImpl(
     return tasks
   }

+  private def createUnschedulableTaskSetAbortTimer(
+      taskSet: TaskSetManager,
+      taskIndex: Int): TimerTask = {
+    new TimerTask() {
+      override def run() {
+        if (unschedulableTaskSetToExpiryTime.contains(taskSet) &&
+            unschedulableTaskSetToExpiryTime(taskSet) <= clock.getTimeMillis()) {
+          logInfo("Cannot schedule any task because of complete blacklisting. " +
+            s"Wait time for scheduling expired. Aborting $taskSet.")
+          taskSet.abortSinceCompletelyBlacklisted(taskIndex)
+        } else {
+          this.cancel()
+        }
+      }
+    }
+  }
+
   /**
    * Shuffle offers around to avoid always placing tasks on the same workers. Exposed to allow
    * overriding in tests, so it can be deterministic.
```
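Taken together, the moving parts are: record an expiry time per task set, schedule a timer task for that deadline, abort only if the entry is still present and past due when the task fires (otherwise the timer task cancels itself), and clear the whole map whenever any task is launched so pending timers are defused. Below is a self-contained sketch of that pattern outside Spark; every name in it is invented for illustration, and plain `String` ids plus `println` stand in for `TaskSetManager` and `logInfo`.

```scala
import java.util.{Timer, TimerTask}
import scala.collection.mutable

// Standalone sketch of the "expiry map + abort timer" pattern used by the patch.
object UnschedulableAbortPatternSketch {
  private val abortTimer = new Timer(true)
  private val expiryTimes = mutable.HashMap[String, Long]() // taskSetName -> deadline (ms)

  def onCompletelyBlacklisted(taskSetName: String, timeoutMs: Long): Unit = {
    if (!expiryTimes.contains(taskSetName)) {
      expiryTimes(taskSetName) = System.currentTimeMillis() + timeoutMs
      abortTimer.schedule(newAbortTask(taskSetName), timeoutMs)
    }
  }

  // Called whenever any task from any task set is launched.
  def onTaskLaunched(): Unit = expiryTimes.clear()

  private def newAbortTask(taskSetName: String): TimerTask = new TimerTask {
    override def run(): Unit = {
      // Abort only if the deadline is still registered and has passed; otherwise the
      // task set became schedulable again and this timer task quietly retires.
      if (expiryTimes.get(taskSetName).exists(_ <= System.currentTimeMillis())) {
        println(s"Aborting $taskSetName: wait time for scheduling expired")
      } else {
        this.cancel()
      }
    }
  }

  def main(args: Array[String]): Unit = {
    onCompletelyBlacklisted("TaskSet_0.0", timeoutMs = 200L)
    // Uncomment to see the timer defused instead of firing:
    // onTaskLaunched()
    Thread.sleep(500L)
  }
}
```

The real implementation additionally synchronizes access to the map and kills an idle blacklisted executor before arming the timer; those details are omitted here to keep the sketch short.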
```diff
@@ -590,6 +656,7 @@ private[spark] class TaskSchedulerImpl(
       barrierCoordinator.stop()
     }
     starvationTimer.cancel()
+    abortTimer.cancel()
   }

   override def defaultParallelism(): Int = backend.defaultParallelism()
```