
Commit e8c06af

wypoon authored and squito committed
[SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost
### What changes were proposed in this pull request?

If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor, but it does not unregister the executor's outputs if the external shuffle service is used. However, if the node on which the executor runs is lost, the shuffle service may not be able to serve the shuffle files. In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and, by rights, should unregister its outputs. It doesn't, because the epoch used to track the executor failure has not increased.

We track the epoch for failed executors that result in lost file output separately, so we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros.

### Why are the changes needed?

Without the changes, the loss of a node could require two stage attempts to recover instead of one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test. This test fails without the change and passes with it.

Closes #28848 from wypoon/SPARK-32003.

Authored-by: Wing Yew Poon <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
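To make the mechanism above concrete, here is a condensed, self-contained sketch of the two-epoch bookkeeping. The map names mirror the patch (`executorFailureEpoch`, `shuffleFileLostEpoch`); the surrounding class and method are hypothetical and simplified, not the actual `DAGScheduler` code shown in the diff below.

```scala
import scala.collection.mutable.HashMap

// Simplified model of the bookkeeping described above; illustrative only.
class EpochTracker {
  private val executorFailureEpoch = new HashMap[String, Long]
  private val shuffleFileLostEpoch = new HashMap[String, Long]

  /**
   * Returns (removeExecutor, unregisterShuffleOutputs) for a failure event observed at
   * `currentEpoch`. `fileLost` is true when we believe the shuffle files are gone
   * (no external shuffle service, the whole worker was lost, or a FetchFailed was seen).
   */
  def onFailure(execId: String, fileLost: Boolean, currentEpoch: Long): (Boolean, Boolean) = {
    val removeExecutor =
      !executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < currentEpoch
    if (removeExecutor) {
      executorFailureEpoch(execId) = currentEpoch
    }
    // Tracked separately: a fetch failure in the same epoch as an earlier executor-lost
    // event can still trigger unregistration of the outputs, exactly once per epoch.
    val unregisterOutputs = fileLost &&
      (!shuffleFileLostEpoch.contains(execId) || shuffleFileLostEpoch(execId) < currentEpoch)
    if (unregisterOutputs) {
      shuffleFileLostEpoch(execId) = currentEpoch
    }
    (removeExecutor, unregisterOutputs)
  }
}
```

With an external shuffle service, the initial executor-lost event arrives with `fileLost = false`, so only `executorFailureEpoch` advances; a fetch failure observed later in the same epoch arrives with `fileLost = true` and therefore still unregisters the outputs, while any further fetch failures in that epoch are ignored.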
1 parent 04bf351 commit e8c06af

2 files changed: +148 −50 lines changed

core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Lines changed: 67 additions & 33 deletions
@@ -170,13 +170,34 @@ private[spark] class DAGScheduler(
    */
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with
-  // every task. When we detect a node failing, we note the current epoch number and failed
-  // executor, increment it for new tasks, and use this to ignore stray ShuffleMapTask results.
-  //
-  // TODO: Garbage collect information about failure epochs when we know there are no more
-  // stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  /**
+   * Tracks the latest epoch of a fully processed error related to the given executor. (We use
+   * the MapOutputTracker's epoch number, which is sent with every task.)
+   *
+   * When an executor fails, it can affect the results of many tasks, and we have to deal with
+   * all of them consistently. We don't simply ignore all future results from that executor,
+   * as the failures may have been transient; but we also don't want to "overreact" to follow-
+   * on errors we receive. Furthermore, we might receive notification of a task success, after
+   * we find out the executor has actually failed; we'll assume those successes are, in fact,
+   * simply delayed notifications and the results have been lost, if the tasks started in the
+   * same or an earlier epoch. In particular, we use this to control when we tell the
+   * BlockManagerMaster that the BlockManager has been lost.
+   */
+  private val executorFailureEpoch = new HashMap[String, Long]
+
+  /**
+   * Tracks the latest epoch of a fully processed error where shuffle files have been lost from
+   * the given executor.
+   *
+   * This is closely related to executorFailureEpoch. They only differ for the executor when
+   * there is an external shuffle service serving shuffle files and we haven't been notified that
+   * the entire worker has been lost. In that case, when an executor is lost, we do not update
+   * the shuffleFileLostEpoch; we wait for a fetch failure. This way, if only the executor
+   * fails, we do not unregister the shuffle data as it can still be served; but if there is
+   * a failure in the shuffle service (resulting in fetch failure), we unregister the shuffle
+   * data only once, even if we get many fetch failures.
+   */
+  private val shuffleFileLostEpoch = new HashMap[String, Long]
 
   private [scheduler] val outputCommitCoordinator = env.outputCommitCoordinator
 
@@ -1566,7 +1587,8 @@ private[spark] class DAGScheduler(
             val status = event.result.asInstanceOf[MapStatus]
             val execId = status.location.executorId
             logDebug("ShuffleMapTask finished on " + execId)
-            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
+            if (executorFailureEpoch.contains(execId) &&
+                smt.epoch <= executorFailureEpoch(execId)) {
               logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
             } else {
               // The epoch of the task is acceptable (i.e., the task was launched after the most
@@ -1912,12 +1934,8 @@ private[spark] class DAGScheduler(
    * modify the scheduler's internal state. Use executorLost() to post a loss event from outside.
    *
    * We will also assume that we've lost all shuffle blocks associated with the executor if the
-   * executor serves its own blocks (i.e., we're not using external shuffle), the entire executor
-   * process is lost (likely including the shuffle service), or a FetchFailed occurred, in which
-   * case we presume all shuffle data related to this executor to be lost.
-   *
-   * Optionally the epoch during which the failure was caught can be passed to avoid allowing
-   * stray fetch failures from possibly retriggering the detection of a node as lost.
+   * executor serves its own blocks (i.e., we're not using an external shuffle service), or the
+   * entire Standalone worker is lost.
    */
   private[scheduler] def handleExecutorLost(
       execId: String,
@@ -1933,29 +1951,44 @@ private[spark] class DAGScheduler(
       maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * @param execId executor to be removed
+   * @param fileLost If true, indicates that we assume we've lost all shuffle blocks associated
+   *   with the executor; this happens if the executor serves its own blocks (i.e., we're not
+   *   using an external shuffle service), the entire Standalone worker is lost, or a FetchFailed
+   *   occurred (in which case we presume all shuffle data related to this executor to be lost).
+   * @param hostToUnregisterOutputs (optional) executor host if we're unregistering all the
+   *   outputs on the host
+   * @param maybeEpoch (optional) the epoch during which the failure was caught (this prevents
+   *   reprocessing for follow-on fetch failures)
+   */
   private def removeExecutorAndUnregisterOutputs(
       execId: String,
       fileLost: Boolean,
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
-    if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
-      failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
+    if (!executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < currentEpoch) {
+      executorFailureEpoch(execId) = currentEpoch
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-    } else {
-      logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+      clearCacheLocs()
+    }
+    if (fileLost &&
+        (!shuffleFileLostEpoch.contains(execId) || shuffleFileLostEpoch(execId) < currentEpoch)) {
+      shuffleFileLostEpoch(execId) = currentEpoch
+      hostToUnregisterOutputs match {
+        case Some(host) =>
+          logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnHost(host)
+        case None =>
+          logInfo(s"Shuffle files lost for executor: $execId (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnExecutor(execId)
       }
     }
   }
@@ -1981,11 +2014,12 @@ private[spark] class DAGScheduler(
   }
 
   private[scheduler] def handleExecutorAdded(execId: String, host: String): Unit = {
-    // remove from failedEpoch(execId) ?
-    if (failedEpoch.contains(execId)) {
+    // remove from executorFailureEpoch(execId) ?
+    if (executorFailureEpoch.contains(execId)) {
       logInfo("Host added was in lost list earlier: " + host)
-      failedEpoch -= execId
+      executorFailureEpoch -= execId
     }
+    shuffleFileLostEpoch -= execId
   }
 
   private[scheduler] def handleStageCancellation(stageId: Int, reason: Option[String]): Unit = {
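As the new shuffleFileLostEpoch comment notes, the two epoch maps only diverge when shuffle files can outlive the executor, i.e., when an external shuffle service is serving them. A minimal configuration sketch of that deployment mode, using standard Spark settings that are not part of this patch:

```scala
import org.apache.spark.SparkConf

// Standard Spark setting (not introduced by this patch): with the external shuffle
// service enabled, shuffle blocks are served by a per-node service, so losing an
// executor does not by itself mean its map outputs are gone.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
```

The test added in DAGSchedulerSuite below enables the same setting via `config.SHUFFLE_SERVICE_ENABLED`.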

core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala

Lines changed: 81 additions & 17 deletions
@@ -25,6 +25,9 @@ import scala.annotation.meta.param
 import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Map}
 import scala.util.control.NonFatal
 
+import org.mockito.Mockito.spy
+import org.mockito.Mockito.times
+import org.mockito.Mockito.verify
 import org.scalatest.concurrent.{Signaler, ThreadSignaler, TimeLimits}
 import org.scalatest.exceptions.TestFailedException
 import org.scalatest.time.SpanSugar._
@@ -235,6 +238,7 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
 
   var sparkListener: EventInfoRecordingListener = null
 
+  var blockManagerMaster: BlockManagerMaster = null
   var mapOutputTracker: MapOutputTrackerMaster = null
   var broadcastManager: BroadcastManager = null
   var securityMgr: SecurityManager = null
@@ -248,17 +252,18 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
    */
   val cacheLocations = new HashMap[(Int, Int), Seq[BlockManagerId]]
   // stub out BlockManagerMaster.getLocations to use our cacheLocations
-  val blockManagerMaster = new BlockManagerMaster(null, null, conf, true) {
-    override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
-      blockIds.map {
-        _.asRDDId.map(id => (id.rddId -> id.splitIndex)).flatMap(key => cacheLocations.get(key)).
-          getOrElse(Seq())
-      }.toIndexedSeq
-    }
-    override def removeExecutor(execId: String): Unit = {
-      // don't need to propagate to the driver, which we don't have
-    }
+  class MyBlockManagerMaster(conf: SparkConf) extends BlockManagerMaster(null, null, conf, true) {
+    override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
+      blockIds.map {
+        _.asRDDId.map { id => (id.rddId -> id.splitIndex)
+        }.flatMap { key => cacheLocations.get(key)
+        }.getOrElse(Seq())
+      }.toIndexedSeq
     }
+    override def removeExecutor(execId: String): Unit = {
+      // don't need to propagate to the driver, which we don't have
+    }
+  }
 
   /** The list of results that DAGScheduler has collected. */
   val results = new HashMap[Int, Any]()
@@ -276,6 +281,16 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     override def jobFailed(exception: Exception): Unit = { failure = exception }
   }
 
+  class MyMapOutputTrackerMaster(
+      conf: SparkConf,
+      broadcastManager: BroadcastManager)
+    extends MapOutputTrackerMaster(conf, broadcastManager, true) {
+
+    override def sendTracker(message: Any): Unit = {
+      // no-op, just so we can stop this to avoid leaking threads
+    }
+  }
+
   override def beforeEach(): Unit = {
     super.beforeEach()
     init(new SparkConf())
@@ -293,11 +308,8 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     results.clear()
     securityMgr = new SecurityManager(conf)
     broadcastManager = new BroadcastManager(true, conf, securityMgr)
-    mapOutputTracker = new MapOutputTrackerMaster(conf, broadcastManager, true) {
-      override def sendTracker(message: Any): Unit = {
-        // no-op, just so we can stop this to avoid leaking threads
-      }
-    }
+    mapOutputTracker = spy(new MyMapOutputTrackerMaster(conf, broadcastManager))
+    blockManagerMaster = spy(new MyBlockManagerMaster(conf))
     scheduler = new DAGScheduler(
       sc,
       taskScheduler,
@@ -548,6 +560,56 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("SPARK-32003: All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()
+    val conf = new SparkConf()
+    conf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+    init(conf)
+
+    val shuffleMapRdd = new MyRDD(sc, 3, Nil)
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
+    val shuffleId = shuffleDep.shuffleId
+    val reduceRdd = new MyRDD(sc, 3, List(shuffleDep), tracker = mapOutputTracker)
+
+    submit(reduceRdd, Array(0, 1, 2))
+    // Map stage completes successfully,
+    // two tasks are run on an executor on hostA and one on an executor on hostB
+    completeShuffleMapStageSuccessfully(0, 0, 3, Seq("hostA", "hostA", "hostB"))
+    // Now the executor on hostA is lost
+    runEvent(ExecutorLost("hostA-exec", ExecutorExited(-100, false, "Container marked as failed")))
+    // Executor is removed but shuffle files are not unregistered
+    verify(blockManagerMaster, times(1)).removeExecutor("hostA-exec")
+    verify(mapOutputTracker, times(0)).removeOutputsOnExecutor("hostA-exec")
+
+    // The MapOutputTracker has all the shuffle files
+    val mapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
+    assert(mapStatuses.count(_ != null) === 3)
+    assert(mapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 2)
+    assert(mapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)
+
+    // Now a fetch failure from the lost executor occurs
+    complete(taskSets(1), Seq(
+      (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)
+    ))
+    // blockManagerMaster.removeExecutor is not called again
+    // but shuffle files are unregistered
+    verify(blockManagerMaster, times(1)).removeExecutor("hostA-exec")
+    verify(mapOutputTracker, times(1)).removeOutputsOnExecutor("hostA-exec")
+
+    // Shuffle files for hostA-exec should be lost
+    assert(mapStatuses.count(_ != null) === 1)
+    assert(mapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 0)
+    assert(mapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)
+
+    // Additional fetch failure from the executor does not result in further call to
+    // mapOutputTracker.removeOutputsOnExecutor
+    complete(taskSets(1), Seq(
+      (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 1, 0, "ignored"), null)
+    ))
+    verify(mapOutputTracker, times(1)).removeOutputsOnExecutor("hostA-exec")
+  }
+
   test("zero split job") {
     var numResults = 0
     var failureReason: Option[Exception] = None
@@ -765,8 +827,7 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     complete(taskSets(1), Seq(
       (Success, 42),
       (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)))
-    // this will get called
-    // blockManagerMaster.removeExecutor("hostA-exec")
+    verify(blockManagerMaster, times(1)).removeExecutor("hostA-exec")
     // ask the scheduler to try it again
     scheduler.resubmitFailedStages()
     // have the 2nd attempt pass
@@ -806,11 +867,14 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     submit(reduceRdd, Array(0))
     completeShuffleMapStageSuccessfully(0, 0, 1)
     runEvent(ExecutorLost("hostA-exec", event))
+    verify(blockManagerMaster, times(1)).removeExecutor("hostA-exec")
     if (expectFileLoss) {
+      verify(mapOutputTracker, times(1)).removeOutputsOnExecutor("hostA-exec")
       intercept[MetadataFetchFailedException] {
         mapOutputTracker.getMapSizesByExecutorId(shuffleId, 0)
       }
     } else {
+      verify(mapOutputTracker, times(0)).removeOutputsOnExecutor("hostA-exec")
       assert(mapOutputTracker.getMapSizesByExecutorId(shuffleId, 0).map(_._1).toSet ===
         HashSet(makeBlockManagerId("hostA"), makeBlockManagerId("hostB")))
     }
