
Conversation

@liupc

@liupc liupc commented Feb 17, 2020

What changes were proposed in this pull request?

As described in SPARK-30849, a Spark application will sometimes fail because it cannot fetch the mapStatuses broadcast block:

Job aborted due to stage failure: Task 18 in stage 2.0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 13819, xxxx , executor 8): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_9_piece1 of broadcast_9
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_9_piece1 of broadcast_9
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1287)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:775)
	at org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:775)
	at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
	at org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:712)
	at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:774)
	at org.apache.spark.MapOutputTrackerWorker.getStatuses(MapOutputTracker.scala:665)
	at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:603)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:57)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)

This is caused by the driver invalidating (destroying) the mapStatuses broadcast right after sending the broadcast id to an executor, but before the executor has actually fetched the broadcast block.

This PR fixes the issue by converting the resulting error into a fetch failure so that the stage can be retried.
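Below is a minimal sketch of the direction taken here, assuming the code lives inside the org.apache.spark package (as MapOutputTracker does, since MetadataFetchFailedException is package-private); the helper name asMetadataFetchFailure and the -1 reduceId placeholder are illustrative, not the exact patch:

import java.io.IOException
import scala.collection.JavaConverters._
import com.google.common.base.Throwables
import org.apache.spark.shuffle.MetadataFetchFailedException
import org.apache.spark.storage.BlockNotFoundException

// Wrap a map-status fetch so that a broadcast destroyed by the driver surfaces as a
// metadata fetch failure (which triggers a stage retry) instead of a plain IOException
// that is counted as an ordinary task failure.
def asMetadataFetchFailure[T](shuffleId: Int)(fetchStatuses: => T): T = {
  try {
    fetchStatuses  // e.g. getStatuses(shuffleId, conf) plus deserialization
  } catch {
    case e: IOException if
        Throwables.getCausalChain(e).asScala.exists(_.isInstanceOf[BlockNotFoundException]) =>
      throw new MetadataFetchFailedException(shuffleId, -1,
        s"Broadcast'ed map statuses for shuffle $shuffleId were destroyed: ${e.getMessage}")
  }
}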

Why are the changes needed?

Bugfix

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

Member

@Ngone51 Ngone51 left a comment


This looks like a valid concern to me.

After fetching a broadcast'ed MapStatus from the driver, that MapStatus can be destroyed concurrently when an error happens (e.g., a FetchFailed exception from a concurrent task would invalidate it). So, by the time we call value on the broadcast'ed MapStatus, it fails with an uncaught exception from Broadcast (saying the block has been lost) and fails the job.

We should catch this exception and throw FetchFailed instead, so that the stage can re-run.

But after reading the JIRA ticket and the error log, I'm also surprised: how does the same task hit this issue 4 times in a row? Is it just a coincidence, or am I missing something? @liupc

also cc @cloud-fan @jiangxb1987

s"partitions $startPartition-$endPartition")
val statuses = getStatuses(shuffleId, conf)
try {
val statuses = getStatuses(shuffleId, conf)

This comment was marked as spam.

@jiangxb1987
Contributor

Could you try to reproduce this issue on master branch first?

@liupc
Author

liupc commented Feb 24, 2020

But after reading the JIRA ticket and the error log, I'm also surprised: how does the same task hit this issue 4 times in a row? Is it just a coincidence, or am I missing something?

Currently, when registerMapOutput is called we invalidate the cached mapStatus broadcast, which means the broadcast is destroyed each time a retried task of the parent stage finishes. So if several tasks of the parent stage are retried, there is a high chance that we will encounter this issue.
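To make the timing concrete, here is a toy, Spark-free sketch of the race being described (all names are illustrative, not Spark APIs): a "driver" thread keeps replacing and destroying a cached broadcast, the way registerMapOutput invalidates it when retried parent-stage tasks finish, while an "executor" thread grabs a reference to the current broadcast and only dereferences it a moment later.

import java.util.concurrent.atomic.AtomicReference

object BroadcastRaceSketch {
  // Stand-in for a broadcast holding serialized map statuses.
  class CachedBroadcast(val id: Int) {
    @volatile var destroyed = false
    def value: String =
      if (destroyed) throw new java.io.IOException(s"Failed to get broadcast_$id")
      else s"mapStatuses from broadcast_$id"
  }

  def main(args: Array[String]): Unit = {
    val cache = new AtomicReference(new CachedBroadcast(0))

    // "Driver": every finished (retried) map task re-registers its output and
    // destroys the previously cached broadcast.
    val driver = new Thread(() => (1 to 1000).foreach { i =>
      val old = cache.getAndSet(new CachedBroadcast(i))
      old.destroyed = true
    })

    // "Executor": gets the current broadcast, then reads its value slightly later;
    // the destroy above can slip in between the two steps.
    val executor = new Thread(() => (1 to 1000).foreach { _ =>
      val b = cache.get()
      Thread.`yield`()
      try b.value catch { case e: java.io.IOException => println(e.getMessage) }
    })

    driver.start(); executor.start()
    driver.join(); executor.join()
  }
}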

@liupc
Copy link
Author

liupc commented Feb 24, 2020

Could you try to reproduce this issue on master branch first?

Yes, I found this case in an earlier Spark version, but I checked the latest code and it doesn't seem to be fixed. I will try to reproduce this issue on the master branch.

@Ngone51
Member

Ngone51 commented Feb 24, 2020

So if several tasks of the parent stage are retried, there is a high chance that we will encounter this issue.

But the max task failure count applies to the same task, rather than to several tasks in the same stage?

@liupc
Author

liupc commented Mar 5, 2020

But the max task failure count applies to the same task, rather than to several tasks in the same stage?

Yes. I mean the fetch failure from the executor may cause several map statuses to be removed and recomputed, so several tasks will be re-executed in the parent stage. Each time one of those tasks finishes, the cached mapStatuses broadcast is invalidated, so the task of child stage 2 described in the JIRA may repeatedly encounter the IOException. @Ngone51 I will try to write some tests or do some hacking to reproduce this issue.

@Ngone51
Member

Ngone51 commented Mar 5, 2020

Ok, I get your point now. Let me paraphrase it to see if I understand correctly:

Assuming we have stage0 finished while stage1 and stage2 are running concurrently and both depend on stage0.

A task from stage1 hits a FetchFailedException and causes stage0 to re-run. At the same time, task X in stage2 is still running. Since multiple tasks from stage0 are running at the same time, and each finished stage0 task invalidates the cached map status (destroying the broadcast), task X is very likely to hit the IOException (a.k.a. "Failed to get broadcast") after fetching the broadcasted map status from the driver, because stage0 tasks keep destroying the broadcast at the same time.

Also, on the TaskSetManager side, the exception is treated as a counted task failure (rather than a FetchFailed), so the task is retried and hits the same exception again and again.

@cloud-fan
Contributor

Is it possible to add a UT for it?

@liupc
Author

liupc commented Mar 9, 2020

Ok, I get your point now. Let me paraphrase it to see if I understand correctly:

Assuming we have stage0 finished while stage1 and stage2 are running concurrently and both depend on stage0.

A task from stage1 hits a FetchFailedException and causes stage0 to re-run. At the same time, task X in stage2 is still running. Since multiple tasks from stage0 are running at the same time, and each finished stage0 task invalidates the cached map status (destroying the broadcast), task X is very likely to hit the IOException (a.k.a. "Failed to get broadcast") after fetching the broadcasted map status from the driver, because stage0 tasks keep destroying the broadcast at the same time.

Also, on the TaskSetManager side, the exception is treated as a counted task failure (rather than a FetchFailed), so the task is retried and hits the same exception again and again.

That's it! Thanks for reviewing @Ngone51

@cloud-fan
Contributor

ok to test

} catch {
case e: IOException if
Throwables.getCausalChain(e).asScala.exists(_.isInstanceOf[BlockNotFoundException]) =>
mapStatuses.clear()
Contributor

Is it OK to clear out all the map status? Shouldn't we only drop the data of the current shuffle id?

Author

@liupc liupc Mar 10, 2020

@cloud-fan Good question! Yes, it's OK to clear all the map statuses, but I think dropping only the data of the current shuffle id would be enough. However, it seems we currently bind a global epoch to the MapOutputTracker: if one stage hits a FetchFailed, the epoch is updated, which clears the whole map statuses cache on the executor side.
Should we change this behavior? If so, maybe we can put up another PR for that.
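For reference, a rough and deliberately simplified sketch (not the actual Spark source) of the epoch behavior mentioned above: the driver bumps a global epoch whenever a fetch failure is registered, and the executor-side tracker drops its entire cached map-status table once it sees a newer epoch, not just the entry for the shuffle that failed.

import scala.collection.concurrent.TrieMap

// Simplified stand-in for the executor-side map-status cache keyed by shuffle id.
class EpochedMapStatusCache[S] {
  private val mapStatuses = new TrieMap[Int, Array[S]]()
  private var epoch = 0L
  private val epochLock = new AnyRef

  // Called when a task arrives carrying a newer epoch from the driver.
  def updateEpoch(newEpoch: Long): Unit = epochLock.synchronized {
    if (newEpoch > epoch) {
      epoch = newEpoch
      mapStatuses.clear()  // the whole cache is dropped, not only the failed shuffle
    }
  }

  def put(shuffleId: Int, statuses: Array[S]): Unit = { mapStatuses.put(shuffleId, statuses); () }
  def get(shuffleId: Int): Option[Array[S]] = mapStatuses.get(shuffleId)
}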

Member

@Ngone51 Ngone51 Mar 10, 2020

I think there's an implicit assumption that shuffle data are always randomly and evenly placed on nodes. That means any shuffle fetch failure can imply potential fetch failures for other shuffles in the future. So, currently, we always clear mapStatuses when a fetch failure happens.

Contributor

But here it's the broadcast-being-invalidated issue. I don't think that usually happens for a lot of shuffles at the same time.

@SparkQA

SparkQA commented Mar 9, 2020

Test build #119565 has finished for PR 27604 at commit 0a51c15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

We need to write a UT for this case.

fetchedStatuses = MapOutputTracker.deserializeMapStatuses(fetchedBytes, conf)
} catch {
case e: IOException if
Throwables.getCausalChain(e).asScala.exists(_.isInstanceOf[BlockNotFoundException]) =>
Member

nit: indent.

Member

Shall we logError here?

Author

Maybe we should print the logs in the Executor's exception handling? I checked the code, and it seems it doesn't print any logs for FetchFailedException now?

case t: Throwable if hasFetchFailure && !Utils.isFatalError(t) =>

Member

Maybe we should print the logs in the Executor's exception handling?

I don't understand what you mean by this.

I checked the code, and it seems it doesn't print any logs for FetchFailedException now?

Yeah, but I think this one is different. I'd like to have a way to distinguish these two kinds of fetch failure.

Author

@liupc liupc Mar 10, 2020

The MetadataFetchFailedException already contains the root cause message, so I think we can just print the logs when handling the fetch failure exception in the Executor class? What do you think?

Member

The MetadataFetchFailedException already contains the root cause message

Oh, I missed that... then it should be fine.

@liupc
Author

liupc commented Mar 10, 2020

We need to write a UT for this case.

I will add a UT later. @jiangxb1987

@SparkQA

SparkQA commented Mar 10, 2020

Test build #119603 has finished for PR 27604 at commit c8826c1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

Should we also update

val statuses = getStatuses(shuffleId, conf)
?

@liupc
Author

liupc commented Mar 20, 2020

Should we also update

val statuses = getStatuses(shuffleId, conf)

?

What do you mean?

@jiangxb1987
Contributor

The getStatuses method can now throw a MetadataFetchFailedException, so you should move every call to it into a try...catch block.

@liupc
Author

liupc commented May 14, 2020

The getStatuses method can now throw a MetadataFetchFailedException, so you should move every call to it into a try...catch block.

Done, thanks for the review @jiangxb1987

@SparkQA

SparkQA commented May 14, 2020

Test build #122608 has finished for PR 27604 at commit e9c46ca.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124026 has finished for PR 27604 at commit 2e11d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
