[SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block #27604
Conversation
This looks like a valid concern to me.
After an executor fetches the broadcasted MapStatus from the driver, the broadcast can be destroyed at the same time when an error happens (e.g. a FetchFailed exception from a concurrent task would invalidate that MapStatus). So, by the time we call value on the broadcasted MapStatus, it fails with an uncaught exception from Broadcast (saying the block has been lost) and fails the job.
We should catch this exception and throw FetchFailed instead, so that the stage can re-run.
But after reading the JIRA ticket and the error log, I'm also surprised: how does the same task hit this issue four times in a row? Is it just a coincidence, or am I missing something? @liupc
also cc @cloud-fan @jiangxb1987
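To make the proposed fix concrete, here is a minimal sketch of the catch-and-rethrow idea. The wrapper object, method name, and the readBroadcastedStatuses parameter are made up for illustration; the exception types and the Guava causal-chain check are the ones the diff below uses, and the sketch assumes it sits inside Spark's own source tree because these exception classes are package-private.

```scala
// Sketch only: assumes it lives in Spark's source tree, since
// MetadataFetchFailedException and friends are private[spark].
package org.apache.spark

import java.io.IOException

import scala.collection.JavaConverters._

import com.google.common.base.Throwables

import org.apache.spark.shuffle.MetadataFetchFailedException
import org.apache.spark.storage.BlockNotFoundException

private[spark] object MapStatusBroadcastSketch {
  // Hypothetical wrapper: readBroadcastedStatuses stands in for the code path in
  // MapOutputTrackerWorker that deserializes the broadcasted MapStatuses.
  def withFetchFailureOnLostBroadcast[T](shuffleId: Int)(readBroadcastedStatuses: => T): T = {
    try {
      readBroadcastedStatuses
    } catch {
      // The driver can destroy the MapStatuses broadcast (e.g. after a concurrent
      // FetchFailed invalidates the shuffle's map output) between handing out the
      // broadcast and the executor reading its blocks; that surfaces here as an
      // IOException caused by a BlockNotFoundException.
      case e: IOException if Throwables.getCausalChain(e).asScala
          .exists(_.isInstanceOf[BlockNotFoundException]) =>
        // Rethrow as a metadata fetch failure so the scheduler re-runs the map
        // stage instead of failing the job with an opaque IOException.
        throw new MetadataFetchFailedException(
          shuffleId, -1, s"Missing MapStatuses broadcast block: ${e.getMessage}")
    }
  }
}
```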
| s"partitions $startPartition-$endPartition") | ||
| val statuses = getStatuses(shuffleId, conf) | ||
| try { | ||
| val statuses = getStatuses(shuffleId, conf) |
Could you try to reproduce this issue on the master branch first?
currently, when
Yes, I found this case in an earlier Spark version, but I checked the newest code and it doesn't seem to be fixed. I will try to reproduce this issue on the master branch.
But the max task failure count applies to the same task rather than several tasks in the same stage?
Yes, I mean the fetch failure from the executor may cause several map statuses to be removed and recomputed, so several tasks will be re-executed in the parent stage. Each time one of those tasks finishes, the cached mapStatuses will be invalidated, so the task of child stage 2 described in the JIRA may repeatedly encounter the IOException. @Ngone51, I will try to write some tests or do some hacking to reproduce this issue.
Ok, I get your point now. Let me paraphrase it to see if I understand correctly: assume stage0 has finished while stage1 and stage2 are running concurrently and both depend on stage0. A task from stage1 hit … Also, in …
Is it possible to add a UT for it?
That's it! Thanks for reviewing @Ngone51

ok to test
    } catch {
      case e: IOException if
          Throwables.getCausalChain(e).asScala.exists(_.isInstanceOf[BlockNotFoundException]) =>
        mapStatuses.clear()
Is it OK to clear out all the map statuses? Shouldn't we only drop the data of the current shuffle id?
@cloud-fan Good question! Yes, it's OK to clear all the map statuses, but I think dropping just the data of the current shuffle id would be enough. However, we currently bind a global epoch to the MapOutputTracker: if one stage hits FetchFailed, the epoch is updated, which clears all the map statuses cached on the executor side.
Should we change this behavior? If so, maybe we can put up another PR for that.
I think there's an underlying assumption that shuffle data is always randomly and evenly placed on nodes. That means any shuffle fetch failure can imply potential fetch failures for other shuffles in the future. So, currently, we always clear mapStatuses when a fetch failure happens.
But here the issue is the broadcast becoming invalid. I don't think that usually happens for a lot of shuffles at the same time.
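For reference, a rough sketch of the two invalidation strategies being weighed here; mapStatuses models the executor-side cache in MapOutputTrackerWorker (keyed by shuffle id), and the helper name and value type are made up for illustration:

```scala
import java.util.concurrent.ConcurrentHashMap

import scala.collection.JavaConverters._
import scala.collection.concurrent

object MapStatusCacheSketch {
  // Stand-in for MapOutputTrackerWorker.mapStatuses; the value type is a placeholder
  // because MapStatus itself is package-private in Spark.
  val mapStatuses: concurrent.Map[Int, Array[AnyRef]] =
    new ConcurrentHashMap[Int, Array[AnyRef]]().asScala

  def invalidateOnLostBroadcast(shuffleId: Int, clearEverything: Boolean): Unit = {
    if (clearEverything) {
      // What the patch does today, matching the existing epoch-bump behavior on
      // FetchFailed: treat the failure as a hint of wider problems and drop every
      // cached shuffle's statuses.
      mapStatuses.clear()
    } else {
      // The narrower alternative raised above: only the shuffle whose broadcast
      // went missing needs to be refetched from the driver.
      mapStatuses.remove(shuffleId)
    }
  }
}
```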
Test build #119565 has finished for PR 27604 at commit
We need to write a UT for this case.
      fetchedStatuses = MapOutputTracker.deserializeMapStatuses(fetchedBytes, conf)
    } catch {
      case e: IOException if
          Throwables.getCausalChain(e).asScala.exists(_.isInstanceOf[BlockNotFoundException]) =>
nit: indent.
Shall we logError here?
Maybe print logs in the Executor exception handling? I checked the code; it seems it does not print any logs for FetchFailedException now.
    case t: Throwable if hasFetchFailure && !Utils.isFatalError(t) =>
> Maybe print logs in the Executor exception handling?

I don't understand what you mean by this.

> I checked the code; it seems it does not print any logs for FetchFailedException now.

Yeah, but I think this one is different. For me, I'd like to have a way to distinguish these two fetch failures.
The MetadataFetchFailedException already contains the root cause message; I think we can just print logs when handling the fetch failure exception in the Executor class. What do you think?
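A rough sketch of that suggestion (a fragment, not a standalone change: the surrounding names — task, taskId, execBackend, ser — are assumed from the existing fetch-failure branch in Executor.scala quoted above):

```scala
// Inside Executor.TaskRunner's exception handling (sketch only; names assumed
// from the surrounding code in Executor.scala):
case t: Throwable if hasFetchFailure && !Utils.isFatalError(t) =>
  val reason = task.context.fetchFailed.get.toTaskFailedReason
  // MetadataFetchFailedException's message already carries the root cause (here,
  // the missing MapStatuses broadcast block), so logging the reason is enough to
  // tell this apart from an ordinary shuffle-block fetch failure.
  logError(s"Task $taskId failed with a fetch failure: ${reason.toErrorString}")
  setTaskFinishedAndClearInterruptStatus()
  execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
```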
> The MetadataFetchFailedException already contains the root cause message

Oh, I missed that. Then it should be fine.
I will add a UT later. @jiangxb1987
Test build #119603 has finished for PR 27604 at commit
Should we also update …

What do you mean?

The …

Done, thanks for the review @jiangxb1987
Test build #122608 has finished for PR 27604 at commit
Test build #124026 has finished for PR 27604 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
As described in SPARK-30849, a Spark application will sometimes fail because it cannot fetch the mapStatuses broadcast block.
This is caused by the mapStatuses broadcast id being sent to the executor but the broadcast being invalidated by the driver before the executor actually fetches it.
This PR tries to fix this issue.
Why are the changes needed?
Bugfix
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT