
Conversation

@ericl ericl commented Sep 2, 2016

What changes were proposed in this pull request?

It seems that old shuffle map tasks hanging around after a stage resubmit will delete the intended shuffle output files on stop(), causing downstream stages to fail even after the resubmitted stage completes successfully. This can easily happen if the prior map task is blocked on a network timeout when its stage is resubmitted.

This can cause unnecessary stage resubmits, sometimes multiple times as fetch failures trigger a cascade of shuffle file invalidations, as well as confusing FetchFailure messages that report shuffle index files missing from the local disk.

Given that IndexShuffleBlockResolver commits data atomically, it seems unnecessary to ever delete committed task output: even in the rare case that a task is failed after it finishes committing shuffle output, it should be safe to retain that output.
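To make the failure mode concrete, here is a minimal Python sketch (not the actual Scala code; the file names and helper functions are illustrative) of the race described above: one attempt commits its shuffle output atomically, but a stale attempt's stop()-time cleanup (analogous to removeDataByMap) later deletes the committed files out from under downstream readers.

```python
import os
import tempfile

shuffle_dir = tempfile.mkdtemp()
data_file = os.path.join(shuffle_dir, "shuffle_0_0_0.data")

def commit_output(payload: bytes) -> None:
    """Write to a temp file, then atomically rename to the final path."""
    tmp = data_file + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
    os.replace(tmp, data_file)  # atomic rename on POSIX

def stale_task_cleanup() -> None:
    """What the pre-patch code did on stop(): delete final output unconditionally."""
    if os.path.exists(data_file):
        os.remove(data_file)

commit_output(b"map output")      # the successful (resubmitted) attempt commits
stale_task_cleanup()              # a hung attempt from the old stage finally stops
print(os.path.exists(data_file))  # prints False: committed output is gone
```

The committed file vanishing here is what surfaces downstream as a FetchFailure for a shuffle index file that "should" exist.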

How was this patch tested?

Prior to the fix proposed in #14931, I was able to reproduce this behavior by killing slaves in the middle of a large shuffle. After this patch, stages were no longer resubmitted multiple times due to shuffle index loss.

cc @JoshRosen @vanzin

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64824 has finished for PR 14932 at commit 350d3be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Context for other reviewers:

As of #9610 (Spark 1.5.3+), map tasks write their output to temporary files and then atomically rename those files into place at the final destination path. The IndexShuffleBlockResolver.removeDataByMap() call deletes the final map output files, not the temporary files. Therefore, if a map task fails before those final files are written, it does not need to call this cleanup method.

Output files are renamed to their final locations via the IndexShuffleBlockResolver.writeIndexFileAndCommit() method, which is called at the end of the shuffle write. If we assume that this method is atomic in case of failures (i.e. it doesn't leave any state around after failing) and also assume that no steps after writeIndexFileAndCommit() will fail, then it should be fine to omit these removeDataByMap calls in the error-handling cases.
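The flip side of that reasoning can be shown with a short Python sketch (illustrative only, not Spark code): when a task fails before the commit-time rename, nothing exists at the final path, so a cleanup that deletes only final files would have been a no-op anyway.

```python
import os
import tempfile

d = tempfile.mkdtemp()
index_file = os.path.join(d, "shuffle_0_0_0.index")

def write_and_commit(fail_before_commit: bool) -> None:
    """Write to a temp file; only the final rename publishes the output."""
    tmp = index_file + ".tmp"
    with open(tmp, "wb") as f:
        f.write(b"block offsets")
    if fail_before_commit:
        raise IOError("task failed before commit")
    os.replace(tmp, index_file)

try:
    write_and_commit(fail_before_commit=True)
except IOError:
    pass
print(os.path.exists(index_file))  # prints False: no final file to clean up
```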

Looking at writeIndexFileAndCommit, I think I spot a rare corner-case where it might behave non-atomically:

else {
  // This is the first successful attempt in writing the map outputs for this task,
  // so override any existing index and data files with the ones we wrote.
  if (indexFile.exists()) {
    indexFile.delete()
  }
  if (dataFile.exists()) {
    dataFile.delete()
  }
  if (!indexTmp.renameTo(indexFile)) {
    throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
  }
  if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
    throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
  }
}

I suppose it's possible that the index rename could succeed while the data file rename could fail. In this case we should probably handle cleanup of the data temp file.

Similarly, the existing code won't necessarily clean up indexTmp or dataTmp in error-conditions.

I don't think that either of these are huge issues in practice, but they might be worth fixing as long as we're removing an additional layer of error-handling defense as part of this PR.

@JoshRosen
Copy link
Contributor

@ericl and I discussed this and decided to address the file cleanup issues that I mentioned above in a separate PR. The issues outlined above can generally only occur once we've already run out of disk space, and the pre-existing leaks with the greatest potential to waste disk space are unaffected by this patch's changes (i.e. not made any worse).

I'm going to merge this to master for now and will submit my own followup patch to address the temp file cleanup issues.

@asfgit asfgit closed this in c07cbb3 Sep 6, 2016
