
Conversation

@ericl ericl commented Sep 2, 2016

What changes were proposed in this pull request?

It seems that old shuffle map tasks hanging around after a stage resubmit will delete the intended shuffle output files on stop(), causing downstream stages to fail even after the resubmitted stage completes successfully. This can easily happen if the prior map task is blocked on a network timeout when its stage is resubmitted.

This can cause unnecessary stage resubmits, sometimes multiple times as fetch failures trigger a cascade of shuffle file invalidations, as well as confusing FetchFailure messages that report shuffle index files missing from the local disk.

Given that IndexShuffleBlockResolver commits data atomically, it seems unnecessary to ever delete committed task output: even in the rare case that a task is failed after it finishes committing shuffle output, it should be safe to retain that output.
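To make the failure mode concrete, here is a minimal Python sketch (not the actual Scala code; the file names and helper functions are illustrative) of the race described above: one attempt commits its shuffle output atomically, but a stale attempt's stop()-time cleanup (analogous to removeDataByMap) later deletes the committed files out from under downstream readers.

```python
import os
import tempfile

shuffle_dir = tempfile.mkdtemp()
data_file = os.path.join(shuffle_dir, "shuffle_0_0_0.data")

def commit_output(payload: bytes) -> None:
    """Write to a temp file, then atomically rename to the final path."""
    tmp = data_file + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
    os.replace(tmp, data_file)  # atomic rename on POSIX

def stale_task_cleanup() -> None:
    """What the pre-patch code did on stop(): delete final output unconditionally."""
    if os.path.exists(data_file):
        os.remove(data_file)

commit_output(b"map output")      # the successful (resubmitted) attempt commits
stale_task_cleanup()              # a hung attempt from the old stage finally stops
print(os.path.exists(data_file))  # prints False: committed output is gone
```

The committed file vanishing here is what surfaces downstream as a FetchFailure for a shuffle index file that "should" exist.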

How was this patch tested?

Prior to the fix proposed in #14931, I was able to reproduce this behavior by killing slaves in the middle of a large shuffle. After this patch, stages were no longer resubmitted multiple times due to shuffle index loss.

cc @JoshRosen @vanzin

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64824 has finished for PR 14932 at commit 350d3be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Context for other reviewers:

As of #9610 (Spark 1.5.3+), map tasks write their output to temporary files and then atomically rename those files into place at the final destination path. The IndexShuffleBlockResolver.removeDataByMap() call deletes the final map output files, not the temporary files. Therefore, if a map task fails before those final files are written, it does not need to call this cleanup method.

Output files are renamed to their final locations via the IndexShuffleBlockResolver.writeIndexFileAndCommit() method, which is called at the end of the shuffle write. If we assume that this method is atomic in case of failures (i.e. it doesn't leave any state around after failing) and also assume that no steps after writeIndexFileAndCommit() will fail, then it should be fine to omit these removeDataByMap calls in the error-handling cases.
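The flip side of that reasoning can be shown with a short Python sketch (illustrative only, not Spark code): when a task fails before the commit-time rename, nothing exists at the final path, so a cleanup that deletes only final files would have been a no-op anyway.

```python
import os
import tempfile

d = tempfile.mkdtemp()
index_file = os.path.join(d, "shuffle_0_0_0.index")

def write_and_commit(fail_before_commit: bool) -> None:
    """Write to a temp file; only the final rename publishes the output."""
    tmp = index_file + ".tmp"
    with open(tmp, "wb") as f:
        f.write(b"block offsets")
    if fail_before_commit:
        raise IOError("task failed before commit")
    os.replace(tmp, index_file)

try:
    write_and_commit(fail_before_commit=True)
except IOError:
    pass
print(os.path.exists(index_file))  # prints False: no final file to clean up
```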

Looking at writeIndexFileAndCommit, I think I spot a rare corner-case where it might behave non-atomically:

else {
  // This is the first successful attempt in writing the map outputs for this task,
  // so override any existing index and data files with the ones we wrote.
  if (indexFile.exists()) {
    indexFile.delete()
  }
  if (dataFile.exists()) {
    dataFile.delete()
  }
  if (!indexTmp.renameTo(indexFile)) {
    throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
  }
  if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
    throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
  }
}

I suppose it's possible that the index rename could succeed while the data file rename could fail. In this case we should probably handle cleanup of the data temp file.

Similarly, the existing code won't necessarily clean up indexTmp or dataTmp in error-conditions.

I don't think that either of these are huge issues in practice, but they might be worth fixing as long as we're removing an additional layer of error-handling defense as part of this PR.

@JoshRosen
Copy link
Contributor

@ericl and I discussed this and decided to address the file cleanup issues that I mentioned above in a separate PR. The issues outlined above can generally only occur once we've already run out of disk space, and the pre-existing leaks with the greatest potential to waste disk space are unaffected by this patch's changes (i.e. not made any worse).

I'm going to merge this to master for now and will submit my own followup patch to address the temp file cleanup issues.

@asfgit asfgit closed this in c07cbb3 Sep 6, 2016
