[SPARK-21942][CORE] Fix DiskBlockManager crashing when a root local folder has been externally deleted #19154
What changes were proposed in this pull request?
The problem:
DiskBlockManager has a notion of "scratch" local folder(s), which can be configured via the spark.local.dir option and which default to the system's /tmp. The hierarchy is two-level, e.g. /blockmgr-XXX.../YY, where the YY part is a hash bit, used to spread files evenly.

The function DiskBlockManager.getFile expects the top-level directories (blockmgr-XXX...) to always exist (they are created once, when the Spark context is first created); otherwise it fails with an error.

However, this may not always be the case, in particular with the default /tmp folder: on certain operating systems it is cleaned on a regular basis (e.g. once per day via a system cron job).

The symptom is that after a process using Spark has been running for a while (a few days), it may no longer be able to load files, since the top-level scratch directories are gone and DiskBlockManager.getFile crashes.

The change/mitigation is simple: use File.mkdirs instead of File.mkdir inside getFile, so that the full path is created there, which handles the case where the parent directory no longer exists.

How was this patch tested?
I have added a falsifying unit test inside DiskBlockManagerSuite, which gets fixed via this patch.
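
For illustration, the difference between File.mkdir and File.mkdirs that the fix relies on can be sketched in plain Java. This is not Spark's code: the class MkdirsSketch, the helpers ensureSubDir and deletedRoot, and the directory names are hypothetical, and the sketch only assumes the documented behavior of java.io.File (mkdir creates a single directory and fails if the parent is missing; mkdirs also creates missing parents).

```java
import java.io.File;

public class MkdirsSketch {
    // Hypothetical helper (not Spark's getFile): make sure root/subDirName exists.
    // createParents = false mimics the old behavior (File.mkdir),
    // createParents = true mimics the patched behavior (File.mkdirs).
    static boolean ensureSubDir(File root, String subDirName, boolean createParents) {
        File subDir = new File(root, subDirName);
        if (subDir.isDirectory()) {
            return true;
        }
        boolean created = createParents ? subDir.mkdirs() : subDir.mkdir();
        // Re-check in case another thread created the directory in the meantime.
        return created || subDir.isDirectory();
    }

    // Simulates a root scratch directory that was created once and then
    // removed externally (e.g. by a /tmp cleanup job).
    static File deletedRoot() {
        File root = new File(System.getProperty("java.io.tmpdir"),
                "blockmgr-sketch-" + System.nanoTime());
        root.mkdir();
        root.delete();
        return root;
    }

    public static void main(String[] args) {
        File rootA = deletedRoot();
        // Parent is missing, so mkdir cannot create the sub-directory.
        System.out.println("mkdir:  " + ensureSubDir(rootA, "0c", false));

        File rootB = deletedRoot();
        // mkdirs recreates the missing parent as well.
        System.out.println("mkdirs: " + ensureSubDir(rootB, "0c", true));

        // Clean up the directories created by the sketch.
        new File(rootB, "0c").delete();
        rootB.delete();
    }
}
```

With the root directory missing, the mkdir variant reports failure while the mkdirs variant recreates both levels, which is the situation the falsifying test in DiskBlockManagerSuite is described as covering.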