Fix deletion of files in current working directory by clearFiles() #345
This addresses an issue where Spark could delete files in the current working directory that were added to the job using `addFile()`. I encountered this issue while working on PySpark's code deployment mechanism, which is based on `addFile()`.

From user code's perspective (e.g. UDFs), files added through `addFile()` are assumed to be in the current working directory. For jobs that are run locally using `DAGScheduler.runLocally()`, tasks run with the driver's current working directory, so files added through `addFile()` must be copied into the driver's current working directory (there is no mechanism to change the CWD in Java). `clearFiles()` and `clearJars()` clean up these files when the driver exits. This is a problem if the original files were in the driver's current working directory, because they will be deleted.

A long-term fix would be to hide the location of fetched files from user code by requiring it to access files through an API like `SparkFiles.get("my-file-name.txt")`. This would require changes to user code and may require changes to Shark.

As a short-term fix, this pull request removes the code that deletes files in the current working directory and adds checks to `Utils.fetchFiles()` to avoid overwriting existing local files with new data. The one downside of this change is that it may leave junk in the current working directory, but that is preferable to accidentally deleting files.

I've also added
`addFile()`/`addJar()` to the Java API.

I also added synchronization to `LocalScheduler.updateDependencies` to avoid performing multiple parallel fetches for the same file.
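The combination of the overwrite check and the fetch synchronization can be sketched roughly as below. This is an illustrative standalone class, not Spark's actual internals: `DependencyFetcher`, `download`, and `currentFiles` are hypothetical names introduced for the example.

```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: track the timestamp of each fetched file, skip files
// we already have, and never overwrite a pre-existing local file.
public class DependencyFetcher {
    // File URI -> timestamp of the version we last fetched.
    private final Map<String, Long> currentFiles = new HashMap<>();

    // Hypothetical downloader; a real implementation would copy data
    // from the URI. Here it just creates an empty file.
    private void download(String uri, File dest) throws IOException {
        dest.createNewFile();
    }

    // synchronized so two tasks cannot race to fetch the same file,
    // mirroring the lock added around updateDependencies.
    public synchronized void updateDependencies(Map<String, Long> newFiles,
                                                File targetDir)
            throws IOException {
        for (Map.Entry<String, Long> entry : newFiles.entrySet()) {
            String uri = entry.getKey();
            long timestamp = entry.getValue();
            if (currentFiles.getOrDefault(uri, -1L) >= timestamp) {
                continue; // Already fetched this version; skip.
            }
            File dest = new File(targetDir, new File(uri).getName());
            if (dest.exists()) {
                // Short-term fix: refuse to clobber an existing local file.
                System.out.println("File " + dest + " exists, not overwriting");
            } else {
                download(uri, dest);
            }
            currentFiles.put(uri, timestamp);
        }
    }
}
```

Because the whole update runs under the object's monitor, concurrent tasks that request the same file serialize on the lock, and the timestamp map ensures the second caller sees the file as already fetched rather than downloading it again.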