[SPARK-23966][SS] Refactoring all checkpoint file writing logic in a common CheckpointFileManager interface #21048
Conversation
brkyvz left a comment:
LGTM. Left a couple nits. This is awesome!
```scala
 */
package org.apache.spark.sql.execution.streaming

import java.io.{FileSystem => _, _}
```
whoa, what does `FileSystem => _` do?
whoa .. i dont know either... my intellij did some weird magic :/
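For the record, this is a Scala import-hiding trick: renaming a member to `_` in an import selector excludes it from the wildcard that follows. A minimal sketch (the demo object and file name are arbitrary, not part of the PR):

```scala
// Import everything from java.io EXCEPT FileSystem: the selector
// `FileSystem => _` renames it to nothing, so the trailing wildcard
// skips it. This avoids a name clash with org.apache.hadoop.fs.FileSystem
// when both packages are wildcard-imported in the same file.
import java.io.{FileSystem => _, _}

object ImportHidingDemo extends App {
  // java.io.File is still in scope via the wildcard...
  val f = new File("demo.txt")
  println(f.getName)
  // ...but the name FileSystem no longer refers to java.io.FileSystem here.
}
```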
```scala
    this(fm, path, generateTempPath(path), overwrite)
  }

  logInfo(s"Writing atomically to $finalPath using temp file $tempPath")
```
nit: logDebug
I was thinking of having it as logInfo, so that we can debug stuff if the renaming goes wrong in some way.
```scala
  }

  override def createAtomic(
      path: Path, overwriteIfPossible: Boolean): CancellableFSDataOutputStream = {
```
nit: separate lines please
```scala
  }

  override def createAtomic(
      path: Path, overwriteIfPossible: Boolean): CancellableFSDataOutputStream = {
```
ditto on two lines
```scala
  }

  try {
    if (!fs.rename(srcPath, dstPath) && !overwriteIfPossible) {
```
we should ensure that these are file paths and not directories, in case someone else tries to use these APIs elsewhere
Well, they should not. This is specifically designed for one purpose.
Also, generally in all implementations, rename fails if you are doing something nonsensical like renaming a file to a path where a directory already exists.
Essentially, I want to keep this code absolutely minimal such that it's easy to reason about and has minimal latency. Each check of whether it's a file or not will add to the latency. Most implementations will do those checks anyway. That's why there were patches in the past that removed unnecessary checks (e.g. e24f21b).
```scala
  }

  private object FakeFileSystem {
    val scheme = s"HDFSMetadataLogSuite${math.abs(Random.nextInt)}"
```
plagiarizer! :P
Hahaha.
```scala
    val store2 = provider.getStore(10)
    put(store2, "11", 11)
    store2.abort()
    assert(TestCheckpointFileManager.cancelCalledInCreateAtomic)
```
can you verify that it was false before the abort?
Test build #89223 has finished for PR 21048 at commit
Test build #89226 has finished for PR 21048 at commit
Test build #89227 has finished for PR 21048 at commit
Test build #89232 has finished for PR 21048 at commit
Test build #89282 has finished for PR 21048 at commit
Test build #89283 has finished for PR 21048 at commit
Test build #89287 has finished for PR 21048 at commit
```scala
    if (!fs.rename(srcPath, dstPath)) {
      // If overwriteIfPossible = false, then we want to find out why the rename failed and
      // try to throw the right error.
```
Fix this comment.
Test build #89310 has finished for PR 21048 at commit
| s"Failed to rename $srcPath to $dstPath as destination already exists") | ||
| } | ||
|
|
||
| if (!fs.rename(srcPath, dstPath)) { |
There's a lot of ambiguity about what it means when FileSystem.rename() returns false; FileContext.rename() is stricter here. One trouble spot in particular is "what if the source file is not there?", which normally gets downgraded to a return of false.
Proposed: add an fs.getFileStatus(srcPath) before the rename call, as it will raise an FNFE if the source file isn't visible. Low cost against HDFS; more against remote object stores, but S3 isn't going to use this one, is it?
Ideally, I would like the default implementation with rename to work correctly for all FileSystems, including S3AFileSystem (it may not be efficient, but it should be correct).
Also, it is important to note that this CheckpointFileManager is built not as a general-purpose common API across FileSystem and FileContext, but specifically for the purpose of checkpointing. So in this case, it is absolutely not expected that the source path (i.e. the temp file just written by RenameBasedFSDataOutputStream) is not present. So I don't think it's worth adding another RPC in the common path just to handle that unexpected case. What I can do is add another check below, after the rename returns false, to try to throw FileNotFoundException instead of the fallback IOException.
Makes sense.
As you are doing a one-file rename, the rename() will have the atomic-operation semantics you want everywhere; it's just O(data) on S3 and Swift. The direct write on both of those is what's critical, to avoid checkpoint writes blocking while that rename takes place.
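The diagnose-after-failure pattern being discussed can be sketched as follows. This is a hypothetical stand-in, not the PR's code; it uses java.nio so it is self-contained, whereas Hadoop's FileSystem.rename() reports these cases by returning false, so there the same probes would run after a failed rename:

```scala
import java.nio.file.{FileAlreadyExistsException, Files, NoSuchFileException, Paths}

// Hypothetical sketch of turning an ambiguous rename failure into a
// precise error. Hadoop's FileSystem.rename() downgrades both "source
// missing" and "destination exists" to `return false`; probing the two
// paths lets the caller throw a specific exception instead of a
// generic IOException.
def renameOrExplain(src: String, dst: String): Unit = {
  val srcPath = Paths.get(src)
  val dstPath = Paths.get(dst)
  if (!Files.exists(srcPath)) {
    throw new NoSuchFileException(s"Cannot rename $src: source does not exist")
  }
  if (Files.exists(dstPath)) {
    throw new FileAlreadyExistsException(
      s"Cannot rename $src to $dst: destination already exists")
  }
  // A single-file rename within one directory is atomic on POSIX filesystems.
  Files.move(srcPath, dstPath)
}
```

Note the probes add metadata round trips, which is exactly the latency cost weighed against diagnostic precision in this thread.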
```scala
      fs.delete(path, true)
    } catch {
      case e: FileNotFoundException =>
        logInfo(s"Failed to delete $path as it does not exist")
```
Should never be raised by a filesystem. Have you actually seen this?
Not really. This is just precautionary, as I really haven't seen all the code of all FileSystem implementations. It doesn't add any overhead.
Basic FS compliance tests, which even GCS now runs, require that delete just returns false if there's no path. I wouldn't personally worry about it, as if it did ever arise, something has seriously broken.
Well, some people have to deal with older versions of FileSystems, not just the latest "good" ones. :)
As I said, since there isn't any overhead, I am inclined to keep it in there.
```scala
  }

  override def exists(path: Path): Boolean = {
    fs.exists(path)
```
This is tagged as deprecated in Hadoop 3, as too much code was doing an exists() check alongside some other invocation of the same underlying code, e.g. `if (fs.exists(path) && fs.isFile(path)) { ... }`. If you call fs.getFileStatus() directly, you get an FNFE raised if it isn't there, and a FileStatus if it is, which can often be used.
Then basically I have to just copy the FileSystem.exists() code right here :)
Yeah, tough call. Having looked through much of the Spark/Hive etc. code, they do often do that exists() before delete, getFileStatus, ... which is why it's tagged as deprecated: to make clear that the workflow is often inefficient on non-HDFS, non-POSIX stores (on HDFS and POSIX it's fine).
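The single-lookup alternative described here can be sketched outside Hadoop too (shown with java.nio so it is self-contained; with Hadoop you would call fs.getFileStatus(path) and catch FileNotFoundException):

```scala
import java.nio.file.{Files, NoSuchFileException, Path}
import java.nio.file.attribute.BasicFileAttributes

// One metadata lookup instead of exists() followed by a second probe:
// fetch the attributes once, branch on the result, and treat the
// not-found exception as the "absent" case. On remote object stores
// this halves the number of round trips.
def isRegularFileOnce(p: Path): Boolean =
  try {
    Files.readAttributes(p, classOf[BasicFileAttributes]).isRegularFile
  } catch {
    case _: NoSuchFileException => false
  }
```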
```scala
    assert(!fm.exists(dir))
    fm.mkdirs(dir)
    assert(fm.exists(dir))
    fm.mkdirs(dir)
```
If that fm.exists() call were replaced with an fm.getFileStatus() operation, as suggested earlier, this assert could become `assert(fm.getFileStatus(dir).isDirectory)`.
This is a test. I think it is fine.
LGTM
@steveloughran @brkyvz @zsxwing Thanks for your comments :)
Test build #89357 has finished for PR 21048 at commit
I am merging this to master. Once again, thank you for your reviews.
```scala
  .doc("The class used to write checkpoint files atomically. This class must be a subclass " +
    "of the interface CheckpointFileManager.")
  .internal()
  .stringConf
```
@HeartSaVioR can you help to fix this minor issue?
This does not make any difference: we read the value from the Hadoop Configuration instance (not sure why). Changing this to read from SQLConf is no longer a minor change, as CheckpointFileManager doesn't get a SQLConf instance.
If this is being looked at again, it'd be nice to have a reference backend which just does the write straight to the destination: this is exactly what all the public cloud stores (S3, azure-*, GCS) offer as their atomic write. There'd be no need to make things clever and automatically switch; just make it something people can ask for.
```scala
    fc.create(path, EnumSet.of(CREATE, OVERWRITE))
    import Options._
    fc.create(
      path, EnumSet.of(CREATE, OVERWRITE), CreateOpts.checksumParam(ChecksumOpt.createDisabled()))
```
@tdas Is there any specific reason why we disabled the Checksum?
This was a long time ago, so I don't really remember. But my guess is that the presence or absence of checksum files can produce unexpected errors, hence we were avoiding that confusion.
Thanks for getting back on this. We hit a nasty bug in Hadoop with files with a null checksum which wouldn't allow the NodeManager to start, and wanted to disable this.
What changes were proposed in this pull request?
Checkpoint files (offset log files, state store files) in Structured Streaming must be written atomically such that no partial files are generated (that would break fault-tolerance guarantees). Currently, there are 3 locations which try to do this individually, and in some cases, incorrectly.

`FileSystem` or `FileContext` APIs. It preferably loads the `FileContext` implementation, as the `FileContext` of HDFS has atomic renames.

Current problems:

Solution:

This PR does that by introducing the interface `CheckpointFileManager` and modifying `HDFSMetadataLog` and `HDFSBackedStateStore` to use the interface. Similar to the earlier `FileManager`, there are implementations based on the `FileSystem` and `FileContext` APIs, and the latter implementation is preferred to make it work correctly with HDFS.

The key method this interface has is `createAtomic(path, overwrite)`, which returns a `CancellableFSDataOutputStream` that has the method `cancel()`. All users of this method need to either call `close()` to successfully write the file, or `cancel()` in case of an error.

How was this patch tested?

New tests in `CheckpointFileManagerSuite` and slightly modified existing tests.
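The close-or-cancel contract described above can be sketched with a toy stand-in (not the PR's actual classes; plain java.io, rename-on-close):

```scala
import java.io.{File, FileOutputStream, IOException, OutputStream}

// Toy sketch of the createAtomic()/cancel() contract: data is written
// to a temp file; close() commits it by renaming onto the final path;
// cancel() discards the temp file, so no partial file is ever visible
// at the final path.
class RenameOnCloseStream(finalFile: File) extends OutputStream {
  private val tempFile = new File(finalFile.getPath + ".tmp")
  private val out = new FileOutputStream(tempFile)
  private var terminated = false

  override def write(b: Int): Unit = out.write(b)

  override def close(): Unit = if (!terminated) {
    terminated = true
    out.close()
    // The rename is the atomic "commit": readers see either no file
    // or the complete file, never a partial one.
    if (!tempFile.renameTo(finalFile)) {
      throw new IOException(s"Failed to rename $tempFile to $finalFile")
    }
  }

  def cancel(): Unit = if (!terminated) {
    terminated = true
    out.close()
    tempFile.delete() // discard the partial write
  }
}
```

A caller then wraps the write in try/catch, calling close() on success and cancel() on any failure, which mirrors the contract stated in the description.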