[SPARK-41551][SQL] Dynamic/absolute path support in PathOutputCommitters #39185
Conversation
The latest commit adds parallel file/dir preparation and rename, for performance on slow/very slow cloud stores. As a consequence, it now needs better tests. FileOutputCommitter can be used for the work, though it doesn't do the other feature: IOStatistics collection.
The latest PR will save to a report dir the aggregate IOStats collected from the task attempts, in addition to any _SUCCESS reports created by the inner committer. This is to align later with collection of context IOStats, where all stream read/write stats are collected. For now, a JSON summary of the stats collected/reported by the committers is saved, for example (omitting all min/max/mean values where no results were collected).
Follow on to SPARK-40034. Dynamic partitioning through the PathOutputCommitProtocol needs to add the dirs to the superclass's partition list, else the partition delete doesn't take place. Fix:
* add an addPartition() method subclasses can use
* add a getPartitions method to return an immutable copy of the list for testing
* add tests to verify all of this

Also fix newTaskTempFileAbsPath to return a path, irrespective of committer type. In dynamic mode, because the parent dir of an absolute path is deleted, there's a safety check to reject any requests for a file in a parent dir. This is something which could be pulled up to HadoopMapReduceCommitProtocol, because it needs the same check, if the risk is considered realistic.

The patch now downgrades from failing on dynamic partitioning if the committer doesn't declare it supports it to printing a warning. Why? Well, it *does* work through the s3a committers; it's just O(data). If someone does want to do INSERT OVERWRITE then they can be allowed to, just warned about it. The outcome will be correct except in the case of: "if the driver fails partway through dir rename, only some of the files will be there". Google GCS has that same non-atomic rename issue. But even on an FS with atomic dir rename, any job which fails partway through the overwrite process is going to leave the fs in an inconsistent state, such as
* some parts with new data, some parts not yet overwritten
* a directory deleted and the new data not instantiated

So it's not that much worse. The patch tries to update the protocol spec in HadoopMapReduceCommitProtocol to cover both the newFileAbsPath() semantics/commit and the failure modes of dynamic partition commit.

Change-Id: Ibdf1bd43c82d792d8fcf2cace417830663dcc541
Clarify that abs path files aren't saved in the TA dir, but in the staging dir, which will be created even in a classic non-dynamic job just for these files.

Change-Id: I86de8fb190a44bfc8c6e33ede163eebc1939e332
Moving PathOutputCommitProtocol off being a subclass of HadoopMapReduceCommitProtocol makes it possible to support dynamic overwrite without making changes to HadoopMapReduceCommitProtocol, and to add parallel file/dir commit to compensate for file/dir rename performance.
* Copy HadoopMapReduceCommitProtocol methods into PathOutputCommitProtocol
* Make PathOutputCommitProtocol extend FileCommitProtocol directly

Change-Id: Ic8ee1b7917538da0f99434768df0aae8bdc12f01
Parallelize abs and dynamic dir delete/create and rename. This addresses scale issues with Google GCS. For S3 it marginally improves performance, as directories will be copied in parallel, but the time to copy the objects in there is O(data). There's parallel copy in the S3A code too, but it is limited and not configurable. That is: this change does not speed up INSERT OVERWRITE on S3 to the point where it is fast, instead simply "less slow".

Change-Id: I4396d3fe2562c753d5a54a9ecdb9be2877bd81b0
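The parallel rename described in this commit can be sketched roughly as below. This is an illustrative sketch only, not the patch's code: the `parallelRename` name, the `rename` callback, and the pool size are all assumptions.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical sketch: run one rename per directory on a bounded thread
// pool, wait for them all, and report whether every rename succeeded.
def parallelRename(dirs: Seq[String],
                   rename: String => Boolean,
                   threads: Int = 8): Boolean = {
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    // issue all renames concurrently, then collect every result
    val all = Future.sequence(dirs.map(d => Future(rename(d))))
    Await.result(all, Duration.Inf).forall(identity)
  } finally {
    pool.shutdown()
  }
}
```

On a store with O(1) directory rename this mainly hides per-call latency; on S3, where "rename" is copy-then-delete, each task is still O(data), which is why the commit message describes the result as "less slow" rather than fast.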
Tests to verify the dynamic overwrite are wired up. Done by subclassing PartitionedWriteSuite and reconfiguring it. It's hard to verify the committers are being picked up, which is addressed by
* during dev: triggering errors
* soon: looking for JSON iostats reports

Note we could be *very* clever in this test: if the ManifestCommitterFactory is on the classpath, we could use it as the committer.

Change-Id: I47716c43f1d34f226f34bfbe330e862f101b73d2
If the Hadoop version used in the tests contains the manifest committer, the tests switch to it. This will collect and report IO statistics and other summary information into the directory target/reports/summaries. The same information is also collected and aggregated through the task commit messages. As/when thread context IOStats are collected (especially read IO information), that will only be reported via the Spark protocol.

Change-Id: I60c0fcfe0a538c147207349fd15aa991b2f2a0f0
All stat reports use spark.sql.sources.writeJobUUID in the filename, if set.

Change-Id: Ie32033dae6da9c6bd24c1927b15bab8d3b49458b
Some extra diligence: test ParquetV2QuerySuite through the protocol and, on Hadoop 3.3.5+, the manifest committer.

Change-Id: Ia9b869c65dd97463b7af02990cef8582e7680046
I've created a more minimal PR with the earlier production code changes and the newer tests: #40221; if all is happy there it can go in, while this new protocol variant can be considered/stabilised. Once Spark switches to 3.3.5 as the Hadoop 3 version, testing all this stuff becomes easier, as the ManifestCommitter will always be there, which tests can then code for.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Hi, after applying the patch with Spark 3.5.5 and Hadoop 3.3.6, I encountered an issue where using the Manifest committer for INSERT OVERWRITE with dynamic partitions leads to data being deleted during the cleanup phase, right after the rename step. Could this be a misuse on my part, or is the Manifest committer not supported by this patch?
What changes were proposed in this pull request?
Follow on to SPARK-40034.
Dynamic partitioning through the PathOutputCommitProtocol needs to add the dirs to the superclass's partition list, else the partition delete doesn't take place.
Fix:
* add an addPartition() method subclasses can use
* add a getPartitions method to return an immutable copy of the list for testing
* add tests to verify all of this
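The partition-tracking fix above can be sketched as follows. Only the method names `addPartition`/`getPartitions` come from this PR; the class name, storage, and synchronization details here are assumptions for illustration.

```scala
import scala.collection.mutable

// Hypothetical sketch of the superclass's partition tracking.
class PartitionTracker {
  private val partitions = mutable.LinkedHashSet[String]()

  // called by subclasses whenever a dynamic-partition dir is created, so
  // that the job-commit partition delete covers it
  def addPartition(dir: String): Unit = synchronized {
    partitions += dir
  }

  // immutable copy of the list, so tests can inspect it safely
  def getPartitions: Set[String] = synchronized {
    partitions.toSet
  }
}
```

Returning an immutable copy rather than the live set means test assertions can't be invalidated (or the protocol corrupted) by concurrent task commits.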
Also fix newTaskTempFileAbsPath to return a path, irrespective of committer type.
In dynamic mode, because the parent dir of an absolute path is deleted, there's a safety check to reject any requests for a file in a parent dir. This
is something which could be pulled up to HadoopMapReduceCommitProtocol because it needs the same check, if the risk is considered realistic.
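The safety check described above could look something like this. All names here are assumptions, not the patch's API; the point is only that a request for a file in a dir that is a parent of the job's destination must be rejected, since deleting that parent at commit would wipe the destination.

```scala
// Hypothetical sketch: reject an absolute-path file request whose target
// directory is an ancestor of the destination dir being overwritten.
def validateAbsPathDir(requestedDir: String, destDir: String): Unit = {
  val dest = destDir.stripSuffix("/")
  val req = requestedDir.stripSuffix("/")
  // the requested dir's contents get deleted/replaced in dynamic mode;
  // fail fast if that would take the destination dir with it
  require(!dest.startsWith(req + "/") && dest != req,
    s"Cannot create a file in $req: deleting its parent during dynamic " +
      s"partition overwrite would destroy $dest")
}
```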
The patch now downgrades from failing on dynamic partitioning if the committer doesn't declare it supports it to printing a warning. Why? Well, it *does* work through the s3a committers; it's just O(data). If someone does want to do INSERT OVERWRITE then they can be allowed to, just warned about it. The outcome will be correct except in the case of: "if the driver fails partway through dir rename, only some of the files will be there".
Google GCS has that same non-atomic rename issue. But: even on an FS with atomic dir rename, any job which fails partway through the overwrite process
is going to leave the fs in an inconsistent state, such as
* some parts with new data, some parts not yet overwritten
* a directory deleted and the new data not instantiated

So it's not that much worse.
The patch tries to update the protocol spec in HadoopMapReduceCommitProtocol to cover both newFileAbsPath() semantics/commit and failure modes of
dynamic partition commit.
Why are the changes needed?
newFileAbsPath() code is required of all committers, even though (thankfully) it is not used much. This is an attempt to do a minimal commit for complete functionality, with no attempt to improve performance with parallel dir setup, rename, etc.
That'd really be damage limitation: if you want performance in cloud, use a manifest format.
Does this PR introduce any user-facing change?
Updates the cloud docs to say that dynamic partition overwrite does work
everywhere, just may be really slow.
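For reference, a configuration sketch for routing writes through the committers this PR targets. The key names below follow Spark's cloud integration guide, but verify the class names against your Spark release; `spark.sql.sources.partitionOverwriteMode=dynamic` is the standard switch that this patch now allows with a warning instead of a failure.

```properties
# spark-defaults.conf sketch (class names per the spark-hadoop-cloud module;
# check them against your Spark version before relying on this)
spark.sql.sources.commitProtocolClass     org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class  org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.partitionOverwriteMode  dynamic
```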
How was this patch tested?
New unit tests; run through Spark with Hadoop 3.3.5 RC0.
One test area which is not covered is "does the absolute path mapping work store the correct paths and marshal them correctly?"
I've just realised that I can test this without adding an accessor for the map, just by calling commitTask() and parsing the TaskCommitMessage which comes back. Shall I add that?