[SPARK-30338][SQL] Avoid unnecessary InternalRow copies in ParquetRowConverter #26993
Conversation
```scala
    // we don't need to copy because copying will be done in the final
    // UnsafeProjection, or
    // 2. The path from the schema root to this field contains a map or array,
    // in which case we will perform a recursive defensive copy via the
```
Correctness relies on the copy actually being a deep copy. Looking elsewhere in this file, we have comments like

```scala
// NOTE: We can't reuse the mutable Map here and must instantiate a new `Map` for the next
// value. `Row.copy()` only copies row cells, it doesn't do deep copy to objects stored in row
// cells.
```

which suggest that certain copying might be shallow, so it's important to double-check and make sure that the copies are indeed deep.
Here, the state being copied is an InternalRow. To be more specific, it's actually a SpecificInternalRow (I'll update the .asInstanceOf cast below to reflect this). SpecificInternalRow extends BaseGenericInternalRow and #18483 changed that to implement a deep-copy, recursively copying maps, arrays, and structs.
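For illustration, here is a minimal, hypothetical sketch (not Spark's actual classes) of why a cell-level "shallow" copy is unsafe when a converter reuses a mutable object between records:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a row: `shallowCopy` clones the cell array but
// not the objects stored in the cells (the `Row.copy()` caveat quoted above).
final class TinyRow(val cells: Array[Any]) {
  def shallowCopy: TinyRow = new TinyRow(cells.clone())
}

val reusedBuffer = ArrayBuffer(1, 2, 3)
val row = new TinyRow(Array(reusedBuffer))
val saved = row.shallowCopy

// A converter "reuses" the buffer while reading the next record...
reusedBuffer.clear()
reusedBuffer += 99

// ...and the shallow copy silently observes the mutation, because its cell
// still points at the same buffer object. A deep copy would have prevented this.
val leaked = saved.cells(0).asInstanceOf[ArrayBuffer[Int]]
println(leaked)  // ArrayBuffer(99), not the original ArrayBuffer(1, 2, 3)
```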
I think the existing comment about Row.copy() is outdated, so we might be able to optimize those other parts of the code, too; I'm going to defer that to future work / another PR, though.
Update: in #27089 I'm removing these other unnecessary ArrayBuffer copies.
Test build #115677 has finished for PR 26993 at commit

Test build #115680 has finished for PR 26993 at commit

retest this please

Test build #115702 has finished for PR 26993 at commit

retest this please

Test build #115727 has finished for PR 26993 at commit

retest this please

Test build #115742 has finished for PR 26993 at commit

Test build #115874 has finished for PR 26993 at commit
I've updated the existing "map with struct values" test so that it uses maps with multiple values. Previously, we only tested with maps that contained one entry, and that's insufficient to detect struct-copying problems: the old test would still pass even if I completely removed the `.copy()`.
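The single-entry blind spot can be sketched in plain Scala (hypothetical names, modeling a converter that reuses one mutable value holder and omits the defensive copy):

```scala
import scala.collection.mutable

// Hypothetical sketch of the failure mode: a converter that reuses one
// mutable value holder per map entry, without a defensive copy per entry.
val reusedValue = mutable.ArrayBuffer[Int]()
def readEntry(v: Int): mutable.ArrayBuffer[Int] = {
  reusedValue.clear(); reusedValue += v; reusedValue // no copy!
}

// One-entry map: the bug is invisible; the single value is still correct.
val single = Map("a" -> readEntry(1))
assert(single("a") == mutable.ArrayBuffer(1))

// Two-entry map: both keys end up pointing at the SAME buffer, so every
// value reflects only the last entry read.
val vA = readEntry(1)
val vB = readEntry(2)
val multi = Map("a" -> vA, "b" -> vB)
println(multi("a"))  // ArrayBuffer(2), not ArrayBuffer(1)
```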
```scala
        new ParquetMapConverter(parquetType.asGroupType(), t, updater)

      case t: StructType =>
        val wrappedUpdater = {
```
@JoshRosen, no big deal at all but how about we put the JIRA ID somewhere in the comment?
Good idea: I added a JIRA reference in e6945e8
I happened to take a cursory look and it seems pretty fine.
Test build #115981 has finished for PR 26993 at commit

retest this please

Test build #116000 has finished for PR 26993 at commit
I ran a "read and hash all columns" benchmark on datasets with real-world schemas; these schemas contained ~50-300+ fields at various depths of nesting. The benchmark code looked roughly like

```scala
val data = spark.read.parquet(args.list("input"): _*)
data.select(hash($"*").as("hash")).groupBy().sum("hash").collect()
```

Comparing the map/scan stages' …
@cloud-fan @dongjoon-hyun @viirya, could you take a look at this PR optimizing nested struct handling in `ParquetRowConverter`?
```scala
    // `updater` is a RowUpdater, implying that the parent container is a struct.
    // We do NOT need to perform defensive copying here because either:
    //
    // 1. The path from the schema root to this field consists only of nested
```
When we have a deeply nested struct inside an array, is it the first case here?

I think it is fine, because at the element converter the top-level struct inside an array element will do the defensive copying. So in the nested struct converter we will see a RowUpdater from the parent struct, and we don't need defensive copying there either.

It might also be good to mention this in the doc comment.
Yes, that's right. After thinking about this some more, I think I've come up with a clearer explanation and have updated the code comment: 4651b2f
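The invariant this discussion settles on can be expressed as a small, hedged sketch (not Spark's real code): a struct value needs a defensive copy only if some ancestor on its path from the schema root is an array or map, because only those containers invoke the same child converter multiple times per record.

```scala
// Hypothetical model of a schema path, walking from the root down to a field.
sealed trait Step
case object StructStep extends Step
case object ArrayStep  extends Step
case object MapStep    extends Step

// A struct needs a per-record defensive copy only if an array or map lies
// anywhere on the path above it; pure struct-in-struct nesting is safe
// because the final UnsafeProjection performs a single copy at the end.
def structNeedsCopy(pathFromRoot: List[Step]): Boolean =
  pathFromRoot.exists {
    case ArrayStep | MapStep => true
    case StructStep          => false
  }

// Struct nested only in structs: no per-level copy required.
assert(!structNeedsCopy(List(StructStep, StructStep, StructStep)))
// Struct inside an array element: the element converter must copy.
assert(structNeedsCopy(List(StructStep, ArrayStep, StructStep)))
```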
```scala
    }
  }

  testStandardAndLegacyModes("array of struct") {
```
Do we have a test for array of struct of struct?
I added a new test case for this in 0f1af94
viirya left a comment
Looks correct, and it's a pretty good performance improvement.
Test build #116188 has finished for PR 26993 at commit

Test build #116193 has finished for PR 26993 at commit
thanks, merging to master!
…plus misc. constant factors

### What changes were proposed in this pull request?

This PR implements multiple performance optimizations for `ParquetRowConverter`, achieving some modest constant-factor wins for all fields and larger wins for map and array fields:

- Add `private[this]` to several `val`s (90cebf0)
- Keep a `fieldUpdaters` array, saving two `.updater()` calls per field (7318785): I suspect that these are often megamorphic calls, so cutting these out seems like it could be a relatively large performance win.
- Only call `currentRow.numFields` once per `start()` call (e05de15): previously we'd call it once per field, and this had a significant enough cost that it was visible during profiling.
- Reuse buffers in array and map converters (c7d1534, 6d16f59): previously we would create a brand-new Scala `ArrayBuffer` for each field read, but this isn't actually necessary because the data is already copied into a fresh array when `end()` constructs a `GenericArrayData`.

### Why are the changes needed?

To improve Parquet read performance; this is complementary to #26993's (orthogonal) improvements for nested struct read performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests, plus manual benchmarking with both synthetic and realistic schemas (similar to the ones in #26993). I've seen ~10%+ improvements in scan performance on certain real-world datasets.

Closes #27089 from JoshRosen/joshrosen/more-ParquetRowConverter-optimizations.

Lead-authored-by: Josh Rosen <[email protected]>
Co-authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This PR modifies `ParquetRowConverter` to remove unnecessary `InternalRow.copy()` calls for structs that are directly nested in other structs.

Why are the changes needed?
These changes can significantly improve performance when reading Parquet files that contain deeply-nested structs with many fields.
The `ParquetRowConverter` uses per-field `Converter`s for handling individual fields. Internally, these converters may have mutable state and may return mutable objects. In most cases, each converter is only invoked once per Parquet record (this is true for top-level fields, for example). However, arrays and maps may call their child element converters multiple times per Parquet record: in these cases we must be careful to copy any mutable outputs returned by child converters.

In the existing code, `InternalRow`s are copied whenever they are stored into any parent container (not just maps and arrays). This copying can be especially expensive for deeply-nested fields, since a deep copy is performed at every level of nesting.

This PR modifies the code to avoid copies for structs that are directly nested in structs; see inline code comments for an argument for why this is safe.
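To see why per-level copying is so expensive, consider this toy model (purely illustrative, not Spark code) where every ancestor struct defensively deep-copies its subtree when a child is stored into it:

```scala
// Toy nested structure; `copyCount` tracks duplications of the innermost leaf.
var copyCount = 0
final case class Node(child: Option[Node]) {
  def deepCopy: Node = {
    if (child.isEmpty) copyCount += 1  // the leaf is duplicated
    Node(child.map(_.deepCopy))
  }
}

// Modeled old behavior: each time a struct is stored into its parent
// struct, the parent performs a deep copy of the whole subtree, so the
// innermost data is re-copied once per level of nesting.
def buildWithCopyAtEachLevel(depth: Int): Unit = {
  var n = Node(None)
  for (_ <- 1 until depth) {
    n = Node(Some(n))
    n.deepCopy
  }
}

buildWithCopyAtEachLevel(4)
println(copyCount)  // 3: the leaf was re-copied once per nesting level
```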
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Correctness: I added new test cases to `ParquetIOSuite` to increase coverage of nested structs, including structs nested in arrays: previously this suite didn't test that case, so we used to lack mutation coverage of this `copy()` code (the suite's tests still passed if I incorrectly removed the `.copy()` in all cases). I also added a test for maps with struct keys and modified the existing "map with struct values" test case to include maps with two elements (since the incorrect omission of a `copy()` can only be detected if the map has multiple elements).

Performance: I put together a simple local benchmark demonstrating the performance problems:
First, construct a nested schema:
`Wrapper3`'s schema looks like:

Next, generate some fake data:
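The code listings for the schema and data generation were not captured on this page; a plausible reconstruction (every name other than `Wrapper3` is a guess) is a chain of nested case classes like:

```scala
// Hypothetical reconstruction of the benchmark's nested schema; only the
// name `Wrapper3` appears in the original text, the rest is illustrative.
case class Inner(a: Int, b: Long, c: String)
case class Wrapper1(inner: Inner)
case class Wrapper2(w1: Wrapper1)
case class Wrapper3(w2: Wrapper2)

// In Spark this would become a Dataset[Wrapper3] written out as Parquet,
// e.g. spark.range(n).map(i => Wrapper3(...)).write.parquet(path).
val sample = Wrapper3(Wrapper2(Wrapper1(Inner(1, 2L, "x"))))
println(sample.w2.w1.inner.a)  // 1
```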
I then ran a simple benchmark consisting of a query where the `hash(*)` is designed to force decoding of all Parquet fields but avoids `RowEncoder` costs in the `.rdd.count()` stage.

In the old code, expensive copying takes place at every level of nesting; this was apparent in a flame graph (image not captured here).
After this PR's changes, the above toy benchmark runs ~30% faster.