[SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors
### What changes were proposed in this pull request?
This PR implements multiple performance optimizations for `ParquetRowConverter`, achieving some modest constant-factor wins for all fields and larger wins for map and array fields:
- Add `private[this]` to several `val`s (90cebf0)
- Keep a `fieldUpdaters` array, saving two `.updater()` calls per field (7318785): I suspect that these are often megamorphic calls, so cutting them out seems like it could be a relatively large performance win.
- Only call `currentRow.numFields` once per `start()` call (e05de15): previously we'd call it once per field and this had a significant enough cost that it was visible during profiling.
- Reuse buffers in array and map converters (c7d1534, 6d16f59): previously we would create a brand-new Scala `ArrayBuffer` for each field read, but this isn't actually necessary because the data is already copied into a fresh array when `end()` constructs a `GenericArrayData`.
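The buffer-reuse idea in the last bullet can be sketched as follows. This is a simplified, hypothetical converter (the names `SimpleArrayConverter`, `start`, `addElement`, and `end` are illustrative, not Spark's actual `ParquetRowConverter` internals): because `end()` snapshots the buffer's contents into a fresh array, the same `ArrayBuffer` can be cleared and reused for the next record instead of being reallocated per field read.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical simplified array converter illustrating the reuse pattern.
class SimpleArrayConverter {
  // Reused across records; cleared in start() rather than reallocated.
  private[this] val currentArray = ArrayBuffer.empty[Int]

  def start(): Unit = currentArray.clear()

  def addElement(v: Int): Unit = currentArray += v

  // end() copies the buffer into a fresh array (analogous to constructing
  // a GenericArrayData), so reusing the buffer afterwards is safe.
  def end(): Array[Int] = currentArray.toArray
}

object Demo {
  def main(args: Array[String]): Unit = {
    val conv = new SimpleArrayConverter
    conv.start()
    conv.addElement(1)
    conv.addElement(2)
    val first = conv.end()
    conv.start() // buffer reused, not reallocated
    conv.addElement(3)
    val second = conv.end()
    assert(first.toSeq == Seq(1, 2)) // earlier snapshot is unaffected
    assert(second.toSeq == Seq(3))
  }
}
```

The key safety argument is the comment on `end()`: the snapshot copy is what makes clearing the shared buffer between records correct.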
### Why are the changes needed?
To improve Parquet read performance; this is complementary to #26993's (orthogonal) improvements for nested struct read performance.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests, plus manual benchmarking with both synthetic and realistic schemas (similar to the ones in #26993). I've observed scan-performance improvements of roughly 10% or more on certain real-world datasets.
Closes #27089 from JoshRosen/joshrosen/more-ParquetRowConverter-optimizations.
Lead-authored-by: Josh Rosen <[email protected]>
Co-authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>