[SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors #27089
Conversation
Test build #116087 has finished for PR 27089 at commit
@cloud-fan @HyukjinKwon @dongjoon-hyun @viirya, could you take a look at this PR, which implements several small performance optimizations in ParquetRowConverter?
```diff
 private trait RepeatedConverter {
-  private var currentArray: ArrayBuffer[Any] = _
+  private[this] val currentArray = new java.util.ArrayList[Any]()
```
HyukjinKwon: @JoshRosen, sorry if I'm ignorant about this, but why do we need to change ArrayBuffer to ArrayList? It seems ArrayBuffer itself is mutable and can clear() too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JoshRosen: From prior experience I've found ArrayList to be marginally faster; I ran some quick-and-dirty non-Spark microbenchmarks and this is indeed still the case, but the gain is pretty marginal compared to other factors.

In the interests of code simplicity and clarity, I've backed out that part of the change: the code now uses and clear()s a mutable.ArrayBuffer: 6d16f59
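In the same quick-and-dirty spirit, a standalone comparison might look like the sketch below. This is my own illustration, not the benchmark from the PR, and it has no JIT warmup or forking, so a proper JMH harness would be needed for trustworthy numbers:

```scala
import scala.collection.mutable.ArrayBuffer

// Naive microbenchmark: repeatedly fill and clear each buffer type.
object BufferBench {
  def main(args: Array[String]): Unit = {
    val iters = 10000
    val n = 1000

    def time(label: String)(body: => Unit): Unit = {
      val t0 = System.nanoTime()
      body
      println(f"$label%-12s ${(System.nanoTime() - t0) / 1e6}%.1f ms")
    }

    val list = new java.util.ArrayList[Any]()
    time("ArrayList") {
      var i = 0
      while (i < iters) {
        list.clear()
        var j = 0
        while (j < n) { list.add(j); j += 1 }
        i += 1
      }
    }

    val buf = ArrayBuffer.empty[Any]
    time("ArrayBuffer") {
      var i = 0
      while (i < iters) {
        buf.clear()
        var j = 0
        while (j < n) { buf += j; j += 1 }
        i += 1
      }
    }
  }
}
```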
HyukjinKwon left a comment:
LGTM but one question
```scala
// NOTE: We can't reuse the mutable `ArrayBuffer` here and must instantiate a new buffer for the
// next value. `Row.copy()` only copies row cells, it doesn't do deep copy to objects stored
// in row cells.
override def start(): Unit = currentArray = ArrayBuffer.empty[Any]
```
HyukjinKwon: I think it depends on whether currentArray.toArray copies the elements or not?
JoshRosen: ArrayBuffer.toArray should always return a fresh, unshared array object (internally, it allocates a new array and then calls copyToArray).
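A tiny REPL check of that behavior (my own illustration, not from the PR):

```scala
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer[Any]("a", "b")
val snapshot = buf.toArray        // allocates a new Array and copies the references
buf.clear()                       // mutating the buffer afterwards...
buf += "c"
println(snapshot.mkString(", "))  // ...leaves the snapshot untouched: prints "a, b"
```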
It doesn't do copying / cloning of the array elements themselves, but that shouldn't be a problem: by design, the objects inserted into this array are unshared and immutable. The map and array converters always return unshared objects, and we always .copy() rows when inserting them into a map or array parent container (this is still true after the changes in #26993).

I did a bit of archaeology and tracked down the source of the // NOTE comment here: it was added in #7231, and at that time it looks like we were actually passing the mutable.ArrayBuffer itself to the updater: https://github.com/apache/spark/blame/360fe18a61538b03cac05da1c6d258e124df6feb/sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala#L321. The comment makes sense in that context: with that older code, we would wind up with Row() objects that contained mutable.ArrayBuffers.

Later, in #7724, this was changed to pass a new GenericArrayData(currentArray.toArray) to the parent updater: c0cc0ea#diff-1d6c363c04155a9328fe1f5bd08a2f90. At that point I think we could have safely made the change to begin reusing the mutable.ArrayBuffer, since it no longer escaped its converter.
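To make that concrete, here is a minimal sketch of the post-#7724 shape under which buffer reuse is safe. This is not Spark's actual code; the Updater trait and class name below are stand-ins for the internal ParentContainerUpdater and converter:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.util.GenericArrayData

// Stand-in for Spark's internal ParentContainerUpdater.
trait Updater { def set(value: Any): Unit }

class RepeatedConverterSketch(updater: Updater) {
  private[this] val currentArray = ArrayBuffer.empty[Any]

  // Safe to reuse the buffer across records: `end()` copies the contents
  // into a fresh array via `toArray`, so the buffer never escapes.
  def start(): Unit = currentArray.clear()
  def end(): Unit = updater.set(new GenericArrayData(currentArray.toArray))
}
```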
Test build #116199 has finished for PR 27089 at commit
Merged to master.
What changes were proposed in this pull request?
This PR implements multiple performance optimizations for ParquetRowConverter, achieving some modest constant-factor wins for all fields and larger wins for map and array fields:

- Add private[this] to several vals (90cebf0).
- Cache field updaters in a fieldUpdaters array, saving two .updater() calls per field (7318785): I suspect that these are often megamorphic calls, so cutting these out seems like it could be a relatively large performance win (see the sketch below).
- Call currentRow.numFields once per start() call (e05de15): previously we'd call it once per field, and this had a significant enough cost that it was visible during profiling.
- Reuse the mutable ArrayBuffer across field reads: previously we'd allocate a new ArrayBuffer for each field read, but this isn't actually necessary because the data is already copied into a fresh array when end() constructs a GenericArrayData.
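As a rough illustration of the second and third bullets, consider the hedged sketch below. The trait names are stand-ins that only approximate Spark's internal converter interfaces, not the real code:

```scala
trait ParentContainerUpdater { def start(): Unit }
trait FieldConverter { def updater: ParentContainerUpdater }
trait MutableRow { def numFields: Int; def setNullAt(ordinal: Int): Unit }

class RowConverterSketch(
    fieldConverters: Array[FieldConverter],
    currentRow: MutableRow) {

  // Cache each field's updater once, instead of calling the (possibly
  // megamorphic) `.updater` accessor on every field in start() and end().
  private[this] val fieldUpdaters: Array[ParentContainerUpdater] =
    fieldConverters.map(_.updater)

  def start(): Unit = {
    val numFields = currentRow.numFields // read once per record, not per field
    var i = 0
    while (i < numFields) {
      fieldUpdaters(i).start()
      currentRow.setNullAt(i)
      i += 1
    }
  }
}
```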
Why are the changes needed?
To improve Parquet read performance; this is complementary to #26993's (orthogonal) improvements for nested struct read performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests, plus manual benchmarking with both synthetic and realistic schemas (similar to the ones in #26993). I've seen ~10%+ improvements in scan performance on certain real-world datasets.
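For reference, a scan benchmark along these lines can be run in spark-shell. The dataset path is made up and the use of the noop sink is my own illustration, not necessarily how the PR's benchmarks were run:

```scala
// Illustrative only: times a full Parquet scan, discarding the output.
val df = spark.read.parquet("/tmp/benchmark-dataset")
val start = System.nanoTime()
df.write.format("noop").mode("overwrite").save()
println(s"Scan took ${(System.nanoTime() - start) / 1e9} seconds")
```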