[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase #28406

cloud-fan · 2020-04-29T14:48:11Z

What changes were proposed in this pull request?

Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.

Why are the changes needed?

The Parquet vectorized reader is carefully implemented to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrades performance significantly because it breaks vectorization, even when the datetime values don't need rebasing (which is very likely, as dates before 1582 are rare).

Does this PR introduce any user-facing change?

no

How was this patch tested?

Ran part of the DateTimeRebaseBenchmark locally. The results:
Before this patch

[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2677           2838         142         37.4          26.8       1.0X
[info] after 1582, vec on, rebase on                      3828           4331         805         26.1          38.3       0.7X
[info] before 1582, vec on, rebase off                    2903           2926          34         34.4          29.0       0.9X
[info] before 1582, vec on, rebase on                     4163           4197          38         24.0          41.6       0.6X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3537           3627         104         28.3          35.4       1.0X
[info] after 1900, vec on, rebase on                      6891           7010         105         14.5          68.9       0.5X
[info] before 1900, vec on, rebase off                    3692           3770          72         27.1          36.9       1.0X
[info] before 1900, vec on, rebase on                     7588           7610          30         13.2          75.9       0.5X

After this patch

[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2758           2944         197         36.3          27.6       1.0X
[info] after 1582, vec on, rebase on                      2908           2966          51         34.4          29.1       0.9X
[info] before 1582, vec on, rebase off                    2840           2878          37         35.2          28.4       1.0X
[info] before 1582, vec on, rebase on                     3407           3433          24         29.4          34.1       0.8X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3861           4003         139         25.9          38.6       1.0X
[info] after 1900, vec on, rebase on                      4194           4283          77         23.8          41.9       0.9X
[info] before 1900, vec on, rebase off                    3849           3937          79         26.0          38.5       1.0X
[info] before 1900, vec on, rebase on                     7512           7546          55         13.3          75.1       0.5X

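With rebase on, the date type is about 30% faster when the values don't need rebasing (Rate 26.1 → 34.4 M/s) and about 20% faster when they do (24.0 → 29.4 M/s).
With rebase on, the timestamp type is about 60% faster when the values don't need rebasing (Rate 14.5 → 23.8 M/s) and shows no difference when they do (13.2 → 13.3 M/s).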

cloud-fan · 2020-04-29T14:48:44Z

cc @MaxGekk @HyukjinKwon @kiszk @rednaxelafx

kiszk · 2020-04-29T16:06:57Z

...ain/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java

+    ByteBuffer buffer = getBuffer(requiredBytes);
+    boolean rebase = false;
+    for (int i = 0; i < total; i += 1) {
+      rebase = buffer.getLong(buffer.position() + i * 8) < RebaseDateTime.lastSwitchJulianTs();


I have not fully understood this logic yet, but this code only keeps the result of the last iteration.
Should it be rebase |= ... or something similar?

Good catch! I reran the benchmark and saw no big difference, because in the benchmark data either all values need rebasing or none do.
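To make the point of this review comment concrete, here is a minimal, self-contained sketch of the two loop variants. It is not the PR's code: the class name, the `longs` helper, and the `THRESHOLD` value are hypothetical stand-ins (the PR compares against `RebaseDateTime.lastSwitchJulianTs()`).

```java
import java.nio.ByteBuffer;

final class RebaseFlagSketch {
  // Hypothetical cutoff for illustration only; the PR reads the real value
  // from RebaseDateTime.lastSwitchJulianTs().
  static final long THRESHOLD = 0L;

  // Mirrors the reviewed loop: plain assignment keeps only the last comparison.
  static boolean lastOnly(ByteBuffer buffer, int total) {
    boolean rebase = false;
    for (int i = 0; i < total; i += 1) {
      rebase = buffer.getLong(buffer.position() + i * 8) < THRESHOLD;
    }
    return rebase;
  }

  // Suggested fix: OR-accumulate so that any value below the cutoff sets the flag.
  static boolean anyBelow(ByteBuffer buffer, int total) {
    boolean rebase = false;
    for (int i = 0; i < total; i += 1) {
      rebase |= buffer.getLong(buffer.position() + i * 8) < THRESHOLD;
    }
    return rebase;
  }

  // Hypothetical helper: packs the given longs into a buffer positioned at 0.
  static ByteBuffer longs(long... values) {
    ByteBuffer buffer = ByteBuffer.allocate(values.length * 8);
    for (long v : values) {
      buffer.putLong(v);
    }
    buffer.flip();
    return buffer;
  }

  public static void main(String[] args) {
    // Only the first value is below the cutoff.
    ByteBuffer mixed = longs(-5L, 10L, 20L);
    System.out.println(lastOnly(mixed, 3)); // false: the early hit is overwritten
    System.out.println(anyBelow(mixed, 3)); // true
  }
}
```

The two variants only disagree on mixed batches, which fits the observation that a benchmark with all-or-nothing data shows no big difference.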

kiszk · 2020-04-29T16:11:43Z

...ain/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java

+    ByteBuffer buffer = getBuffer(requiredBytes);
+    boolean rebase = false;
+    for (int i = 0; i < total; i += 1) {
+      rebase = buffer.getInt(buffer.position() + i * 4) < RebaseDateTime.lastSwitchJulianDay();


same as at line 136.


MaxGekk · 2020-04-29T17:06:58Z

@cloud-fan Please review and merge cloud-fan#15

SparkQA · 2020-04-29T20:16:05Z

Test build #122068 has finished for PR 28406 at commit 056e7f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-29T22:05:45Z

Test build #122076 has finished for PR 28406 at commit 4d15a49.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

* Re-gen on JDK 11 * Re-gen on JDK 8 * Re-gen on JDK 8 * Re-gen on JDK 11

SparkQA · 2020-04-30T18:29:14Z

Test build #122137 has finished for PR 28406 at commit 5a3009b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-04-30T18:48:23Z

...ain/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java

+    int requiredBytes = total * 8;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+    boolean rebase = false;
+    for (int i = 0; i < total; i += 1) {


The byte code of the loop is:

LINENUMBER 136 L6
ILOAD 6
ALOAD 5
ALOAD 5
INVOKEVIRTUAL java/nio/ByteBuffer.position ()I
ILOAD 7
BIPUSH 8
IMUL
IADD
INVOKEVIRTUAL java/nio/ByteBuffer.getLong (I)J
INVOKESTATIC org/apache/spark/sql/catalyst/util/RebaseDateTime.lastSwitchJulianTs ()J
LCMP
IFGE L7
ICONST_1
GOTO L8

We could avoid the multiplication like this:

int pos = buffer.position();
int endPos = pos + total * 8;
long threshold = RebaseDateTime.lastSwitchJulianTs();
while (pos < endPos) {
  rebase |= buffer.getLong(pos) < threshold;
  pos += 8;
}

Would it be faster?

I tried it and saw no difference in performance.
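For readers who want to try the two loop shapes side by side, the following is a small sketch that wraps each in a method. The class and method names are made up for illustration, and the threshold is passed in as a parameter instead of being read from `RebaseDateTime.lastSwitchJulianTs()`. It only shows that the two forms compute the same flag; it says nothing about their relative speed.

```java
import java.nio.ByteBuffer;

final class StridedScanSketch {
  // Indexed form: recomputes position + i * 8 on every iteration.
  static boolean indexed(ByteBuffer buffer, int total, long threshold) {
    boolean rebase = false;
    for (int i = 0; i < total; i += 1) {
      rebase |= buffer.getLong(buffer.position() + i * 8) < threshold;
    }
    return rebase;
  }

  // Strided form from the review suggestion: hoists position() and the
  // threshold out of the loop and advances the offset by 8 bytes per value.
  static boolean strided(ByteBuffer buffer, int total, long threshold) {
    boolean rebase = false;
    int pos = buffer.position();
    int endPos = pos + total * 8;
    while (pos < endPos) {
      rebase |= buffer.getLong(pos) < threshold;
      pos += 8;
    }
    return rebase;
  }

  public static void main(String[] args) {
    ByteBuffer buffer = ByteBuffer.allocate(32);
    buffer.putLong(-1L).putLong(5L).putLong(7L).putLong(9L);
    buffer.flip();
    // Both forms use absolute reads, so neither moves the buffer position.
    System.out.println(indexed(buffer, 4, 0L) == strided(buffer, 4, 0L));
  }
}
```

Whether a JIT compiler treats the two forms differently depends on the JVM and hardware, so any comparison should be measured with a benchmark harness.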

MaxGekk · 2020-04-30T19:01:03Z

.../main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java

+            c.putNulls(rowId, n);
+          }
+          break;
+        case PACKED:


Would it be possible to optimize this case as well?

I didn't optimize this case because the no-rebase code path does not look very fast: it has an if-else in the loop.

The general idea is to add an extra loop to check if we need to rebase or not, and it's only worthwhile if the no-rebase code path is much faster than the rebase code path.
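The general idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the class name, the `int[]` output (standing in for a column vector), the `switchDay` parameter, and the `rebaseFn` callback are all hypothetical, and `rebaseFn` is assumed to return its input unchanged for values at or after `switchDay`.

```java
import java.nio.ByteBuffer;
import java.util.function.IntUnaryOperator;

final class CheckThenCopySketch {
  // Reads `total` 4-byte day values from `buffer` into `out`.
  // switchDay and rebaseFn are hypothetical stand-ins for the calendar
  // switch day and the per-value rebase function.
  static void readDates(ByteBuffer buffer, int total, int[] out,
                        int switchDay, IntUnaryOperator rebaseFn) {
    int start = buffer.position();

    // Extra pass: does any value in this batch fall before the switch day?
    boolean rebase = false;
    for (int i = 0; i < total; i += 1) {
      rebase |= buffer.getInt(start + i * 4) < switchDay;
    }

    if (rebase) {
      // Slow path: apply the rebase function to every value.
      for (int i = 0; i < total; i += 1) {
        out[i] = rebaseFn.applyAsInt(buffer.getInt(start + i * 4));
      }
    } else {
      // Fast path: a simple copy loop with no per-value branch or call.
      for (int i = 0; i < total; i += 1) {
        out[i] = buffer.getInt(start + i * 4);
      }
    }
    // Consume the bytes that were read.
    buffer.position(start + total * 4);
  }

  public static void main(String[] args) {
    ByteBuffer buffer = ByteBuffer.allocate(8);
    buffer.putInt(-10).putInt(100);
    buffer.flip();
    int[] out = new int[2];
    // Placeholder rebase: shifts old values by 10 days (not real calendar math).
    readDates(buffer, 2, out, 0, d -> d < 0 ? d + 10 : d);
    System.out.println(out[0] + ", " + out[1]);
  }
}
```

The extra scan costs one more pass over the batch, so, as noted above, it pays off only when the fast path is much cheaper than the rebase path.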

HyukjinKwon · 2020-05-04T06:28:45Z

Merged to master and branch-3.0.

…eader caused by datetime rebase

Closes #28406 from cloud-fan/perf.

Lead-authored-by: Wenchen Fan
Co-authored-by: Maxim Gekk
Signed-off-by: HyukjinKwon
(cherry picked from commit f72220b)
Signed-off-by: HyukjinKwon

…eader caused by datetime rebase

Closes apache#28406 from cloud-fan/perf.

Lead-authored-by: Wenchen Fan
Co-authored-by: Maxim Gekk
Signed-off-by: HyukjinKwon

probot-autolabeler bot added the SQL label Apr 29, 2020

kiszk reviewed Apr 29, 2020


reduce the perf regression of vectorized parquet reader caused by dat…

4d15a49

…etime rebase

cloud-fan force-pushed the perf branch from 056e7f0 to 4d15a49 on April 29, 2020 16:37

kiszk approved these changes Apr 30, 2020


Results of DateTimeRebaseBenchmark on JDK 8 and 11 (#15)

5a3009b

* Re-gen on JDK 11 * Re-gen on JDK 8 * Re-gen on JDK 8 * Re-gen on JDK 11

MaxGekk reviewed Apr 30, 2020


MaxGekk approved these changes Apr 30, 2020


HyukjinKwon approved these changes May 4, 2020


HyukjinKwon closed this in f72220b May 4, 2020
