[SPARK-31443][SQL] Fix perf regression of toJavaDate #28212

MaxGekk · 2020-04-14T09:59:04Z

What changes were proposed in this pull request?

Optimise the toJavaDate() method of DateTimeUtils by:

1. Re-using rebaseGregorianToJulianDays, which was optimised by [SPARK-31297][SQL] Speed up dates rebasing #28067.
2. Creating java.sql.Date instances from milliseconds in UTC since the epoch instead of from date-time fields. This avoids the "normalization" inside java.sql.Date.

A new benchmark for collecting dates is also added to DateTimeBenchmark.
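The two ways of constructing a java.sql.Date mentioned above can be contrasted in a minimal Java sketch. The class and method names (DateConstructionSketch, fromFields, fromUtcMillis) are hypothetical and not part of Spark; the sketch only illustrates the two constructors, not the actual toJavaDate implementation.

```java
import java.sql.Date;
import java.time.LocalDate;
import java.time.ZoneId;

// Hypothetical illustration class, not part of Spark.
class DateConstructionSketch {

    // Field-based construction via the deprecated constructor:
    // the year is offset by 1900 and the month is 0-based.
    // This is the path on which java.sql.Date has to resolve
    // date-time fields into an instant itself.
    @SuppressWarnings("deprecation")
    static Date fromFields(int year, int month, int day) {
        return new Date(year - 1900, month - 1, day);
    }

    // Millisecond-based construction: the instant (UTC millis since
    // the epoch) is computed by the caller and passed in directly.
    static Date fromUtcMillis(long utcMillis) {
        return new Date(utcMillis);
    }

    public static void main(String[] args) {
        // Local midnight of 2020-04-14 in the default time zone, as UTC millis.
        long utcMillis = LocalDate.of(2020, 4, 14)
            .atStartOfDay(ZoneId.systemDefault())
            .toInstant()
            .toEpochMilli();
        // Both constructions are expected to describe the same instant.
        System.out.println(fromFields(2020, 4, 14).equals(fromUtcMillis(utcMillis)));
    }
}
```

For a modern date such as this one, both paths should yield equal Date objects; the difference the PR targets is the work done to get there, not the result.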

Why are the changes needed?

The changes fix the performance regression in collecting DATE values compared to Spark 2.4 (see DateTimeBenchmark in MaxGekk#27):

Spark 2.4.6-SNAPSHOT:

To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  559            603          38          8.9         111.8       1.0X
Collect dates                                      2306           3221        1558          2.2         461.1       0.2X

Before the changes:

To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1052           1130          73          4.8         210.3       1.0X
Collect dates                                      3251           4943        1624          1.5         650.2       0.3X

After:

To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  416            419           3         12.0          83.2       1.0X
Collect dates                                      1928           2759        1180          2.6         385.6       0.2X

Does this PR introduce any user-facing change?

No

How was this patch tested?

By existing test suites, in particular DateTimeUtilsSuite, RebaseDateTimeSuite, DateFunctionsSuite and DateExpressionsSuite.
By re-running DateTimeBenchmark in the following environment:

Item	Description
Region	us-west-2 (Oregon)
Instance	r3.xlarge
AMI	ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5)
Java	OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10

MaxGekk · 2020-04-14T13:26:19Z

sql/core/benchmarks/DateTimeBenchmark-results.txt

-Collect longs                                      1336           2676        1201          3.7         267.2       0.3X
-Collect timestamps                                 2025           2091          65          2.5         405.0       0.2X
+From java.sql.Date                                  935            947          10          5.3         187.1       1.0X
+Collect dates                                      2427           3239        1338          2.1         485.3       0.4X


(485.3 - 187.1) = 298.2 ns/row after the changes
(461.1 - 111.8) = 349.3 ns/row on Spark 2.4.6-SNAPSHOT
/cc @cloud-fan @HyukjinKwon
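The two subtractions above appear to isolate the per-row cost of the collect step by removing the cost of creating the rows from java.sql.Date; that reading is an assumption, since the comment does not spell it out. A small Java sketch of the same arithmetic, with a hypothetical helper name (netPerRowNs):

```java
// Hypothetical illustration class, not part of Spark.
class PerRowCostSketch {

    // Per-row time of "Collect dates" minus per-row time of
    // "From java.sql.Date", both in nanoseconds.
    static double netPerRowNs(double collectNs, double fromDateNs) {
        return collectNs - fromDateNs;
    }

    public static void main(String[] args) {
        // Figures taken from the benchmark lines quoted in the comment.
        double after = netPerRowNs(485.3, 187.1);
        double spark24 = netPerRowNs(461.1, 111.8);
        System.out.println(String.format("after the changes: %.1f ns/row", after));
        System.out.println(String.format("Spark 2.4.6-SNAPSHOT: %.1f ns/row", spark24));
    }
}
```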

MaxGekk · 2020-04-14T13:30:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala

+    val timeZoneOffset = TimeZone.getDefault match {
+      case zoneInfo: ZoneInfo => zoneInfo.getOffsetsByWall(localMillis, null)
+      case timeZone: TimeZone => timeZone.getOffset(localMillis - timeZone.getRawOffset)
+    }


The code is adapted from https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801
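The generic TimeZone branch of the snippet above can be sketched in Java as follows. The class and method names (WallClockOffsetSketch, localToUtcMillis) are hypothetical. The ZoneInfo branch is left out because it relies on a JDK-internal class, so this is only an illustration of the fallback path, not the full method.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.TimeZone;

// Hypothetical illustration class, not part of Spark.
class WallClockOffsetSketch {

    // Converts "local" millis (wall-clock time counted as if it were UTC)
    // into UTC millis since the epoch for the given time zone.
    static long localToUtcMillis(long localMillis, TimeZone tz) {
        // First approximate the UTC instant by removing the standard (raw)
        // offset, then ask the zone for its full offset (including DST)
        // at that approximate instant.
        int offset = tz.getOffset(localMillis - tz.getRawOffset());
        return localMillis - offset;
    }

    public static void main(String[] args) {
        TimeZone tz = TimeZone.getTimeZone("America/Los_Angeles");
        // 2020-04-14 expressed as days since the epoch, then as local millis.
        long days = LocalDate.of(2020, 4, 14).toEpochDay();
        long localMillis = Math.multiplyExact(days, 24L * 60 * 60 * 1000);
        long viaTimeZone = localToUtcMillis(localMillis, tz);
        // Cross-check against java.time for the same zone and date.
        long viaJavaTime = LocalDate.of(2020, 4, 14)
            .atStartOfDay(ZoneId.of("America/Los_Angeles"))
            .toInstant()
            .toEpochMilli();
        System.out.println(viaTimeZone == viaJavaTime);
    }
}
```

Because the instant passed to getOffset is only an approximation of the true UTC time, results very close to a DST transition may differ from an exact wall-clock lookup.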

Shall we apply this change to Spark 2.4 as well?

If we can do that in 2.4, sure, but see #28216 (comment)

According to https://spark.apache.org/versioning-policy.html, a maintenance release would not include a performance improvement patch.

For 3.0, it looks good due to the small size of this change.

Yes, it's best to follow the guide.

I was asking about the timezone offset computation, which seems wrong in 2.4: https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1083-L1118

I can't tell whether the 2 lines here behave the same as that long method in 2.4. If they are the same and it's just an improvement, then we shouldn't backport. If 2.4 is wrong, then we should fix it, as it's a correctness issue.

SparkQA · 2020-04-14T15:43:09Z

Test build #121268 has finished for PR 28212 at commit 92253cf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-14T15:45:23Z

Test build #121277 has finished for PR 28212 at commit 1ff47ca.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-14T22:07:16Z

Test build #121286 has finished for PR 28212 at commit 614ec2a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-04-15T06:18:50Z

The PR itself looks good, but we should also investigate whether 2.4 has a correctness issue.

Thanks, merging to master/3.0!

Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
(cherry picked from commit 744c248)
Signed-off-by: Wenchen Fan

MaxGekk · 2020-04-15T06:49:19Z

... we should also investigate if 2.4 has a correctness issue

Here is the JIRA ticket for investigation: SPARK-31449

MaxGekk added 2 commits April 14, 2020 12:52

Rewrite toJavaDate

12f88c8

Benchmark collecting dates

92253cf

probot-autolabeler bot added the SQL label Apr 14, 2020

MaxGekk added 2 commits April 14, 2020 11:13

Re-gen DateTimeBenchmark results on JDK 8

6295260

Re-gen DateTimeBenchmark results on JDK 11

1ff47ca

MaxGekk changed the title ~~[WIP][SQL] Fix perf regression of toJavaDate~~ [WIP][SPARK-31443][SQL] Fix perf regression of toJavaDate Apr 14, 2020

Merge remote-tracking branch 'origin/master' into optimize-toJavaDate

f5c97d8

MaxGekk added 2 commits April 14, 2020 16:03

Re-gen DateTimeBenchmark results on JDK 8

58156bf

Re-gen DateTimeBenchmark results on JDK 11

614ec2a

MaxGekk changed the title ~~[WIP][SPARK-31443][SQL] Fix perf regression of toJavaDate~~ [SPARK-31443][SQL] Fix perf regression of toJavaDate Apr 14, 2020

cloud-fan closed this in 744c248 Apr 15, 2020

MaxGekk deleted the optimize-toJavaDate branch June 5, 2020 19:47

