-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-31443][SQL] Fix perf regression of toJavaDate #28212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Collect longs 1336 2676 1201 3.7 267.2 0.3X | ||
| Collect timestamps 2025 2091 65 2.5 405.0 0.2X | ||
| From java.sql.Date 935 947 10 5.3 187.1 1.0X | ||
| Collect dates 2427 3239 1338 2.1 485.3 0.4X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(485.3 - 187.1) = 298.2 ns/row after the changes
(461.1 - 111.8) = 349.3 ns/row on Spark 2.4.6-SNAPSHOT
/cc @cloud-fan @HyukjinKwon
| val timeZoneOffset = TimeZone.getDefault match { | ||
| case zoneInfo: ZoneInfo => zoneInfo.getOffsetsByWall(localMillis, null) | ||
| case timeZone: TimeZone => timeZone.getOffset(localMillis - timeZone.getRawOffset) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we apply this change for spark 2.4 as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can do that in 2.4, sure but see #28216 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://spark.apache.org/versioning-policy.html,, a maintenance release would not include performance improvement patch.
For 3.0, it looks good due to the small size of this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's best to follow the guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was asking for the timezone offset change, which seems wrong in 2.4: https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1083-L1118
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't tell that the 2 lines here behave the same as that long method in 2.4. If they are the same and it's just an improvement, then we shouldn't backport. If 2.4 is wrong and then we should fix it as it's a correctness issue.
|
Test build #121268 has finished for PR 28212 at commit
|
|
Test build #121277 has finished for PR 28212 at commit
|
|
Test build #121286 has finished for PR 28212 at commit
|
|
The PR itself looks good but we should also investigate if 2.4 has a correctness issue. Thanks, merging to master/3.0! |
### What changes were proposed in this pull request? Optimise the `toJavaDate()` method of `DateTimeUtils` by: 1. Re-using `rebaseGregorianToJulianDays` optimised by #28067 2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of date-time fields. This allows to avoid "normalization" inside of `java.sql.Date`. Also new benchmark for collecting dates is added to `DateTimeBenchmark`. ### Why are the changes needed? The changes fix the performance regression of collecting `DATE` values comparing to Spark 2.4 (see `DateTimeBenchmark` in MaxGekk#27): Spark 2.4.6-SNAPSHOT: ``` To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 559 603 38 8.9 111.8 1.0X Collect dates 2306 3221 1558 2.2 461.1 0.2X ``` Before the changes: ``` To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 1052 1130 73 4.8 210.3 1.0X Collect dates 3251 4943 1624 1.5 650.2 0.3X ``` After: ``` To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 416 419 3 12.0 83.2 1.0X Collect dates 1928 2759 1180 2.6 385.6 0.2X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`. - Re-run `DateTimeBenchmark` in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | Closes #28212 from MaxGekk/optimize-toJavaDate. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 744c248) Signed-off-by: Wenchen Fan <[email protected]>
Here is the JIRA ticket for investigation: SPARK-31449 |
What changes were proposed in this pull request?
Optimise the
toJavaDate()method ofDateTimeUtilsby:rebaseGregorianToJulianDaysoptimised by [SPARK-31297][SQL] Speed up dates rebasing #28067java.sql.Dateinstances from milliseconds in UTC since the epoch instead of date-time fields. This allows to avoid "normalization" inside ofjava.sql.Date.Also new benchmark for collecting dates is added to
DateTimeBenchmark.Why are the changes needed?
The changes fix the performance regression of collecting
DATEvalues comparing to Spark 2.4 (seeDateTimeBenchmarkin MaxGekk#27):Spark 2.4.6-SNAPSHOT:
Before the changes:
After:
Does this PR introduce any user-facing change?
No
How was this patch tested?
DateTimeUtilsSuite,RebaseDateTimeSuite,DateFunctionsSuite,DateExpressionsSuite.DateTimeBenchmarkin the environment: