-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-31439][SQL] Fix perf regression of fromJavaDate #28205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@cloud-fan FYI |
|
Test build #121204 has finished for PR 28205 at commit
|
|
Test build #121219 has finished for PR 28205 at commit
|
|
@cloud-fan @HyukjinKwon Please, review the PR. |
|
can we fix the conflicts? |
| localDateToDays(localDate) | ||
| val millisUtc = date.getTime | ||
| val millisLocal = millisUtc + TimeZone.getDefault.getOffset(millisUtc) | ||
| val julianDays = Math.toIntExact(Math.floorDiv(millisLocal, MILLIS_PER_DAY)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Spark 2.4 uses floating point ops, see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L137
|
Test build #121261 has finished for PR 28205 at commit
|
|
jenkins, retest this, please |
|
+1. Looks good to me too from a cursory look. |
|
Test build #121266 has finished for PR 28205 at commit
|
|
thanks, merging to master/3.0! |
### What changes were proposed in this pull request? In the PR, I propose to re-use optimized implementation of days rebase function `rebaseJulianToGregorianDays()` introduced by the PR #28067 in conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function `fromJavaDate` in `DateTimeUtils` was re-written by taking the implementation from Spark 2.4, and by rebasing the final results via `rebaseJulianToGregorianDays()`. Also I updated `DateTimeBenchmark`, and added a benchmark for conversion from `java.sql.Date`. ### Why are the changes needed? The PR fixes the regression of parallelizing a collection of `java.sql.Date` values, and improves performance of converting external values to Catalyst's `DATE` values: - x4 on the master branch - 30% against Spark 2.4.6-SNAPSHOT Spark 2.4.6-SNAPSHOT: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 614 655 43 8.1 122.8 1.0X ``` Before the changes: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 1154 1206 46 4.3 230.9 1.0X ``` After: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Date 427 434 7 11.7 85.3 1.0X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`. - Re-run `DateTimeBenchmark` in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | Closes #28205 from MaxGekk/optimize-fromJavaDate. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 2c5d489) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In the PR, I propose to re-use optimized implementation of days rebase function
rebaseJulianToGregorianDays()introduced by the PR #28067 in conversion ofjava.sql.Datevalues to Catalyst'sDATEvalues. The functionfromJavaDateinDateTimeUtilswas re-written by taking the implementation from Spark 2.4, and by rebasing the final results viarebaseJulianToGregorianDays().Also I updated
DateTimeBenchmark, and added a benchmark for conversion fromjava.sql.Date.Why are the changes needed?
The PR fixes the regression of parallelizing a collection of
java.sql.Datevalues, and improves performance of converting external values to Catalyst'sDATEvalues:Spark 2.4.6-SNAPSHOT:
Before the changes:
After:
Does this PR introduce any user-facing change?
No
How was this patch tested?
DateTimeUtilsSuite,RebaseDateTimeSuite,DateFunctionsSuite,DateExpressionsSuite.DateTimeBenchmarkin the environment: