@MaxGekk
Member

@MaxGekk MaxGekk commented Apr 13, 2020

What changes were proposed in this pull request?

This PR re-uses the optimized days-rebase function rebaseJulianToGregorianDays(), introduced by PR #28067, in the conversion of java.sql.Date values to Catalyst's DATE values. The function fromJavaDate in DateTimeUtils is rewritten: it takes the implementation from Spark 2.4 and rebases the final result via rebaseJulianToGregorianDays().
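
A minimal sketch of the two steps described above, assuming they are (1) computing local days since the epoch the way Spark 2.4 did and (2) rebasing that day count from the hybrid Julian+Gregorian calendar used by java.sql.Date to the proleptic Gregorian calendar. The names FromJavaDateSketch and rebaseSketch are hypothetical. rebaseSketch only stands in for rebaseJulianToGregorianDays(): it uses a plain java.util.Calendar round trip and does not reproduce the optimized implementation from PR #28067.

```scala
import java.sql.Date
import java.time.LocalDate
import java.time.temporal.ChronoField
import java.util.{Calendar, TimeZone}

object FromJavaDateSketch {
  private val MillisPerDay: Long = 24L * 60 * 60 * 1000
  // Epoch day of 1582-10-15, the first day of the Gregorian calendar.
  private val GregorianStartDay: Long = LocalDate.of(1582, 10, 15).toEpochDay

  // Hypothetical stand-in for rebaseJulianToGregorianDays() (assumption, not Spark's code):
  // read year/month/day from the hybrid calendar and re-interpret the same
  // fields in the proleptic Gregorian calendar.
  def rebaseSketch(days: Int): Int = {
    if (days >= GregorianStartDay) {
      days // both calendars agree on and after the cutover
    } else {
      val cal = new Calendar.Builder()
        .setCalendarType("gregory")
        .setTimeZone(TimeZone.getTimeZone("UTC"))
        .setInstant(days.toLong * MillisPerDay)
        .build()
      // Start from day 1 and add days, so Julian-only dates such as
      // 1500-02-29 do not make LocalDate.of throw.
      val localDate = LocalDate
        .of(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1, 1)
        .`with`(ChronoField.ERA, cal.get(Calendar.ERA).toLong)
        .plusDays(cal.get(Calendar.DAY_OF_MONTH) - 1L)
      Math.toIntExact(localDate.toEpochDay)
    }
  }

  // Local days since the epoch, followed by the rebase step.
  def fromJavaDate(date: Date): Int = {
    val millisUtc = date.getTime
    val millisLocal = millisUtc + TimeZone.getDefault.getOffset(millisUtc)
    val julianDays = Math.toIntExact(Math.floorDiv(millisLocal, MillisPerDay))
    rebaseSketch(julianDays)
  }
}
```

For dates on or after 1582-10-15 the rebase is the identity. The day just before the cutover, 1582-10-04 in the Julian calendar, keeps its year-month-day label after rebasing and therefore moves back by 10 days.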

I also updated DateTimeBenchmark and added a benchmark for the conversion from java.sql.Date.
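
To give a rough idea of what such a measurement involves, here is a standalone timing sketch. It is not DateTimeBenchmark and does not use Spark's benchmark utilities; the object ConversionTiming, the method bestTimeMs, and the LocalDate-based conversion used as an example are all assumptions made for illustration.

```scala
import java.sql.Date

object ConversionTiming {
  /**
   * Best wall-clock time in milliseconds, over `iterations` runs, of applying
   * `convert` to every element of `dates`.
   */
  def bestTimeMs(dates: Array[Date], iterations: Int)(convert: Date => Int): Double = {
    var best = Double.MaxValue
    var sink = 0L // accumulated so the JIT cannot drop the conversions
    for (_ <- 0 until iterations) {
      val start = System.nanoTime()
      var i = 0
      while (i < dates.length) {
        sink += convert(dates(i))
        i += 1
      }
      best = math.min(best, (System.nanoTime() - start) / 1e6)
    }
    if (sink == Long.MinValue) println(sink) // practically never true; keeps `sink` observable
    best
  }

  def main(args: Array[String]): Unit = {
    val millisPerDay = 86400000L
    // One date per day, starting at the epoch.
    val dates = Array.tabulate(1000000)(i => new Date(i * millisPerDay))
    val viaLocalDate = bestTimeMs(dates, 5)(d => d.toLocalDate.toEpochDay.toInt)
    println(f"via LocalDate: $viaLocalDate%.1f ms")
  }
}
```

Any other conversion of type Date => Int can be passed as the second argument to compare candidates on the same input array.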

Why are the changes needed?

The PR fixes the performance regression in parallelizing a collection of java.sql.Date values, and speeds up the conversion of external values to Catalyst's DATE values:

  • x4 compared to the master branch
  • 30% compared to Spark 2.4.6-SNAPSHOT

Spark 2.4.6-SNAPSHOT:

To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  614            655          43          8.1         122.8       1.0X

Before the changes:

To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1154           1206          46          4.3         230.9       1.0X

After:

To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  427            434           7         11.7          85.3       1.0X
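
The ratios between these runs can be recomputed from the Best Time column of the three tables. The sketch below does this for the quoted runs only; the object and method names are hypothetical, and the input values are copied from the tables.

```scala
object BenchmarkRatios {
  // Best Time (ms) of the "From java.sql.Date" row in each table.
  val spark24BestMs = 614.0
  val beforeBestMs = 1154.0
  val afterBestMs = 427.0

  // How many times faster the new run is than the baseline.
  def speedup(baselineMs: Double, newMs: Double): Double = baselineMs / newMs

  // Percentage of the baseline time that the new run saves.
  def timeReductionPercent(baselineMs: Double, newMs: Double): Double =
    (baselineMs - newMs) / baselineMs * 100.0

  def main(args: Array[String]): Unit = {
    println(f"after vs before:    ${speedup(beforeBestMs, afterBestMs)}%.2fx")
    println(f"after vs Spark 2.4: ${speedup(spark24BestMs, afterBestMs)}%.2fx")
    println(f"time saved vs 2.4:  ${timeReductionPercent(spark24BestMs, afterBestMs)}%.1f%%")
  }
}
```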

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • By existing test suites, in particular DateTimeUtilsSuite, RebaseDateTimeSuite, DateFunctionsSuite and DateExpressionsSuite.
  • By re-running DateTimeBenchmark in the following environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

@MaxGekk
Member Author

MaxGekk commented Apr 13, 2020

@cloud-fan FYI

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121204 has finished for PR 28205 at commit b174c1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title from "[WIP][SQL] Fix perf regression of fromJavaDate" to "[SPARK-31439][SQL] Fix perf regression of fromJavaDate" Apr 13, 2020
@SparkQA

SparkQA commented Apr 13, 2020

Test build #121219 has finished for PR 28205 at commit bc8110b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 14, 2020

@cloud-fan @HyukjinKwon Please review the PR.

@cloud-fan
Contributor

can we fix the conflicts?

localDateToDays(localDate)
val millisUtc = date.getTime
val millisLocal = millisUtc + TimeZone.getDefault.getOffset(millisUtc)
val julianDays = Math.toIntExact(Math.floorDiv(millisLocal, MILLIS_PER_DAY))
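
One detail of the quoted lines is the use of Math.floorDiv rather than plain integer division. A small sketch of the difference for instants before the epoch follows; the object FloorDivDays and its two helper methods are hypothetical names used only for illustration.

```scala
object FloorDivDays {
  private val MillisPerDay = 86400000L

  // Plain division truncates toward zero, so any instant in the last day
  // before the epoch is mapped to day 0 instead of day -1.
  def truncatedDays(millisLocal: Long): Long = millisLocal / MillisPerDay

  // floorDiv rounds toward negative infinity, which gives the calendar day
  // that actually contains the instant.
  def flooredDays(millisLocal: Long): Long = Math.floorDiv(millisLocal, MillisPerDay)

  def main(args: Array[String]): Unit = {
    val oneMilliBeforeEpoch = -1L // 1969-12-31T23:59:59.999 in local time
    println(s"truncated: ${truncatedDays(oneMilliBeforeEpoch)}")
    println(s"floored:   ${flooredDays(oneMilliBeforeEpoch)}")
  }
}
```

For non-negative inputs the two methods agree, so the distinction only matters for dates before 1970-01-01.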

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121261 has finished for PR 28205 at commit ee1d788.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 14, 2020

jenkins, retest this, please

@HyukjinKwon
Member

+1. Looks good to me too from a cursory look.

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121266 has finished for PR 28205 at commit ee1d788.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 2c5d489 Apr 14, 2020
cloud-fan pushed a commit that referenced this pull request Apr 14, 2020
Closes #28205 from MaxGekk/optimize-fromJavaDate.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2c5d489)
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk MaxGekk deleted the optimize-fromJavaDate branch June 5, 2020 19:47