Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Apr 14, 2020

What changes were proposed in this pull request?

Optimise the toJavaDate() method of DateTimeUtils by:

  1. Re-using rebaseGregorianToJulianDays optimised by [SPARK-31297][SQL] Speed up dates rebasing #28067
  2. Creating java.sql.Date instances from milliseconds in UTC since the epoch instead of date-time fields. This allows to avoid "normalization" inside of java.sql.Date.

Also new benchmark for collecting dates is added to DateTimeBenchmark.

Why are the changes needed?

The changes fix the performance regression of collecting DATE values comparing to Spark 2.4 (see DateTimeBenchmark in MaxGekk#27):

Spark 2.4.6-SNAPSHOT:

To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  559            603          38          8.9         111.8       1.0X
Collect dates                                      2306           3221        1558          2.2         461.1       0.2X

Before the changes:

To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1052           1130          73          4.8         210.3       1.0X
Collect dates                                      3251           4943        1624          1.5         650.2       0.3X

After:

To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  416            419           3         12.0          83.2       1.0X
Collect dates                                      1928           2759        1180          2.6         385.6       0.2X

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • By existing tests suites, in particular, DateTimeUtilsSuite, RebaseDateTimeSuite, DateFunctionsSuite, DateExpressionsSuite.
  • Re-run DateTimeBenchmark in the environment:
Item Description
Region us-west-2 (Oregon)
Instance r3.xlarge
AMI ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5)
Java OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10

@MaxGekk MaxGekk changed the title [WIP][SQL] Fix perf regression of toJavaDate [WIP][SPARK-31443][SQL] Fix perf regression of toJavaDate Apr 14, 2020
Collect longs 1336 2676 1201 3.7 267.2 0.3X
Collect timestamps 2025 2091 65 2.5 405.0 0.2X
From java.sql.Date 935 947 10 5.3 187.1 1.0X
Collect dates 2427 3239 1338 2.1 485.3 0.4X
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(485.3 - 187.1) = 298.2 ns/row after the changes
(461.1 - 111.8) = 349.3 ns/row on Spark 2.4.6-SNAPSHOT
/cc @cloud-fan @HyukjinKwon

Comment on lines +134 to +137
val timeZoneOffset = TimeZone.getDefault match {
case zoneInfo: ZoneInfo => zoneInfo.getOffsetsByWall(localMillis, null)
case timeZone: TimeZone => timeZone.getOffset(localMillis - timeZone.getRawOffset)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we apply this change for spark 2.4 as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can do that in 2.4, sure but see #28216 (comment)

Copy link
Member

@kiszk kiszk Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://spark.apache.org/versioning-policy.html,, a maintenance release would not include performance improvement patch.

For 3.0, it looks good due to the small size of this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's best to follow the guide.

Copy link
Contributor

@cloud-fan cloud-fan Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell that the 2 lines here behave the same as that long method in 2.4. If they are the same and it's just an improvement, then we shouldn't backport. If 2.4 is wrong and then we should fix it as it's a correctness issue.

@SparkQA
Copy link

SparkQA commented Apr 14, 2020

Test build #121268 has finished for PR 28212 at commit 92253cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 14, 2020

Test build #121277 has finished for PR 28212 at commit 1ff47ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-31443][SQL] Fix perf regression of toJavaDate [SPARK-31443][SQL] Fix perf regression of toJavaDate Apr 14, 2020
@SparkQA
Copy link

SparkQA commented Apr 14, 2020

Test build #121286 has finished for PR 28212 at commit 614ec2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

The PR itself looks good but we should also investigate if 2.4 has a correctness issue.

Thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 744c248 Apr 15, 2020
cloud-fan pushed a commit that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?
Optimise the `toJavaDate()` method of `DateTimeUtils` by:
1. Re-using `rebaseGregorianToJulianDays` optimised by #28067
2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of date-time fields. This allows to avoid "normalization" inside of  `java.sql.Date`.

Also new benchmark for collecting dates is added to `DateTimeBenchmark`.

### Why are the changes needed?
The changes fix the performance regression of collecting `DATE` values comparing to Spark 2.4 (see `DateTimeBenchmark` in MaxGekk#27):

Spark 2.4.6-SNAPSHOT:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  559            603          38          8.9         111.8       1.0X
Collect dates                                      2306           3221        1558          2.2         461.1       0.2X
```
Before the changes:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1052           1130          73          4.8         210.3       1.0X
Collect dates                                      3251           4943        1624          1.5         650.2       0.3X
```
After:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  416            419           3         12.0          83.2       1.0X
Collect dates                                      1928           2759        1180          2.6         385.6       0.2X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 744c248)
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 15, 2020

... we should also investigate if 2.4 has a correctness issue

Here is the JIRA ticket for investigation: SPARK-31449

@MaxGekk MaxGekk deleted the optimize-toJavaDate branch June 5, 2020 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants