[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource #28057
Conversation
@cloud-fan @HyukjinKwon @dongjoon-hyun Here are intermediate results of benchmarking timestamp rebasing in Parquet.

Thank you for the benchmark. Ya, it's an expected drawback.

Test build #120515 has finished for PR 28057 at commit

Parquet and Avro perform rebasing only if a SQL config is enabled (and the config is off by default). ORC always does rebasing, so I would expect some slowdown in ORC too.
Test build #120535 has finished for PR 28057 at commit

Test build #120538 has finished for PR 28057 at commit

Yes, thank you so much for the benchmarks.

Test build #120552 has finished for PR 28057 at commit
```scala
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
  withTempPath { path =>
    runBenchmark("Parquet read/write") {
```
Could you use a more specific benchmark title, because this is used in the generated files?
Isn't the name scoped by the concrete benchmark?
I will address the comments together with the other comments, because launching an EC2 instance and re-running the benchmark twice (for JDK 8 and 11) is a time-consuming process.
+1 to make the title mention second rebase.
I am going to replace it with "Rebasing dates/timestamps in Parquet datasource".
```scala
Seq("date", "timestamp").foreach { dateTime =>
  val benchmark = new Benchmark(s"Save ${dateTime}s to parquet", rowsNum, output = output)
  benchmark.addCase("after 1582, noop", 1) { _ =>
    genDF(rowsNum, dateTime, after1582 = true).noop()
```
Do you include the dataframe generation in the benchmark numbers? I think it should be excluded.
We have already discussed this in PRs for other benchmarks. The overhead of preparing the input dataframe is assumed to be subtracted from the other numbers.
For example:

```
after 1582, noop            9272   9272   0   10.8    92.7   1.0X
after 1582, rebase off     21841  21841   0    4.6   218.4   0.4X
```

The noop benchmark shows unavoidable overhead. If we subtract it, we get 21841 - 9272 = 12569. So, the overhead of preparing the input data is roughly 45%. I do believe this is important info, and we should keep it in the benchmark results.
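The subtraction above can be sketched in plain Python (a hypothetical helper, not part of the benchmark code; the numbers are the ones quoted from the benchmark output):

```python
# Sketch of isolating the cost of the step under test by subtracting
# the "noop" baseline, which measures only input-data preparation.

def subtract_baseline(total_ms: int, noop_ms: int) -> int:
    """Time attributable to the benchmarked step itself, in ms."""
    return total_ms - noop_ms

noop_ms = 9272         # "after 1582, noop"
rebase_off_ms = 21841  # "after 1582, rebase off"

write_only_ms = subtract_baseline(rebase_off_ms, noop_ms)
prep_share = noop_ms / rebase_off_ms  # fraction spent preparing input data

print(write_only_ms)         # 12569
print(round(prep_share, 2))  # 0.42
```

Keeping the noop case in the published results lets readers do this normalization themselves for any pair of rows.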
Test build #120577 has finished for PR 28057 at commit

retest this please

It's just a benchmark, so no need to wait for Jenkins. Thanks, merging to master/3.0!
[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource

### What changes were proposed in this pull request?
In the PR, I propose to add a new benchmark `DateTimeRebaseBenchmark` which measures the performance of rebasing dates/timestamps between the hybrid calendar (Julian + Gregorian) and the Proleptic Gregorian calendar:
1. In write, it saves dates and timestamps before and after the year 1582 separately, w/ and w/o rebasing.
2. In read, it loads the previously saved parquet files by the vectorized reader and by the regular reader.

Here is the summary of benchmarking:
- Saving timestamps is **~6 times slower**
- Loading timestamps w/ vectorized **off** is **~4 times slower**
- Loading timestamps w/ vectorized **on** is **~10 times slower**

### Why are the changes needed?
To know the impact of date-time rebasing introduced by #27915, #27953, #27807.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run the `DateTimeRebaseBenchmark` benchmark using Amazon EC2:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28057 from MaxGekk/rebase-bechmark.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a1dbcd1)
Signed-off-by: Wenchen Fan <[email protected]>
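For intuition about what the benchmarked rebasing does: the same local date corresponds to different epoch-day counts in the hybrid Julian+Gregorian calendar and the proleptic Gregorian calendar. Below is a minimal plain-Python illustration of that offset (my own sketch, not Spark's implementation; all function names are made up):

```python
# Day counts for the same local date under two calendar systems.
# The hybrid calendar is Julian before the Gregorian reform cutover
# (1582-10-15) and Gregorian from the cutover onward.

DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def day_of_year(year: int, month: int, day: int, leap: bool) -> int:
    doy = sum(DAYS_IN_MONTH[: month - 1]) + day
    if leap and month > 2:
        doy += 1
    return doy

def julian_ordinal(year, month, day):
    # Julian calendar: a leap year every 4 years, no century rule.
    leap = year % 4 == 0
    return 365 * (year - 1) + (year - 1) // 4 + day_of_year(year, month, day, leap)

def gregorian_ordinal(year, month, day):
    # Proleptic Gregorian: century years are leap only if divisible by 400.
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    y = year - 1
    return 365 * y + y // 4 - y // 100 + y // 400 + day_of_year(year, month, day, leap)

# Anchor the hybrid calendar so Julian 1582-10-04 is immediately followed
# by Gregorian 1582-10-15 (the 10 days dropped by the Gregorian reform).
HYBRID_SHIFT = gregorian_ordinal(1582, 10, 15) - (julian_ordinal(1582, 10, 4) + 1)

def hybrid_ordinal(year, month, day):
    if (year, month, day) >= (1582, 10, 15):
        return gregorian_ordinal(year, month, day)
    return julian_ordinal(year, month, day) + HYBRID_SHIFT

# The same local date yields different day counts, so a reader that
# interprets stored day counts in the other calendar must rebase them.
print(hybrid_ordinal(1000, 1, 1) - gregorian_ordinal(1000, 1, 1))  # 5
print(hybrid_ordinal(2020, 1, 1) - gregorian_ordinal(2020, 1, 1))  # 0
```

Under this sketch, the two calendars disagree by 5 days around the year 1000, while any date on or after 1582-10-15 gets the same day count in both, which is why only ancient dates and timestamps are affected by rebasing.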
Test build #120580 has finished for PR 28057 at commit