
Conversation

@MaxGekk (Member) commented Mar 27, 2020

What changes were proposed in this pull request?

In the PR, I propose to add a new benchmark, DateTimeRebaseBenchmark, which measures the performance of rebasing dates/timestamps between the hybrid calendar (Julian + Gregorian) and the Proleptic Gregorian calendar:

  1. In write, it separately saves dates and timestamps before and after the year 1582, w/ and w/o rebasing.
  2. In read, it loads the previously saved parquet files via the vectorized reader and via the regular reader (see the sketch below).
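
For readers who want the shape of the benchmark at a glance, here is a minimal sketch of the write/read structure described above. It is illustrative only, not the PR's code: `genDF` is a hypothetical stand-in for the PR's generator, the row count and timestamp bases are made up, and the rebase config key is the one introduced by #27915 (an assumption; check SQLConf in your build for the exact name).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DateTimeRebaseSketch {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Hypothetical stand-in for the PR's genDF helper: rows of timestamps
  // either well after 1582 or well before it (illustrative epoch seconds).
  def genDF(rows: Long, after1582: Boolean): DataFrame = {
    val base = if (after1582) 946684800L else -30610224000L // ~2000-01-01 vs ~1000-01-01
    spark.range(rows).selectExpr(s"CAST(id + ${base}L AS TIMESTAMP) AS ts")
  }

  def main(args: Array[String]): Unit = {
    val path = "/tmp/rebase-bench"
    // Write: save timestamps before/after 1582, w/ and w/o rebasing.
    for (after1582 <- Seq(true, false); rebase <- Seq(false, true)) {
      spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", rebase)
      genDF(1000000L, after1582).write.mode("overwrite")
        .parquet(s"$path/after1582=$after1582/rebase=$rebase")
    }
    // Read: load the saved files with the vectorized reader on and off.
    for (vectorized <- Seq(true, false)) {
      spark.conf.set("spark.sql.parquet.enableVectorizedReader", vectorized)
      spark.read.parquet(s"$path/after1582=true/rebase=true")
        .write.format("noop").mode("overwrite").save() // materialize without side effects
    }
  }
}
```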

Here is the summary of benchmarking:

  • Saving timestamps is ~6 times slower
  • Loading timestamps w/ vectorized off is ~4 times slower
  • Loading timestamps w/ vectorized on is ~10 times slower

Why are the changes needed?

To know the impact of date-time rebasing introduced by #27915, #27953, #27807.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Run the DateTimeRebaseBenchmark benchmark using Amazon EC2:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

@MaxGekk (Member, Author) commented Mar 27, 2020

@cloud-fan @HyukjinKwon @dongjoon-hyun Here are the intermediate results of benchmarking timestamp rebasing in Parquet.

@dongjoon-hyun (Member)

Thank you for the benchmark. Ya. It's an expected drawback.

@SparkQA commented Mar 28, 2020

Test build #120515 has finished for PR 28057 at commit e217139.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Mar 28, 2020

> It's an expected drawback.

Parquet and Avro perform rebasing only if a SQL config is enabled (and the config is off by default). ORC always does rebasing, so I would expect some slowdown in ORC too.
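
For example, a minimal sketch of toggling that behavior, assuming the config key introduced by #27915 (it was renamed in later releases, so check SQLConf for the exact name in your build):

```scala
// Assumption: this key is the one added by #27915; off by default.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
// With rebasing on, pre-1582 values written by Spark 2.x read back consistently.
spark.read.parquet("/path/written-by-spark-2.x").show()
```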

@SparkQA commented Mar 28, 2020

Test build #120535 has finished for PR 28057 at commit 8703579.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 29, 2020

Test build #120538 has finished for PR 28057 at commit 912dee4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [WIP][SQL] Benchmark rebasing of dates/timestamps [WIP][SPARK-31294][SQL] Benchmark rebasing of dates/timestamps Mar 29, 2020
@HyukjinKwon (Member)

Yes, thank you so much for the benchmarks.

@MaxGekk MaxGekk changed the title [WIP][SPARK-31294][SQL] Benchmark rebasing of dates/timestamps [WIP][SPARK-31296][SQL] Benchmark rebasing of dates/timestamps Mar 29, 2020
@MaxGekk MaxGekk changed the title [WIP][SPARK-31296][SQL] Benchmark rebasing of dates/timestamps [WIP][SPARK-31296][SQL] Benchmark date-time rebasing to/from Julian calendar Mar 29, 2020
@MaxGekk MaxGekk changed the title [WIP][SPARK-31296][SQL] Benchmark date-time rebasing to/from Julian calendar [SPARK-31296][SQL] Benchmark date-time rebasing in Parquet datasource Mar 29, 2020
@SparkQA commented Mar 29, 2020

Test build #120552 has finished for PR 28057 at commit c89f2c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31296][SQL] Benchmark date-time rebasing in Parquet datasource [SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource Mar 29, 2020

override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
  withTempPath { path =>
    runBenchmark("Parquet read/write") {
Member

Could you use a more specific benchmark title, because this is used in the generated files?

Member Author

Isn't the name scoped by the concrete benchmark?

Member Author

I will address this together with the other comments, because launching an EC2 instance and re-running the benchmark twice (for JDK 8 & 11) is a time-consuming process.

Contributor

+1 to making the title mention rebasing.

Member Author

I am going to replace it with "Rebasing dates/timestamps in Parquet datasource".
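
That is, a sketch of the agreed rename, with nothing else changed:

```scala
runBenchmark("Rebasing dates/timestamps in Parquet datasource") {
  // ... benchmark cases unchanged ...
}
```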

Seq("date", "timestamp").foreach { dateTime =>
val benchmark = new Benchmark(s"Save ${dateTime}s to parquet", rowsNum, output = output)
benchmark.addCase("after 1582, noop", 1) { _ =>
genDF(rowsNum, dateTime, after1582 = true).noop()
Contributor

Do you include the DataFrame generation in the benchmark numbers? I think it should be excluded.

Member Author

We have already discussed this in PRs for other benchmarks. The overhead of preparing the input DataFrame is assumed to be subtracted from the other numbers.

Member Author

For example:

after 1582, noop                                   9272           9272           0         10.8          92.7       1.0X
after 1582, rebase off                            21841          21841           0          4.6         218.4       0.4X

The noop benchmark shows the unavoidable overhead. If we subtract it, we get 21841 - 9272 = 12569 ms for the write itself, so the overhead of preparing the input data is roughly 42% of the total (9272 / 21841). I do believe this is important info, and we should keep it in the benchmark results.
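
Spelled out with the numbers above (values copied from the table; the variable names are mine, not the PR's):

```scala
// Subtracting the noop baseline isolates the pure parquet-write cost.
val noopMs      = 9272.0   // generate + materialize input only
val rebaseOffMs = 21841.0  // generate input + write parquet, rebase off
val writeMs     = rebaseOffMs - noopMs // 12569.0 ms of actual write work
val inputShare  = noopMs / rebaseOffMs // ≈ 0.42 of the total is input prep
println(f"write: $writeMs%.0f ms, input share: ${inputShare * 100}%.0f%%")
```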

@SparkQA commented Mar 30, 2020

Test build #120577 has finished for PR 28057 at commit e0aedf5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@cloud-fan cloud-fan closed this in a1dbcd1 Mar 30, 2020
@cloud-fan (Contributor)

It's just a benchmark, so no need to wait for Jenkins.

Thanks, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Mar 30, 2020
[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource

(Commit message body identical to the PR description above.)

Closes #28057 from MaxGekk/rebase-bechmark.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a1dbcd1)
Signed-off-by: Wenchen Fan <[email protected]>
@SparkQA commented Mar 30, 2020

Test build #120580 has finished for PR 28057 at commit e0aedf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource
@MaxGekk MaxGekk deleted the rebase-bechmark branch June 5, 2020 19:46