[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource #28057
Conversation
@cloud-fan @HyukjinKwon @dongjoon-hyun Here are intermediate results of benchmarking timestamp rebasing in Parquet.

Thank you for the benchmark. Ya, it's an expected drawback.

Test build #120515 has finished for PR 28057 at commit

Parquet and Avro perform rebasing only if a SQL config is enabled (and the config is off by default). ORC always does rebasing, so I would expect some slowdown in ORC too.
Test build #120535 has finished for PR 28057 at commit

Test build #120538 has finished for PR 28057 at commit

Yes, thank you so much for the benchmarks.

Test build #120552 has finished for PR 28057 at commit
```scala
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
  withTempPath { path =>
    runBenchmark("Parquet read/write") {
```
Could you use a more specific benchmark title, because this is used in the generated files?
Isn't the name scoped by the concrete benchmark?
I will address the comments together with the other comments, because launching an EC2 instance and re-running the benchmark twice (for JDK 8 and 11) is a time-consuming process.
+1 to make the title mention second rebase.
I am going to replace it with "Rebasing dates/timestamps in Parquet datasource".
```scala
Seq("date", "timestamp").foreach { dateTime =>
  val benchmark = new Benchmark(s"Save ${dateTime}s to parquet", rowsNum, output = output)
  benchmark.addCase("after 1582, noop", 1) { _ =>
    genDF(rowsNum, dateTime, after1582 = true).noop()
```
Do you include the dataframe generation in the benchmark numbers? I think it should be excluded.
We have already discussed this in PRs for other benchmarks. The overhead of preparing the input dataframe is assumed to be subtracted from the other numbers.
For example:

```
after 1582, noop            9272   9272   0   10.8    92.7   1.0X
after 1582, rebase off     21841  21841   0    4.6   218.4   0.4X
```

The noop benchmark shows unavoidable overhead. If we subtract it, we get 21841 - 9272 = 12569. So, the overhead of preparing the input data is roughly 45%. I do believe this is important info, and we should keep it in the benchmark results.
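The subtraction above can be sketched in plain Python (a hypothetical helper, not part of the benchmark code; the numbers are the ones quoted from the benchmark output):

```python
# Sketch of isolating the cost of the step under test by subtracting
# the "noop" baseline, which measures only input-data preparation.

def subtract_baseline(total_ms: int, noop_ms: int) -> int:
    """Time attributable to the benchmarked step itself, in ms."""
    return total_ms - noop_ms

noop_ms = 9272         # "after 1582, noop"
rebase_off_ms = 21841  # "after 1582, rebase off"

write_only_ms = subtract_baseline(rebase_off_ms, noop_ms)
prep_share = noop_ms / rebase_off_ms  # fraction spent preparing input data

print(write_only_ms)         # 12569
print(round(prep_share, 2))  # 0.42
```

Keeping the noop case in the published results lets readers do this normalization themselves for any pair of rows.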
Test build #120577 has finished for PR 28057 at commit

retest this please

It's just a benchmark, so no need to wait for Jenkins. Thanks, merging to master/3.0!
[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource

### What changes were proposed in this pull request?
In the PR, I propose to add a new benchmark `DateTimeRebaseBenchmark` which measures the performance of rebasing dates/timestamps between the hybrid calendar (Julian + Gregorian) and the Proleptic Gregorian calendar:
1. In write, it saves dates and timestamps before and after the year 1582 separately, w/ and w/o rebasing.
2. In read, it loads the previously saved parquet files by the vectorized reader and by the regular reader.

Here is the summary of benchmarking:
- Saving timestamps is **~6 times slower**
- Loading timestamps w/ vectorized **off** is **~4 times slower**
- Loading timestamps w/ vectorized **on** is **~10 times slower**

### Why are the changes needed?
To know the impact of date-time rebasing introduced by #27915, #27953, #27807.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run the `DateTimeRebaseBenchmark` benchmark using Amazon EC2:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28057 from MaxGekk/rebase-bechmark.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a1dbcd1)
Signed-off-by: Wenchen Fan <[email protected]>
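For intuition about what the benchmarked rebasing does: the same local date corresponds to different epoch-day counts in the hybrid Julian+Gregorian calendar and the proleptic Gregorian calendar. Below is a minimal plain-Python illustration of that offset (my own sketch, not Spark's implementation; all function names are made up):

```python
# Day counts for the same local date under two calendar systems.
# The hybrid calendar is Julian before the Gregorian reform cutover
# (1582-10-15) and Gregorian from the cutover onward.

DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def day_of_year(year: int, month: int, day: int, leap: bool) -> int:
    doy = sum(DAYS_IN_MONTH[: month - 1]) + day
    if leap and month > 2:
        doy += 1
    return doy

def julian_ordinal(year, month, day):
    # Julian calendar: a leap year every 4 years, no century rule.
    leap = year % 4 == 0
    return 365 * (year - 1) + (year - 1) // 4 + day_of_year(year, month, day, leap)

def gregorian_ordinal(year, month, day):
    # Proleptic Gregorian: century years are leap only if divisible by 400.
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    y = year - 1
    return 365 * y + y // 4 - y // 100 + y // 400 + day_of_year(year, month, day, leap)

# Anchor the hybrid calendar so Julian 1582-10-04 is immediately followed
# by Gregorian 1582-10-15 (the 10 days dropped by the Gregorian reform).
HYBRID_SHIFT = gregorian_ordinal(1582, 10, 15) - (julian_ordinal(1582, 10, 4) + 1)

def hybrid_ordinal(year, month, day):
    if (year, month, day) >= (1582, 10, 15):
        return gregorian_ordinal(year, month, day)
    return julian_ordinal(year, month, day) + HYBRID_SHIFT

# The same local date yields different day counts, so a reader that
# interprets stored day counts in the other calendar must rebase them.
print(hybrid_ordinal(1000, 1, 1) - gregorian_ordinal(1000, 1, 1))  # 5
print(hybrid_ordinal(2020, 1, 1) - gregorian_ordinal(2020, 1, 1))  # 0
```

Under this sketch, the two calendars disagree by 5 days around the year 1000, while any date on or after 1582-10-15 gets the same day count in both, which is why only ancient dates and timestamps are affected by rebasing.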
Test build #120580 has finished for PR 28057 at commit