-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-32130][SQL] Disable the JSON option inferTimestamp by default
#28966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32130][SQL] Disable the JSON option inferTimestamp by default
#28966
Conversation
|
I am running JSONBenchmark now. If it will show significantly different numbers, I will push the changes to this PR. |
|
Shall we add migration guide as well? |
Some numbers are different on JDK 8, I am generating benchmark results on JDK 11, and will push the changes soon. |
|
Thank you, @MaxGekk ! |
|
I am going to add this to the SQL migration guide: The question is about the version from which we are going to disable the inference - Spark 3.1 or Spark 3.0.1. |
|
I set the target version to 3.0.1. This should go to 3.0.1. |
|
Yea this should go to branch-3.0 to fix the perf regression. AFAIK migration guide is only for major/feature releases, so there is no migration guide for 3.0.1. Ideally 3.0.1 should be Spark 3.0 when it's released, so we can simply remove that item from the 3.0 migration guide. |
|
@cloud-fan There are items for minor versions too |
|
Test build #124770 has finished for PR 28966 at commit
|
|
This #28966 (comment) took 7 hr 37 min :-( |
| from_json(date) 42465 42481 13 0.2 4246.5 0.1X | ||
| read timestamp text from files 2598 2613 18 3.8 259.8 1.0X | ||
| read timestamps from files 42007 42028 19 0.2 4200.7 0.1X | ||
| infer timestamps from files 18102 18120 28 0.6 1810.2 0.1X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but this test case seems to be supposed to run infer timestamp always. Why does this become faster?
- infer timestamps from files 84621 84939 526 0.1 8462.1 0.0X
+ infer timestamps from files 18102 18120 28 0.6 1810.2 0.1X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we need to update JsonBenchmark together.
readBench.addCase("infer timestamps from files", numIters) { _ =>
spark.read.json(timestampDir).noop()
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind. Let's update the benchmark suite later as a follow-up.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master/3.0. Thank you all.
(All UTs including R already passed in the on-going run.)
Set the JSON option `inferTimestamp` to `false` if an user don't pass it as datasource option. To prevent perf regression while inferring schemas from JSON with potential timestamps fields. Yes - Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`. - Regenerated results of `JsonBenchmark` in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 | Closes #28966 from MaxGekk/json-inferTimestamps-disable-by-default. Authored-by: Max Gekk <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit bcf2330) Signed-off-by: Dongjoon Hyun <[email protected]>
|
Test build #124800 has finished for PR 28966 at commit
|
|
It takes 7 hours ..? I think we should really revisit the test and cut down. Or maybe it's time to think about splitting and migrating to Github Actions completely .. |
…mark ### What changes were proposed in this pull request? Set the JSON option `inferTimestamp` to `true` for the cases that measure perf of timestamp inference. ### Why are the changes needed? The PR #28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure perf of timestamp inference from JSON fields. This PR explicitly enable such inference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By re-generating results of `JsonBenchmark`. Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup. Authored-by: Max Gekk <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…mark ### What changes were proposed in this pull request? Set the JSON option `inferTimestamp` to `true` for the cases that measure perf of timestamp inference. ### Why are the changes needed? The PR #28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure perf of timestamp inference from JSON fields. This PR explicitly enable such inference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By re-generating results of `JsonBenchmark`. Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup. Authored-by: Max Gekk <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 42f01e3) Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Set the JSON option
inferTimestamptofalseif an user don't pass it as datasource option.Why are the changes needed?
To prevent perf regression while inferring schemas from JSON with potential timestamps fields.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
JsonSuiteandJsonInferSchemaSuite.JsonBenchmarkin the environment: