Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jul 1, 2020

What changes were proposed in this pull request?

Set the JSON option inferTimestamp to false if an user don't pass it as datasource option.

Why are the changes needed?

To prevent perf regression while inferring schemas from JSON with potential timestamps fields.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

  • Modified existing tests in JsonSuite and JsonInferSchemaSuite.
  • Regenerated results of JsonBenchmark in the environment:
Item Description
Region us-west-2 (Oregon)
Instance r3.xlarge
AMI ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5)
Java OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 1, 2020

I am running JSONBenchmark now. If it will show significantly different numbers, I will push the changes to this PR.

@gengliangwang
Copy link
Member

Shall we add migration guide as well?

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 1, 2020

If it will show significantly different numbers ...

Some numbers are different on JDK 8, I am generating benchmark results on JDK 11, and will push the changes soon.

@dongjoon-hyun
Copy link
Member

Thank you, @MaxGekk !

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 1, 2020

I am going to add this to the SQL migration guide:

  - In Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Since version 3.1, the timestamp type inference is disabled by default. Set the JSON option `inferTimestamp` to `true` to enable such type inference.

The question is about the version from which we are going to disable the inference - Spark 3.1 or Spark 3.0.1.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jul 1, 2020

I set the target version to 3.0.1. This should go to 3.0.1.
This is a regression due to SPARK-26246, isn't it?

@cloud-fan
Copy link
Contributor

Yea this should go to branch-3.0 to fix the perf regression.

AFAIK migration guide is only for major/feature releases, so there is no migration guide for 3.0.1. Ideally 3.0.1 should be Spark 3.0 when it's released, so we can simply remove that item from the 3.0 migration guide.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 1, 2020

@cloud-fan There are items for minor versions too Upgrading from Spark SQL 2.4 to 2.4.1. See at docs/sql-migration-guide.md. I am going to add a separate section Upgrading from Spark SQL 3.0 to 3.0.1

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124770 has finished for PR 28966 at commit 277a679.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 1, 2020

This #28966 (comment) took 7 hr 37 min :-(

from_json(date) 42465 42481 13 0.2 4246.5 0.1X
read timestamp text from files 2598 2613 18 3.8 259.8 1.0X
read timestamps from files 42007 42028 19 0.2 4200.7 0.1X
infer timestamps from files 18102 18120 28 0.6 1810.2 0.1X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but this test case seems to be supposed to run infer timestamp always. Why does this become faster?

- infer timestamps from files                       84621          84939         526          0.1        8462.1       0.0X
+ infer timestamps from files                       18102          18120          28          0.6        1810.2       0.1X

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we need to update JsonBenchmark together.

readBench.addCase("infer timestamps from files", numIters) { _ =>
    spark.read.json(timestampDir).noop()
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. Let's update the benchmark suite later as a follow-up.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master/3.0. Thank you all.
(All UTs including R already passed in the on-going run.)

dongjoon-hyun pushed a commit that referenced this pull request Jul 1, 2020
Set the JSON option `inferTimestamp` to `false` if an user don't pass it as datasource option.

To prevent perf regression while inferring schemas from JSON with potential timestamps fields.

Yes

- Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`.
- Regenerated results of `JsonBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28966 from MaxGekk/json-inferTimestamps-disable-by-default.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bcf2330)
Signed-off-by: Dongjoon Hyun <[email protected]>
@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124800 has finished for PR 28966 at commit 43fc9aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

It takes 7 hours ..? I think we should really revisit the test and cut down. Or maybe it's time to think about splitting and migrating to Github Actions completely ..

dongjoon-hyun pushed a commit that referenced this pull request Jul 2, 2020
…mark

### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `true` for the cases that measure perf of timestamp inference.

### Why are the changes needed?
The PR #28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure perf of timestamp inference from JSON fields. This PR explicitly enable such inference.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By re-generating results of `JsonBenchmark`.

Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jul 2, 2020
…mark

### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `true` for the cases that measure perf of timestamp inference.

### Why are the changes needed?
The PR #28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure perf of timestamp inference from JSON fields. This PR explicitly enable such inference.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By re-generating results of `JsonBenchmark`.

Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 42f01e3)
Signed-off-by: Dongjoon Hyun <[email protected]>
@MaxGekk MaxGekk deleted the json-inferTimestamps-disable-by-default branch December 11, 2020 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants