Commit bdbfe6e

MaxGekk authored and dongjoon-hyun committed
[SPARK-32130][SQL] Disable the JSON option inferTimestamp by default
### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `false` if a user doesn't pass it as a datasource option.

### Why are the changes needed?
To prevent a perf regression while inferring schemas from JSON with potential timestamp fields.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`.
- Regenerated results of `JsonBenchmark` in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28966 from MaxGekk/json-inferTimestamps-disable-by-default.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bcf2330)
Signed-off-by: Dongjoon Hyun <[email protected]>
1 parent fef3379 commit bdbfe6e

6 files changed: +129 -111 lines changed

docs/sql-migration-guide.md

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ license: |
 * Table of contents
 {:toc}

+## Upgrading from Spark SQL 3.0 to 3.0.1
+
+- In Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option `inferTimestamp` to `true` to enable such type inference.
+
 ## Upgrading from Spark SQL 2.4 to 3.0

 ### Dataset/DataFrame APIs
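
The migration note added above can be tried with a short sketch. It assumes Spark 3.0.1 or later (the `spark-sql` artifact) on the classpath and a local session; the names `InferTimestampExample` and `inferredType` are illustrative and not part of this commit, and the sample record and pattern are borrowed from `JsonInferSchemaSuite`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DataType

// Illustrative helper, not part of the commit.
object InferTimestampExample {
  private val pattern = "yyyy-MM-dd'T'HH:mm:ss.SSS"
  private val sample = """{"a": "2018-12-02T21:04:00.123"}"""

  /** Infers the type of field `a`; `inferTimestamp = None` leaves the option unset. */
  def inferredType(spark: SparkSession, inferTimestamp: Option[Boolean]): DataType = {
    import spark.implicits._
    val base = spark.read.option("timestampFormat", pattern)
    // Only pass the option when the caller asked for an explicit value.
    val reader = inferTimestamp.fold(base)(flag => base.option("inferTimestamp", flag))
    reader.json(Seq(sample).toDS()).schema("a").dataType
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("infer-timestamp-sketch")
      .getOrCreate()
    try {
      println(s"option unset: ${inferredType(spark, None)}")
      println(s"inferTimestamp=true: ${inferredType(spark, Some(true))}")
    } finally {
      spark.stop()
    }
  }
}
```

Per the migration note, the first call should report a string type on 3.0.1 and later, and the second a timestamp type. Supplying an explicit schema instead of relying on inference avoids the question entirely.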

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala

Lines changed: 1 addition & 1 deletion
@@ -133,7 +133,7 @@ private[sql] class JSONOptions(
    * Enables inferring of TimestampType from strings matched to the timestamp pattern
    * defined by the timestampFormat option.
    */
-  val inferTimestamp: Boolean = parameters.get("inferTimestamp").map(_.toBoolean).getOrElse(true)
+  val inferTimestamp: Boolean = parameters.get("inferTimestamp").map(_.toBoolean).getOrElse(false)

   /** Build a Jackson [[JsonFactory]] using JSON options. */
   def buildJsonFactory(): JsonFactory = {
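
The one-line change above only moves the fallback value. A minimal stand-alone sketch of the same lookup, using a plain `Map` as a stand-in for the `parameters` object (an assumption: the real class may handle key casing differently) and a hypothetical object name:

```scala
// Hypothetical stand-in for the option lookup in JSONOptions; not Spark's class.
object InferTimestampDefault {
  /** Resolves the flag from user-supplied options, falling back to `false`. */
  def resolve(parameters: Map[String, String]): Boolean =
    parameters.get("inferTimestamp").map(_.toBoolean).getOrElse(false)

  def main(args: Array[String]): Unit = {
    println(resolve(Map.empty))                        // option not passed
    println(resolve(Map("inferTimestamp" -> "true")))  // explicit opt-in
  }
}
```

Because the value goes through `toBoolean`, a string other than `true` or `false` (in any letter case) raises an `IllegalArgumentException` rather than silently falling back to the default.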

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala

Lines changed: 33 additions & 23 deletions
@@ -35,22 +35,29 @@ class JsonInferSchemaSuite extends SparkFunSuite with SQLHelper {
     assert(inferSchema.inferField(parser) === expectedType)
   }

-  def checkTimestampType(pattern: String, json: String): Unit = {
-    checkType(Map("timestampFormat" -> pattern), json, TimestampType)
+  def checkTimestampType(pattern: String, json: String, inferTimestamp: Boolean): Unit = {
+    checkType(
+      Map("timestampFormat" -> pattern, "inferTimestamp" -> inferTimestamp.toString),
+      json,
+      if (inferTimestamp) TimestampType else StringType)
   }

   test("inferring timestamp type") {
-    Seq("legacy", "corrected").foreach { legacyParserPolicy =>
-      withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> legacyParserPolicy) {
-        checkTimestampType("yyyy", """{"a": "2018"}""")
-        checkTimestampType("yyyy=MM", """{"a": "2018=12"}""")
-        checkTimestampType("yyyy MM dd", """{"a": "2018 12 02"}""")
-        checkTimestampType(
-          "yyyy-MM-dd'T'HH:mm:ss.SSS",
-          """{"a": "2018-12-02T21:04:00.123"}""")
-        checkTimestampType(
-          "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX",
-          """{"a": "2018-12-02T21:04:00.123567+01:00"}""")
+    Seq(true, false).foreach { inferTimestamp =>
+      Seq("legacy", "corrected").foreach { legacyParserPolicy =>
+        withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> legacyParserPolicy) {
+          checkTimestampType("yyyy", """{"a": "2018"}""", inferTimestamp)
+          checkTimestampType("yyyy=MM", """{"a": "2018=12"}""", inferTimestamp)
+          checkTimestampType("yyyy MM dd", """{"a": "2018 12 02"}""", inferTimestamp)
+          checkTimestampType(
+            "yyyy-MM-dd'T'HH:mm:ss.SSS",
+            """{"a": "2018-12-02T21:04:00.123"}""",
+            inferTimestamp)
+          checkTimestampType(
+            "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX",
+            """{"a": "2018-12-02T21:04:00.123567+01:00"}""",
+            inferTimestamp)
+        }
       }
     }
   }
@@ -71,16 +78,19 @@ class JsonInferSchemaSuite extends SparkFunSuite with SQLHelper {
   }

   test("skip decimal type inferring") {
-    Seq("legacy", "corrected").foreach { legacyParserPolicy =>
-      withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> legacyParserPolicy) {
-        checkType(
-          options = Map(
-            "prefersDecimal" -> "false",
-            "timestampFormat" -> "yyyyMMdd.HHmmssSSS"
-          ),
-          json = """{"a": "20181202.210400123"}""",
-          dt = TimestampType
-        )
+    Seq(true, false).foreach { inferTimestamp =>
+      Seq("legacy", "corrected").foreach { legacyParserPolicy =>
+        withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> legacyParserPolicy) {
+          checkType(
+            options = Map(
+              "prefersDecimal" -> "false",
+              "timestampFormat" -> "yyyyMMdd.HHmmssSSS",
+              "inferTimestamp" -> inferTimestamp.toString
+            ),
+            json = """{"a": "20181202.210400123"}""",
+            dt = if (inferTimestamp) TimestampType else StringType
+          )
+        }
       }
     }
   }
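
The updated tests expect a string field to be typed as a timestamp only when `inferTimestamp` is on and the value matches `timestampFormat`. The following is a simplified model of that rule in plain Scala, not Spark's inference code: it uses `java.time.format.DateTimeFormatter` directly, ignores the legacy parser policy, and all names in it are hypothetical.

```scala
import java.time.format.DateTimeFormatter
import scala.util.Try

// Simplified, hypothetical model of flag-gated timestamp inference.
object TimestampInferenceSketch {
  sealed trait InferredType
  case object TimestampLike extends InferredType
  case object StringLike extends InferredType

  def inferStringField(value: String, pattern: String, inferTimestamp: Boolean): InferredType = {
    // `&&` short-circuits, so the trial parse is skipped when the flag is off.
    val matches = inferTimestamp &&
      Try(DateTimeFormatter.ofPattern(pattern).parse(value)).isSuccess
    if (matches) TimestampLike else StringLike
  }

  def main(args: Array[String]): Unit = {
    val pattern = "yyyy-MM-dd'T'HH:mm:ss.SSS"
    val value = "2018-12-02T21:04:00.123"
    println(inferStringField(value, pattern, inferTimestamp = true))
    println(inferStringField(value, pattern, inferTimestamp = false))
  }
}
```

In this model the cost of enabling the flag is one attempted parse per string value, which is the kind of per-value work the commit message cites as the source of the regression.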

sql/core/benchmarks/JsonBenchmark-jdk11-results.txt

Lines changed: 43 additions & 43 deletions
@@ -7,106 +7,106 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       68879          68993         116          1.5         688.8       1.0X
-UTF-8 is set                                     115270         115602         455          0.9        1152.7       0.6X
+No encoding                                       69219          69342         116          1.4         692.2       1.0X
+UTF-8 is set                                     143950         143986          55          0.7        1439.5       0.5X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       47452          47538         113          2.1         474.5       1.0X
-UTF-8 is set                                      77330          77354          30          1.3         773.3       0.6X
+No encoding                                       57828          57913         136          1.7         578.3       1.0X
+UTF-8 is set                                      83649          83711          60          1.2         836.5       0.7X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       60470          60900         534          0.2        6047.0       1.0X
-UTF-8 is set                                     104733         104931         189          0.1       10473.3       0.6X
+No encoding                                       64560          65193        1023          0.2        6456.0       1.0X
+UTF-8 is set                                     102925         103174         216          0.1       10292.5       0.6X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                      130302         131072         976          0.0      260604.6       1.0X
-UTF-8 is set                                     150860         151284         377          0.0      301720.1       0.9X
+No encoding                                      131002         132316        1160          0.0      262003.1       1.0X
+UTF-8 is set                                     152128         152371         332          0.0      304256.5       0.9X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns                                 18619          18684          99          0.5        1861.9       1.0X
-Select 1 column                                   24227          24270          38          0.4        2422.7       0.8X
+Select 10 columns                                 19376          19514         160          0.5        1937.6       1.0X
+Select 1 column                                   24089          24156          58          0.4        2408.9       0.8X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Short column without encoding                      7947           7971          21          1.3         794.7       1.0X
-Short column with UTF-8                           12700          12753          58          0.8        1270.0       0.6X
-Wide column without encoding                      92632          92955         463          0.1        9263.2       0.1X
-Wide column with UTF-8                           147013         147170         188          0.1       14701.3       0.1X
+Short column without encoding                      8131           8219         103          1.2         813.1       1.0X
+Short column with UTF-8                           13464          13508          44          0.7        1346.4       0.6X
+Wide column without encoding                     108012         108598         914          0.1       10801.2       0.1X
+Wide column with UTF-8                           150988         151369         412          0.1       15098.8       0.1X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           713            734          19         14.0          71.3       1.0X
-from_json                                         22019          22429         456          0.5        2201.9       0.0X
-json_tuple                                        27987          28047          74          0.4        2798.7       0.0X
-get_json_object                                   21468          21870         350          0.5        2146.8       0.0X
+Text read                                           753            765          18         13.3          75.3       1.0X
+from_json                                         23182          23446         230          0.4        2318.2       0.0X
+json_tuple                                        31129          31304         181          0.3        3112.9       0.0X
+get_json_object                                   22821          23073         225          0.4        2282.1       0.0X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                          2887           2910          24         17.3          57.7       1.0X
-schema inferring                                  31793          31843          43          1.6         635.9       0.1X
-parsing                                           36791          37104         294          1.4         735.8       0.1X
+Text read                                          3078           3101          26         16.2          61.6       1.0X
+schema inferring                                  30225          30434         333          1.7         604.5       0.1X
+parsing                                           32237          32308          63          1.6         644.7       0.1X

 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                         10570          10611          45          4.7         211.4       1.0X
-Schema inferring                                  48729          48763          41          1.0         974.6       0.2X
-Parsing without charset                           35490          35648         141          1.4         709.8       0.3X
-Parsing with UTF-8                                63853          63994         163          0.8        1277.1       0.2X
+Text read                                         10835          10900          86          4.6         216.7       1.0X
+Schema inferring                                  37720          37805         110          1.3         754.4       0.3X
+Parsing without charset                           35464          35538         100          1.4         709.3       0.3X
+Parsing with UTF-8                                67311          67738         381          0.7        1346.2       0.2X

 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps                     2187           2190           5          4.6         218.7       1.0X
-to_json(timestamp)                                16262          16503         323          0.6        1626.2       0.1X
-write timestamps to files                         11679          11692          12          0.9        1167.9       0.2X
-Create a dataset of dates                          2297           2310          12          4.4         229.7       1.0X
-to_json(date)                                     10904          10956          46          0.9        1090.4       0.2X
-write dates to files                               6610           6645          35          1.5         661.0       0.3X
+Create a dataset of timestamps                     2208           2222          14          4.5         220.8       1.0X
+to_json(timestamp)                                14299          14570         285          0.7        1429.9       0.2X
+write timestamps to files                         12955          12969          13          0.8        1295.5       0.2X
+Create a dataset of dates                          2297           2323          30          4.4         229.7       1.0X
+to_json(date)                                      8509           8561          74          1.2         850.9       0.3X
+write dates to files                               6786           6827          45          1.5         678.6       0.3X

 OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files                     2524           2530           9          4.0         252.4       1.0X
-read timestamps from files                        41002          41052          59          0.2        4100.2       0.1X
-infer timestamps from files                       84621          84939         526          0.1        8462.1       0.0X
-read date text from files                          2292           2302           9          4.4         229.2       1.1X
-read date from files                              16954          16976          21          0.6        1695.4       0.1X
-timestamp strings                                  3067           3077          13          3.3         306.7       0.8X
-parse timestamps from Dataset[String]             48690          48971         243          0.2        4869.0       0.1X
-infer timestamps from Dataset[String]             97463          97786         338          0.1        9746.3       0.0X
-date strings                                       3952           3956           3          2.5         395.2       0.6X
-parse dates from Dataset[String]                  24210          24241          30          0.4        2421.0       0.1X
-from_json(timestamp)                              71710          72242         629          0.1        7171.0       0.0X
-from_json(date)                                   42465          42481          13          0.2        4246.5       0.1X
+read timestamp text from files                     2598           2613          18          3.8         259.8       1.0X
+read timestamps from files                        42007          42028          19          0.2        4200.7       0.1X
+infer timestamps from files                       18102          18120          28          0.6        1810.2       0.1X
+read date text from files                          2355           2360           5          4.2         235.5       1.1X
+read date from files                              17420          17458          33          0.6        1742.0       0.1X
+timestamp strings                                  3099           3101           3          3.2         309.9       0.8X
+parse timestamps from Dataset[String]             48188          48215          25          0.2        4818.8       0.1X
+infer timestamps from Dataset[String]             22929          22988         102          0.4        2292.9       0.1X
+date strings                                       4090           4103          11          2.4         409.0       0.6X
+parse dates from Dataset[String]                  24952          25068         139          0.4        2495.2       0.1X
+from_json(timestamp)                              66038          66352         413          0.2        6603.8       0.0X
+from_json(date)                                   43755          43782          27          0.2        4375.5       0.1X

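
The two "infer timestamps" rows in the last table are where the regenerated numbers move most. A small sketch of the arithmetic, using the Best Time(ms) values from those rows; the object name is hypothetical, and the comparison assumes the benchmark leaves `inferTimestamp` unset, so the new figures reflect inference with the option at its new default.

```scala
// Hypothetical helper; the figures are Best Time(ms) values from the table above.
object InferSpeedup {
  val fromFiles: (Double, Double) = (84621.0, 18102.0)    // before, after
  val fromDataset: (Double, Double) = (97463.0, 22929.0)  // before, after

  /** Ratio of the old best time to the new best time. */
  def speedup(beforeMs: Double, afterMs: Double): Double = beforeMs / afterMs

  def main(args: Array[String]): Unit = {
    println(f"infer timestamps from files: ${speedup(fromFiles._1, fromFiles._2)}%.2fx")
    println(f"infer timestamps from Dataset[String]: ${speedup(fromDataset._1, fromDataset._2)}%.2fx")
  }
}
```

The ratios come out at roughly 4.7x for files and 4.3x for `Dataset[String]`. The other rows differ by far smaller margins between the two runs.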