Commit c2536a7
[SPARK-39469][SQL] Infer date type for CSV schema inference
### What changes were proposed in this pull request?
1. Add a new `inferDate` option to CSV Options. The description is:
> Whether or not to infer columns that satisfy the `dateFormat` option as `Date`. Requires `inferSchema` to be true. When `false`, columns with dates will be inferred as `String` (or as `Timestamp` if it fits the `timestampFormat`). Legacy date formats in `Timestamp` columns cannot be parsed with this option.
An error will be thrown if `inferDate` is true when the SQL configuration `spark.sql.legacy.timeParserPolicy` is `LEGACY`. This avoids incorrect schema inferences caused by legacy time parsers, which do not parse strictly.
Because `inferDate` is an opt-in option, users who do not enable it should see no performance degradation.
2. Modify `inferField` in `CSVInferSchema.scala` to include Date type.
If `typeSoFar` in `inferField` is Date, Timestamp, or TimestampNTZ, we first attempt to parse the field as a Date and then as a Timestamp/TimestampNTZ. We attempt to parse a date even when `typeSoFar` is Timestamp/TimestampNTZ to handle a column that contains a timestamp entry followed by a date entry: both data types should be detected and the column inferred as a timestamp type.
Example:
```
Seq("2010|10|10", "2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("inferDate", "true") // assumed here: date inference is opt-in, so it must be enabled
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd")
  .csv("/tmp/foo")
  .printSchema()
```
Result:
```
root
|-- _c0: timestamp (nullable = true)
```
3. Modify `makeConverter` in `UnivocityParser` to handle Date type entries in a timestamp type column, so that the above example is parsed properly.
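The converter change in item 3 can be pictured with a minimal sketch. This is not the code from `UnivocityParser`; it is a simplified stand-in built on `java.time`, and the object name `TimestampColumnConverter`, the method `toTimestamp`, the two format patterns, and the promotion of a date to midnight without any time zone handling are all assumptions made for illustration.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampColumnConverter {
  // Hypothetical formats, chosen only for this illustration.
  val timestampFormat: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val dateFormat: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd")

  // Convert one cell of a column whose inferred type is timestamp.
  // Try the timestamp format first; if that fails, fall back to the
  // date format and promote the date to midnight of that day.
  // Returns None when the cell matches neither format.
  def toTimestamp(cell: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(cell, timestampFormat))
      .orElse(Try(LocalDate.parse(cell, dateFormat).atStartOfDay()))
      .toOption
}
```

With this shape, a date entry that ends up in a timestamp column because of schema inference still yields a value instead of failing the row.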
### Does this PR introduce _any_ user-facing change?
The new behavior of schema inference when `inferDate = true`:
1. If a column contains only dates, it should be of "date" type in the inferred schema.
   - If the date format and the timestamp format are identical (e.g. both are `yyyy/MM/dd`), entries will default to being interpreted as Date.
2. If a column contains dates and timestamps, it should be of "timestamp" type in the inferred schema.
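The rules above can be sketched as a fold over the cells of one column. This is a simplified model and not the logic in `CSVInferSchema.scala`: it ignores numeric, boolean, and TimestampNTZ types, and the names `DateInferenceSketch`, `ColType`, `inferColumn`, and the two format patterns are hypothetical.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object DateInferenceSketch {
  // Reduced type lattice for illustration: Null -> Date -> Timestamp -> String.
  sealed trait ColType
  case object NullT extends ColType
  case object DateT extends ColType
  case object TimestampT extends ColType
  case object StringT extends ColType

  // Hypothetical formats, chosen only for this illustration.
  private val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  private val timestampFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  private def isDate(s: String): Boolean =
    Try(LocalDate.parse(s, dateFormat)).isSuccess
  private def isTimestamp(s: String): Boolean =
    Try(LocalDateTime.parse(s, timestampFormat)).isSuccess

  // One inference step: combine the type seen so far with the next cell.
  def inferField(typeSoFar: ColType, field: String): ColType = typeSoFar match {
    case StringT => StringT
    case NullT | DateT =>
      // Date is tried first, so it wins whenever a cell fits the date format.
      if (isDate(field)) DateT
      else if (isTimestamp(field)) TimestampT
      else StringT
    case TimestampT =>
      // A date entry in a timestamp column keeps the column a timestamp.
      if (isDate(field) || isTimestamp(field)) TimestampT else StringT
  }

  // Fold the step over every cell of the column.
  def inferColumn(cells: Seq[String]): ColType =
    cells.foldLeft[ColType](NullT)(inferField)
}
```

A column of only dates stays a date column, while a mix of dates and timestamps widens to timestamp regardless of which kind of entry appears first.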
### How was this patch tested?
Unit tests were added to `CSVInferSchemaSuite` and `UnivocityParserSuite`. An end-to-end test was added to `CSVSuite`.
### Benchmarks:
`inferDate` increases parsing/inference time in general. The impact scales with the number of rows (and not the number of columns). For columns of date type (which would be inferred as timestamp when `inferDate=false`), inference and parsing take 30% longer. The performance impact is much greater on columns of timestamp type (taking 30x longer than `inferDate=false`), because each timestamp is first tried as a date (and throws an error) during the inference step.
#### Number of seconds taken to parse each CSV file with `inferDate=true` and `inferDate=false`
| | inferDate=False | inferDate=True | master branch |
|---------------------------------------------|-----------------|----------------|---------------|
| Small file (<100 row/col). Mixed data types | 0.32 | 0.33 | |
| 100K rows. 4 columns. Mixed data types. | 0.70 | 2.80 | 0.70 |
| 20k columns. 4 rows. Mixed Data types. | 16.32 | 15.90 | 13.5 |
| Large file. Only date type. | 2.15 | 3.70 | 2.10 |
| Large file. Only timestamp type. | 2.60 | 77.00 | 2.30 |
Results are the average of 3 trials on the same machine.
Over multiple runs, master branch benchmark times have also shown results that are slower than `inferDate=false` (although the average is slightly faster). Given the +/- 20% variance in results between trials, master branch benchmark results are roughly similar to `inferDate=False` results.
Closes #36871 from Jonathancui123/SPARK-39469-date-infer.
Lead-authored-by: Jonathan Cui <[email protected]>
Co-authored-by: Jonathan Cui <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
1 parent 66b1f79, commit c2536a7
File tree (10 files changed, +219, -15 lines changed):
- core/src/main/resources/error
- docs
- sql
  - catalyst/src
    - main/scala/org/apache/spark/sql
      - catalyst/csv
      - errors
    - test/scala/org/apache/spark/sql/catalyst/csv
  - core/src/test
    - resources/test-data
    - scala/org/apache/spark/sql/execution/datasources/csv