Commit a930445
[SPARK-39731][SQL] Fix issue in CSV and JSON data sources when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy
### What changes were proposed in this pull request?
This PR fixes a correctness issue when reading a CSV or a JSON file with dates in "yyyyMMdd" format:
```
name,mydate
1,2020011
2,20201203
```
or
```
{"date": "2020011"}
{"date": "20201203"}
```
Prior to #32959, reading this CSV file would return:
```
+----+--------------+
|name|mydate |
+----+--------------+
|1 |null |
|2 |2020-12-03 |
+----+--------------+
```
However, after the patch, the invalid date is parsed because of the much more lenient parsing in `DateTimeUtils.stringToDate`, the method treats `2020011` as a full year:
```
+----+--------------+
|name|mydate |
+----+--------------+
|1 |+2020011-01-01|
|2 |2020-12-03 |
+----+--------------+
```
Similar result would be observed in JSON.
This PR attempts to address correctness issue by introducing a new configuration option `enableDateTimeParsingFallback` which allows to enable/disable the backward compatible parsing.
Currently, by default we will fall back to the backward compatible behavior only if parser policy is legacy and no custom pattern was set (this is defined in `UnivocityParser` and `JacksonParser` for csv and json respectively).
### Why are the changes needed?
Fixes a correctness issue in Spark 3.4.
### Does this PR introduce _any_ user-facing change?
In order to avoid correctness issues when reading CSV or JSON files with a custom pattern, a new configuration option `enableDateTimeParsingFallback` has been added to control whether or not the code would fall back to the backward compatible behavior of parsing dates and timestamps in CSV and JSON data sources.
- If the config is enabled and the date cannot be parsed, we will fall back to `DateTimeUtils.stringToDate`.
- If the config is enabled and the timestamp cannot be parsed, `DateTimeUtils.stringToTimestamp` will be used.
- Otherwise, depending on the parser policy and a custom pattern, the value will be parsed as null.
### How was this patch tested?
I added unit tests for CSV and JSON to verify the fix and the config option.
Closes #37147 from sadikovi/fix-csv-date-inference.
Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>1 parent 0e33195 commit a930445
File tree
9 files changed
+248
-17
lines changed- docs
- sql
- catalyst/src
- main/scala/org/apache/spark/sql/catalyst
- csv
- json
- test/scala/org/apache/spark/sql/catalyst/csv
- core/src/test/scala/org/apache/spark/sql/execution/datasources
- csv
- json
9 files changed
+248
-17
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
174 | 174 | | |
175 | 175 | | |
176 | 176 | | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
177 | 183 | | |
178 | 184 | | |
179 | 185 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
202 | 202 | | |
203 | 203 | | |
204 | 204 | | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
205 | 211 | | |
206 | 212 | | |
207 | 213 | | |
| |||
Lines changed: 11 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
190 | 190 | | |
191 | 191 | | |
192 | 192 | | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
193 | 204 | | |
194 | 205 | | |
195 | 206 | | |
| |||
Lines changed: 28 additions & 10 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
120 | 120 | | |
121 | 121 | | |
122 | 122 | | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
123 | 137 | | |
124 | 138 | | |
125 | 139 | | |
| |||
205 | 219 | | |
206 | 220 | | |
207 | 221 | | |
208 | | - | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
209 | 226 | | |
210 | 227 | | |
211 | 228 | | |
| |||
217 | 234 | | |
218 | 235 | | |
219 | 236 | | |
220 | | - | |
221 | | - | |
222 | | - | |
223 | | - | |
224 | | - | |
225 | | - | |
226 | | - | |
227 | | - | |
228 | | - | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
229 | 245 | | |
| 246 | + | |
| 247 | + | |
230 | 248 | | |
231 | 249 | | |
232 | 250 | | |
| |||
Lines changed: 11 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
111 | 111 | | |
112 | 112 | | |
113 | 113 | | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
114 | 125 | | |
115 | 126 | | |
116 | 127 | | |
| |||
Lines changed: 22 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
78 | 78 | | |
79 | 79 | | |
80 | 80 | | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
81 | 95 | | |
82 | 96 | | |
83 | 97 | | |
| |||
257 | 271 | | |
258 | 272 | | |
259 | 273 | | |
260 | | - | |
| 274 | + | |
| 275 | + | |
| 276 | + | |
| 277 | + | |
261 | 278 | | |
262 | 279 | | |
263 | 280 | | |
| |||
280 | 297 | | |
281 | 298 | | |
282 | 299 | | |
283 | | - | |
| 300 | + | |
| 301 | + | |
| 302 | + | |
| 303 | + | |
284 | 304 | | |
285 | 305 | | |
286 | 306 | | |
| |||
Lines changed: 16 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
355 | 355 | | |
356 | 356 | | |
357 | 357 | | |
358 | | - | |
359 | | - | |
360 | | - | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
| 361 | + | |
| 362 | + | |
| 363 | + | |
| 364 | + | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
| 372 | + | |
| 373 | + | |
361 | 374 | | |
362 | 375 | | |
363 | 376 | | |
| |||
Lines changed: 74 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
39 | | - | |
| 39 | + | |
40 | 40 | | |
41 | 41 | | |
42 | 42 | | |
| |||
2837 | 2837 | | |
2838 | 2838 | | |
2839 | 2839 | | |
| 2840 | + | |
| 2841 | + | |
| 2842 | + | |
| 2843 | + | |
| 2844 | + | |
| 2845 | + | |
| 2846 | + | |
| 2847 | + | |
| 2848 | + | |
| 2849 | + | |
| 2850 | + | |
| 2851 | + | |
| 2852 | + | |
| 2853 | + | |
| 2854 | + | |
| 2855 | + | |
| 2856 | + | |
| 2857 | + | |
| 2858 | + | |
2840 | 2859 | | |
| 2860 | + | |
| 2861 | + | |
| 2862 | + | |
| 2863 | + | |
| 2864 | + | |
| 2865 | + | |
| 2866 | + | |
| 2867 | + | |
| 2868 | + | |
| 2869 | + | |
| 2870 | + | |
| 2871 | + | |
| 2872 | + | |
| 2873 | + | |
| 2874 | + | |
| 2875 | + | |
| 2876 | + | |
| 2877 | + | |
| 2878 | + | |
| 2879 | + | |
| 2880 | + | |
| 2881 | + | |
| 2882 | + | |
| 2883 | + | |
| 2884 | + | |
| 2885 | + | |
| 2886 | + | |
| 2887 | + | |
| 2888 | + | |
| 2889 | + | |
| 2890 | + | |
| 2891 | + | |
| 2892 | + | |
| 2893 | + | |
| 2894 | + | |
| 2895 | + | |
| 2896 | + | |
| 2897 | + | |
| 2898 | + | |
| 2899 | + | |
| 2900 | + | |
| 2901 | + | |
| 2902 | + | |
| 2903 | + | |
| 2904 | + | |
| 2905 | + | |
| 2906 | + | |
| 2907 | + | |
| 2908 | + | |
| 2909 | + | |
| 2910 | + | |
| 2911 | + | |
| 2912 | + | |
| 2913 | + | |
2841 | 2914 | | |
2842 | 2915 | | |
2843 | 2916 | | |
| |||
Lines changed: 74 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
29 | 29 | | |
30 | 30 | | |
31 | 31 | | |
32 | | - | |
| 32 | + | |
33 | 33 | | |
34 | 34 | | |
35 | 35 | | |
| |||
3249 | 3249 | | |
3250 | 3250 | | |
3251 | 3251 | | |
| 3252 | + | |
| 3253 | + | |
| 3254 | + | |
| 3255 | + | |
| 3256 | + | |
| 3257 | + | |
| 3258 | + | |
| 3259 | + | |
| 3260 | + | |
| 3261 | + | |
| 3262 | + | |
| 3263 | + | |
| 3264 | + | |
| 3265 | + | |
| 3266 | + | |
| 3267 | + | |
| 3268 | + | |
| 3269 | + | |
| 3270 | + | |
| 3271 | + | |
| 3272 | + | |
| 3273 | + | |
| 3274 | + | |
| 3275 | + | |
| 3276 | + | |
| 3277 | + | |
| 3278 | + | |
| 3279 | + | |
| 3280 | + | |
| 3281 | + | |
| 3282 | + | |
| 3283 | + | |
| 3284 | + | |
| 3285 | + | |
| 3286 | + | |
| 3287 | + | |
| 3288 | + | |
| 3289 | + | |
| 3290 | + | |
| 3291 | + | |
| 3292 | + | |
| 3293 | + | |
| 3294 | + | |
| 3295 | + | |
| 3296 | + | |
| 3297 | + | |
| 3298 | + | |
| 3299 | + | |
| 3300 | + | |
| 3301 | + | |
| 3302 | + | |
| 3303 | + | |
| 3304 | + | |
| 3305 | + | |
| 3306 | + | |
| 3307 | + | |
| 3308 | + | |
| 3309 | + | |
| 3310 | + | |
| 3311 | + | |
| 3312 | + | |
| 3313 | + | |
| 3314 | + | |
| 3315 | + | |
| 3316 | + | |
| 3317 | + | |
| 3318 | + | |
| 3319 | + | |
| 3320 | + | |
| 3321 | + | |
| 3322 | + | |
| 3323 | + | |
| 3324 | + | |
3252 | 3325 | | |
3253 | 3326 | | |
3254 | 3327 | | |
| |||
0 commit comments