
Conversation

@sadikovi (Contributor) commented Jul 11, 2022

What changes were proposed in this pull request?

This PR fixes a correctness issue when reading a CSV or a JSON file with dates in "yyyyMMdd" format:

name,mydate
1,2020011
2,20201203

or

{"date": "2020011"}
{"date": "20201203"} 

Prior to #32959, reading this CSV file would return:

+----+--------------+
|name|mydate        |
+----+--------------+
|1   |null          |
|2   |2020-12-03    |
+----+--------------+

However, after that patch, the invalid date is parsed because of the much more lenient parsing in DateTimeUtils.stringToDate: the method treats 2020011 as a full year:

+----+--------------+
|name|mydate        |
+----+--------------+
|1   |+2020011-01-01|
|2   |2020-12-03    |
+----+--------------+

A similar result is observed in JSON.
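The lenient behaviour can also be reproduced with a plain cast, which exercises the same DateTimeUtils.stringToDate parsing (a sketch; assuming Spark 3.3+ behaviour after #32959, the exact rendering of the year may vary):

spark.sql("SELECT CAST('2020011' AS DATE)").show(false)
// prints something like +2020011-01-01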

This PR addresses the correctness issue by introducing a new configuration option, enableDateTimeParsingFallback, which allows users to enable or disable the backward-compatible parsing.

Currently, by default, we fall back to the backward-compatible behavior only if the parser policy is legacy and no custom pattern was set (this is defined in UnivocityParser and JacksonParser for CSV and JSON respectively).

Why are the changes needed?

Fixes a correctness issue in Spark 3.4.

Does this PR introduce any user-facing change?

To avoid correctness issues when reading CSV or JSON files with a custom pattern, a new configuration option, enableDateTimeParsingFallback, has been added to control whether the code falls back to the backward-compatible behavior of parsing dates and timestamps in the CSV and JSON data sources.

  • If the config is enabled and the date cannot be parsed, we will fall back to DateTimeUtils.stringToDate.
  • If the config is enabled and the timestamp cannot be parsed, DateTimeUtils.stringToTimestamp will be used.
  • Otherwise, depending on the parser policy and whether a custom pattern was set, the value will be parsed as null.
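For example, a user can opt back in to the old behaviour per read via the new option (a sketch reusing the CSV example above; the option name comes from this PR, the path and schema are assumptions):

val df = spark.read
  .option("header", "true")
  .schema("name INT, mydate DATE")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "true")  // re-enable the fallback
  .csv("file:/tmp/test.csv")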

How was this patch tested?

I added unit tests for CSV and JSON to verify the fix and the config option.

github-actions bot added the SQL label on Jul 11, 2022
@sadikovi (Contributor, Author) commented Jul 11, 2022

cc @HyukjinKwon @cloud-fan @MaxGekk.
I am not very familiar with UnivocityParser so I would appreciate your reviews, thanks.

@sadikovi changed the title [SPARK-39731] Fix issue in CSV data source when parsing date in "yyyyMMdd" format with CORRECTED time parser policy → [SPARK-39731][SQL] Fix issue in CSV data source when parsing date in "yyyyMMdd" format with CORRECTED time parser policy on Jul 11, 2022
@sadikovi changed the title [SPARK-39731][SQL] Fix issue in CSV data source when parsing date in "yyyyMMdd" format with CORRECTED time parser policy → [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy on Jul 11, 2022
@HyukjinKwon (Member) commented:

@Jonathancui123 since you took a look around here lately. Mind reviewing this when you find some time?

@Jonathancui123 (Contributor) left a comment:

LGTM - we don't want to use the legacy parser because it will allow bad data when a custom format is used. However, there are some failing tests that should be addressed:

[info] - SPARK-36536: use casting when datetime pattern is not set *** FAILED *** (289 milliseconds)
[info] - SPARK-30960: parse date/timestamp string with legacy format *** FAILED *** (73 milliseconds)

Comment on lines +2815 to +2828
check(
  "legacy",
  Seq(
    Row(1, Date.valueOf("2020-01-01"), Timestamp.valueOf("2020-01-01 00:00:00")),
    Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
  )
)

check(
  "corrected",
  Seq(
    Row(1, null, null),
    Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
  )
)
A contributor commented:

For completeness, would you consider adding a check for LEGACY_TIME_PARSER_POLICY = EXCEPTION? Similar to the following?

val msg = intercept[SparkException] {
  csv.collect()
}.getCause.getMessage
assert(msg.contains("Fail to parse"))

@sadikovi (Contributor, Author) replied:

Done!

@sadikovi (Contributor, Author) commented:

Thanks for the reviews. I will address the comments and failing tests and update the PR.
My question was whether there are any concerns with this change and whether users might experience compatibility issues. I would appreciate some thoughts on this.

@cloud-fan (Contributor) commented:

If the legacy behavior is unreasonable, I think we don't have to keep it. If a datetime pattern is specified, we should not fall back to the legacy code path, even if it only supports 4 digits like Spark 2.x.

@AmplabJenkins commented:

Can one of the admins verify this patch?

@sadikovi (Contributor, Author) commented Jul 12, 2022

I actually found a similar issue in the JSON data source. I will also address it in this PR and update the title and description:

test.json:

{"date": "2020011"}
{"date": "20201203"}

val df = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .json("file:/tmp/test.json")
df.show(false)

returns

+--------------+
|date          |
+--------------+
|+2020011-01-01|
|2020-12-03    |
+--------------+

but before the patch linked in the description it used to show:

+----------+
|date      |
+----------+
|7500-08-09|
|2020-12-03|
+----------+

which is strange either way.

@sadikovi changed the title [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy → [SPARK-39731][SQL] Fix issue in CSV and JSON data sources when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy on Jul 14, 2022
@sadikovi (Contributor, Author) commented:

@Jonathancui123 @kamcheungting-db Could you review this PR again? Thanks.

@sadikovi (Contributor, Author) commented:

Thanks @cloud-fan!

@Jonathancui123 (Contributor) left a comment:

Changes look good to me! I have one question about whether this is a breaking change. Thanks for working on the fix :))

Comment on lines 366 to 369
val err = intercept[IllegalArgumentException] {
  check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
}
assert(err.getMessage.contains("Illegal pattern character: n"))
@Jonathancui123 (Contributor) commented Jul 14, 2022:

Is this technically a breaking change for users who could previously specify an invalid pattern without LEGACY mode?

Before -- ignore the invalid pattern and parse with DateTimeUtils.stringToTimestamp
Now -- it throws an error

We don't support invalid patterns but as a user I would be unhappy to see my code break. I'm unsure if this is actually considered a breaking change because this is such an edge case and the user is already doing something invalid. I'm curious to hear your thoughts.

@sadikovi (Contributor, Author) replied:

This is a good point. It would be a breaking change for users if they were relying on the compatibility fallback.
There could be an alternative fix: maybe we can look into updating DateTimeUtils.stringToDate, but I am not sure.

I can also add a feature flag to control this behaviour in the JSON and CSV connectors so users can always opt in to the legacy behaviour. For example, I could add a data source option "useLegacyParsing" or something similar. The option would be disabled by default, and the exception would contain a message saying that you can enable the option to maintain the previous behaviour. Maybe this could be a good solution.

Let me know if something like that could work, thanks.

@Jonathancui123 (Contributor) replied:

> I can also add a feature flag to control this behaviour in the JSON and CSV connectors so users can always opt in to the legacy behaviour.

I think this should work. It feels weird that users have to opt-in to the correct behavior but hopefully this is a small percentage of users. Maybe @kamcheungting-db or @cloud-fan can weigh in.

> There could be an alternative fix: maybe we can look into updating DateTimeUtils.stringToDate, but I am not sure.

I personally wouldn't be confident updating DateTimeUtils.stringToDate because there are so many usages elsewhere. But if you are familiar with the other use cases of DateTimeUtils.stringToDate then this could work.

I'll loop back if I think of an alternative.

A contributor commented:

I think the safest option is to copy-paste the old code of stringToDate before #32959 and use it here, but that's really ugly and hard to maintain.

I'd like to understand more about the invalid pattern behavior. Will we trigger the fallback for every input row? That sounds like a big perf problem...

@sadikovi (Contributor, Author) replied:

With the invalid pattern and before this PR, yes, the fallback code would be triggered on every pattern mismatch. With the change, we will instead throw an exception or parse those values as null, depending on the parser policy. Yes, it does sound like a performance issue, but it has been there for some time.

I agree that copy-pasting stringToDate would be ugly; instead, I proposed adding a data source config to keep the old behaviour. What do you think?

github-actions bot added the DOCS label on Jul 18, 2022
@sadikovi requested a review from cloud-fan on July 18, 2022 at 00:43
@sadikovi (Contributor, Author) commented:

@Jonathancui123 @cloud-fan I decided to introduce a new config option for JSON and CSV to control this parsing behaviour and have updated the code accordingly.

Now users have a way to enable the backward-compatible behaviour if they rely on it. The default is the same as in the initial changes: we will not re-parse the date/timestamp if the time parser policy is not legacy and a custom pattern is set.
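As a self-contained sketch of that default decision (the names below are illustrative, not the exact internals of UnivocityParser/JacksonParser):

// Sketch: the fallback is taken from the data source option if set;
// otherwise it is enabled only under the legacy policy or when no
// custom pattern is given.
def enableParsingFallback(
    userOption: Option[Boolean],   // enableDateTimeParsingFallback, if set
    policyIsLegacy: Boolean,       // spark.sql.legacy.timeParserPolicy == LEGACY
    customPatternSet: Boolean): Boolean =
  userOption.getOrElse(policyIsLegacy || !customPatternSet)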

@sadikovi (Contributor, Author) commented:

cc @MaxGekk for a review as you are very familiar with JSON and CSV and date/time parsing 🙂.

@MaxGekk (Member) commented Jul 21, 2022

@sadikovi Could you re-trigger the tests, please?

@sadikovi (Contributor, Author) replied:

Yes, sure. Let me do that.

@sadikovi (Contributor, Author) commented:

@Jonathancui123 Can you review a447b08? The tests were failing so I disabled the flag but maybe we need to revisit the test and/or behaviour of "inferDate" (or confirm it). I am happy to sync offline on this. Thanks.

@Jonathancui123 (Contributor) commented Jul 22, 2022

> @Jonathancui123 Can you review a447b08?

I've reviewed the changes and written my thoughts here. Depending on our target behavior, we will need to handle these tests differently.

@sadikovi (Contributor, Author) commented Jul 24, 2022

@Jonathancui123 I can fix the inference order in my PR if you like; otherwise, a separate PR may be required to unblock mine.


checkAnswer(
  output(enableFallback = false),
  Seq(Row(null, null))
)
A contributor commented:

Sorry, I'm a bit confused. Why does date parsing fail? 2020-01-01 is a valid date string.

The same contributor followed up:

Ah, because the format pattern is given but invalid.

@sadikovi (Contributor, Author) commented Jul 26, 2022

I have re-triggered the failed jobs: Hive - slow tests and TPC-DS with SF=1.

@sadikovi (Contributor, Author) commented Jul 26, 2022

@HyukjinKwon Do you know what could be going wrong with the test failures? I don't quite understand how to figure out why the tests failed. For example, I get the following error for the TPC-DS job (https://github.com/sadikovi/spark/runs/7514664578?check_suite_focus=true):

[info] - q69 (6 seconds, 699 milliseconds)
[info] - q70 (7 seconds, 678 milliseconds)
[info] - q71 (5 seconds, 279 milliseconds)
/home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  2375 Killed                  "$@"
Error: Process completed with exit code 137.

@HyukjinKwon (Member) replied:

The TPC-DS build can be ignored, I believe.

@sadikovi (Contributor, Author) commented:

@cloud-fan @HyukjinKwon The tests passed. If there are no further review comments, what else is needed to get this PR merged?

@HyukjinKwon (Member) replied:

I'll defer to @cloud-fan.

@cloud-fan (Contributor) commented:

Thanks, merging to master!

@cloud-fan closed this in a930445 on Jul 27, 2022
MaxGekk pushed a commit that referenced this pull request Aug 26, 2022
…tamp parsing behavior

### What changes were proposed in this pull request?

This is a follow-up for [SPARK-39731](https://issues.apache.org/jira/browse/SPARK-39731) and PR #37147.

I found that it could be problematic to change `spark.sql.legacy.timeParserPolicy` to LEGACY when inferring dates and timestamps in CSV and JSON. Sometimes it is beneficial to have the time parser policy as CORRECTED but still use a more lenient date and timestamp inference (or when migrating to a newer Spark version).

I added two separate configs that control this behavior:
- `spark.sql.legacy.csv.enableDateTimeParsingFallback`
- `spark.sql.legacy.json.enableDateTimeParsingFallback`

When the configs are set to `true`, the legacy time parsing behaviour is enabled (pre Spark 3.0).

With this PR, the precedence order is as follows for CSV (similar for JSON):
- data source option `enableDateTimeParsingFallback`
- if that is not set, check `spark.sql.legacy.{csv,json}.enableDateTimeParsingFallback`
- if that is not set, check `spark.sql.legacy.timeParserPolicy` and whether or not a custom format is used.
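For example, legacy parsing can be enabled for JSON reads only, without touching the global policy (a sketch assuming an active SparkSession):

// Enable the pre-Spark-3.0 fallback for JSON only; CSV keeps the new behaviour.
spark.conf.set("spark.sql.legacy.json.enableDateTimeParsingFallback", "true")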

### Why are the changes needed?

The change makes it easier for users to migrate to a newer Spark version without changing the global config `spark.sql.legacy.timeParserPolicy`. It also allows enabling legacy parsing for CSV and JSON separately, without changing the code or the global time parser config.

### Does this PR introduce _any_ user-facing change?

No, it simply adds the ability to change the behaviour specifically for CSV or JSON.

### How was this patch tested?

I added a unit test for CSV and JSON to verify the flag.

Closes #37653 from sadikovi/SPARK-40215.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Max Gekk <[email protected]>