[SPARK-39731][SQL] Fix issue in CSV and JSON data sources when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy #37147
Conversation
cc @HyukjinKwon @cloud-fan @MaxGekk.

@Jonathancui123 since you took a look around here lately. Mind reviewing this when you find some time?
LGTM - we don't want to use the legacy parser because it will allow bad data when a custom format is used. However, there are some failing tests that should be addressed:

```
[info] - SPARK-36536: use casting when datetime pattern is not set *** FAILED *** (289 milliseconds)
[info] - SPARK-30960: parse date/timestamp string with legacy format *** FAILED *** (73 milliseconds)
```
```scala
check(
  "legacy",
  Seq(
    Row(1, Date.valueOf("2020-01-01"), Timestamp.valueOf("2020-01-01 00:00:00")),
    Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
  )
)
```
```scala
check(
  "corrected",
  Seq(
    Row(1, null, null),
    Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
  )
)
```
For completeness, would you consider adding a check for LEGACY_TIME_PARSER_POLICY = EXCEPTION? Similar to the following?
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala, lines 2598 to 2601 in 1193ce7:
```scala
val msg = intercept[SparkException] {
  csv.collect()
}.getCause.getMessage
assert(msg.contains("Fail to parse"))
```
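A hedged sketch of what such an EXCEPTION-policy check might look like (assuming the suite's `withSQLConf` helper and an input `path` used elsewhere in the suite; the error message is taken from the snippet above):

```scala
// Sketch only: force the EXCEPTION time parser policy and expect a parse failure.
withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> "exception") {
  val csv = spark.read
    .schema("mydate date")
    .option("dateFormat", "yyyyMMdd")
    .csv(path) // `path` is assumed to point at the malformed test input
  val msg = intercept[SparkException] {
    csv.collect()
  }.getCause.getMessage
  assert(msg.contains("Fail to parse"))
}
```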
Done!
Thanks for the reviews. I will address the comments and failing tests and update the PR.
If the legacy behavior is unreasonable, I think we don't have to keep it. If a datetime pattern is specified, we should not fall back to the legacy code path, even if it only supports 4 digits like Spark 2.x.
Can one of the admins verify this patch?
I actually found a similar issue in the JSON data source. I will also address it in this PR and will update the title and description. With test.json:

```scala
val df = spark.read.schema("date date").option("dateFormat", "yyyyMMdd").json("file:/tmp/test.json")
df.show(false)
```

this now returns one result, but before the patch linked in the description it used to show a different one, which is strange either way.
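A plausible reconstruction of the repro (the contents of test.json are an assumption based on this thread):

```scala
// Write an assumed input file with a malformed yyyyMMdd value, then read it back.
import java.nio.file.{Files, Paths}
Files.write(Paths.get("/tmp/test.json"), """{"date": "2020011"}""".getBytes("UTF-8"))

val df = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .json("file:/tmp/test.json")
df.show(false)
// After the fix (CORRECTED policy): the malformed value becomes null.
// Before the fix: the lenient stringToDate treated "2020011" as the year 2020011.
```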
@Jonathancui123 @kamcheungting-db Could you review this PR again? Thanks.
Thanks @cloud-fan!
Jonathancui123 left a comment:
Changes look good to me! I have one question about whether this is a breaking change. Thanks for working on the fix :))
```scala
val err = intercept[IllegalArgumentException] {
  check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
}
assert(err.getMessage.contains("Illegal pattern character: n"))
```
Is this technically a breaking change for users who could previously specify an invalid pattern without LEGACY mode?
Before: the invalid pattern was ignored and values were parsed with DateTimeUtils.stringToTimestamp.
Now: it throws an error.
We don't support invalid patterns, but as a user I would be unhappy to see my code break. I'm unsure whether this is actually considered a breaking change, because this is such an edge case and the user is already doing something invalid. I'm curious to hear your thoughts.
This is a good point. It would be a breaking change for users if they were relying on the compatibility fallback.
There could be an alternative fix; maybe we can look into updating DateTimeUtils.stringToDate, but I am not sure.
I can also add a feature flag to control this behaviour in the JSON and CSV connectors so users can always opt in to the legacy behaviour. For example, I could add a data source option "useLegacyParsing" or something similar. The option would be disabled by default, and the exception would contain a message saying that you can enable the option to maintain the previous behaviour. Maybe this could be a good solution.
Let me know if something like that could work, thanks.
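For illustration, the proposed opt-in might look like this (the option name "useLegacyParsing" is only the suggestion above and hypothetical at this point; the PR later settled on "enableDateTimeParsingFallback"):

```scala
// Hypothetical usage of the suggested opt-in option.
val df = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .option("useLegacyParsing", "true") // opt back into pre-Spark-3.0 parsing
  .csv("file:/tmp/test.csv")
```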
> I can also add a feature flag to control this behaviour in the JSON and CSV connectors so users can always opt in to the legacy behaviour.

I think this should work. It feels weird that users have to opt in to the correct behavior, but hopefully this is a small percentage of users. Maybe @kamcheungting-db or @cloud-fan can weigh in.

> There could be an alternative fix; maybe we can look into updating DateTimeUtils.stringToDate, but I am not sure.

I personally wouldn't be confident updating DateTimeUtils.stringToDate because there are so many usages elsewhere. But if you are familiar with the other use cases of DateTimeUtils.stringToDate, then this could work.

I'll loop back if I think of an alternative.
I think the safest option is to copy-paste the old code of stringToDate before #32959 and use it here, but that's really ugly and hard to maintain.
I'd like to understand more about the invalid pattern behavior. Will we trigger the fallback for every input row? That sounds like a big perf problem...
With the invalid pattern and before this PR, yes, the fallback code would be triggered on every pattern mismatch. With the change, we simply fail to parse those values (an exception or nulls, depending on the parser policy). Yes, it does sound like a performance issue, but it has been there for some time. A sketch of the pre-fix shape follows.
I agree about the copy-paste of stringToDate being ugly; instead, I proposed adding a data source config to keep the old behaviour. What do you think?
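A minimal sketch of the pre-fix fallback shape (plain Scala, not Spark's actual code), illustrating why the lenient second attempt runs once per mismatching row:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Stand-in for the lenient DateTimeUtils.stringToDate, which treats a
// digit-only string as a full year.
def lenientStringToDate(s: String): Option[LocalDate] =
  if (s.nonEmpty && s.forall(_.isDigit)) Some(LocalDate.of(s.toInt, 1, 1)) else None

// Every row that fails the custom pattern pays for a second parse attempt.
def parseWithFallback(s: String, fmt: DateTimeFormatter): Option[LocalDate] =
  try Some(LocalDate.parse(s, fmt))
  catch { case _: Exception => lenientStringToDate(s) } // per-row fallback cost
```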
@Jonathancui123 @cloud-fan I decided to introduce a new config option for JSON and CSV to control this parsing behaviour and have updated the code accordingly. Now users will have a way to enable the backward compatible behaviour if they rely on it. The default is the same as with the initial changes: we will not parse the date/timestamp again if the time parser policy is not legacy and a custom pattern is set.
cc @MaxGekk for a review as you are very familiar with JSON and CSV and date/time parsing 🙂.
@sadikovi Could you re-trigger tests, please.
Yes, sure. Let me do that.
@Jonathancui123 Can you review a447b08? The tests were failing so I disabled the flag, but maybe we need to revisit the test and/or the behaviour of "inferDate" (or confirm it). I am happy to sync offline on this. Thanks.

I've reviewed the changes and written my thoughts here. Depending on our target behavior, we will need to handle these tests differently.
@Jonathancui123 I can fix the inference order in my PR if you like; otherwise, a separate PR may be required to unblock mine.
```scala
checkAnswer(
  output(enableFallback = false),
  Seq(Row(null, null))
)
```
Sorry, I'm a bit confused. Why does date parsing fail? 2020-01-01 is a valid date string.
ah, because the format pattern is given but invalid.
I have re-triggered the failed jobs: Hive - slow tests and TPC-DS with SF=1.
@HyukjinKwon Do you know what could be going wrong with the test failures? I don't quite understand how to figure out why the tests failed. For example, the TPC-DS job fails with an error (https://github.com/sadikovi/spark/runs/7514664578?check_suite_focus=true).
The TPC-DS build can be ignored, I believe.
@cloud-fan @HyukjinKwon the tests passed. If there are no reviews, what else is needed to get this PR merged?
I'll defer to @cloud-fan
Thanks, merging to master!
…tamp parsing behavior

### What changes were proposed in this pull request?

This is a follow-up for [SPARK-39731](https://issues.apache.org/jira/browse/SPARK-39731) and PR #37147. I found that it could be problematic to change `spark.sql.legacy.timeParserPolicy` to LEGACY when inferring dates and timestamps in CSV and JSON. Sometimes it is beneficial to have the time parser policy as CORRECTED but still use a more lenient date and timestamp inference (or when migrating to a newer Spark version).

I added two separate configs that control this behavior:
- `spark.sql.legacy.csv.enableDateTimeParsingFallback`
- `spark.sql.legacy.json.enableDateTimeParsingFallback`

When the configs are set to `true`, the legacy time parsing behaviour (pre Spark 3.0) is enabled.

With this PR, the precedence order is as follows for CSV (similar for JSON):
- data source option `enableDateTimeParsingFallback`;
- if that is not set, check `spark.sql.legacy.{csv,json}.enableDateTimeParsingFallback`;
- if that is not set, check `spark.sql.legacy.timeParserPolicy` and whether or not a custom format is used.

### Why are the changes needed?

The change makes it easier for users to migrate to a newer Spark version without changing the global config `spark.sql.legacy.timeParserPolicy`. It also allows enabling legacy parsing for CSV and JSON separately, without changing the code or the global time parser config.

### Does this PR introduce _any_ user-facing change?

No, it simply adds the ability to change the behaviour specifically for CSV or JSON.

### How was this patch tested?

I added a unit test for CSV and JSON to verify the flag.

Closes #37653 from sadikovi/SPARK-40215.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
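The precedence described in the follow-up above can be sketched as follows (illustrative Scala; the function and parameter names are assumptions, not Spark's internal API):

```scala
// Resolve the fallback flag: data source option first, then the per-format
// SQL conf, then the global time parser policy / custom-format check.
def resolveDateTimeParsingFallback(
    dsOption: Option[Boolean],    // enableDateTimeParsingFallback read option
    legacyConf: Option[Boolean],  // spark.sql.legacy.{csv,json}.enableDateTimeParsingFallback
    policyIsLegacy: Boolean,      // spark.sql.legacy.timeParserPolicy == LEGACY
    hasCustomPattern: Boolean): Boolean =
  dsOption
    .orElse(legacyConf)
    .getOrElse(policyIsLegacy || !hasCustomPattern)
```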
### What changes were proposed in this pull request?
This PR fixes a correctness issue when reading a CSV or a JSON file with dates in "yyyyMMdd" format. Prior to #32959, reading such a file would return `null` for a malformed value such as 2020011. However, after that patch, the invalid date is parsed because of the much more lenient parsing in `DateTimeUtils.stringToDate`: the method treats 2020011 as a full year. A similar result would be observed in JSON. A reconstruction is sketched below.
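A reconstruction of the repro (the CSV contents are an assumption, inferred from the tests in this PR):

```scala
// Assumed test.csv contents:
//   1,2020011
//   2,20201203
val df = spark.read
  .schema("id int, mydate date")
  .option("dateFormat", "yyyyMMdd")
  .csv("file:/tmp/test.csv")
// CORRECTED policy after this fix: row 1 -> null (2020011 does not match yyyyMMdd),
// row 2 -> 2020-12-03. Before the fix, the lenient stringToDate parsed 2020011 as
// the year 2020011.
```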
This PR attempts to address the correctness issue by introducing a new configuration option `enableDateTimeParsingFallback`, which allows enabling/disabling the backward compatible parsing. Currently, by default, we fall back to the backward compatible behavior only if the parser policy is legacy and no custom pattern was set (this is defined in `UnivocityParser` and `JacksonParser` for CSV and JSON respectively).
UnivocityParserandJacksonParserfor csv and json respectively).Why are the changes needed?
Fixes a correctness issue in Spark 3.4.
### Does this PR introduce any user-facing change?
In order to avoid correctness issues when reading CSV or JSON files with a custom pattern, a new configuration option `enableDateTimeParsingFallback` has been added to control whether or not the code falls back to the backward compatible behavior of parsing dates and timestamps in CSV and JSON data sources. When the fallback is enabled, dates are parsed with `DateTimeUtils.stringToDate` and timestamps with `DateTimeUtils.stringToTimestamp`. A usage sketch follows.
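A hedged usage sketch of the new option (the read pattern and values here are illustrative):

```scala
// Default behaviour with a custom pattern and a non-legacy policy: no fallback,
// so malformed yyyyMMdd values become null under CORRECTED.
val strict = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "false")
  .json("file:/tmp/test.json")

// Opting back into the backward compatible (pre Spark 3.0) parsing:
val lenient = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "true")
  .json("file:/tmp/test.json")
```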
### How was this patch tested?

I added unit tests for CSV and JSON to verify the fix and the config option.