
Conversation

@sadikovi (Contributor) commented Jul 11, 2022

What changes were proposed in this pull request?

This PR fixes a correctness issue when reading a CSV or a JSON file with dates in "yyyyMMdd" format:

name,mydate
1,2020011
2,20201203

or

{"date": "2020011"}
{"date": "20201203"} 

Prior to #32959, reading this CSV file would return:

+----+--------------+
|name|mydate        |
+----+--------------+
|1   |null          |
|2   |2020-12-03    |
+----+--------------+

However, after that patch, the invalid date is parsed because of the much more lenient parsing in DateTimeUtils.stringToDate: the method treats 2020011 as a full year:

+----+--------------+
|name|mydate        |
+----+--------------+
|1   |+2020011-01-01|
|2   |2020-12-03    |
+----+--------------+

A similar result is observed in JSON.
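The lenient behaviour can also be reproduced with a plain cast, which exercises the same DateTimeUtils.stringToDate parsing (a sketch; assuming Spark 3.3+ behaviour after #32959, the exact rendering of the year may vary):

spark.sql("SELECT CAST('2020011' AS DATE)").show(false)
// prints something like +2020011-01-01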

This PR addresses the correctness issue by introducing a new configuration option, enableDateTimeParsingFallback, which allows users to enable or disable the backward-compatible parsing.

Currently, by default, we fall back to the backward-compatible behavior only if the parser policy is legacy and no custom pattern was set (this is defined in UnivocityParser and JacksonParser for CSV and JSON respectively).

Why are the changes needed?

Fixes a correctness issue in Spark 3.4.

Does this PR introduce any user-facing change?

To avoid correctness issues when reading CSV or JSON files with a custom pattern, a new configuration option, enableDateTimeParsingFallback, has been added to control whether the code falls back to the backward-compatible behavior of parsing dates and timestamps in the CSV and JSON data sources.

  • If the config is enabled and the date cannot be parsed, we will fall back to DateTimeUtils.stringToDate.
  • If the config is enabled and the timestamp cannot be parsed, DateTimeUtils.stringToTimestamp will be used.
  • Otherwise, depending on the parser policy and whether a custom pattern was set, the value will be parsed as null.
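For example, a user can opt back in to the old behaviour per read via the new option (a sketch reusing the CSV example above; the option name comes from this PR, the path and schema are assumptions):

val df = spark.read
  .option("header", "true")
  .schema("name INT, mydate DATE")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "true")  // re-enable the fallback
  .csv("file:/tmp/test.csv")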

How was this patch tested?

I added unit tests for CSV and JSON to verify the fix and the config option.

github-actions bot added the SQL label on Jul 11, 2022
@sadikovi (Contributor, Author) commented Jul 11, 2022

cc @HyukjinKwon @cloud-fan @MaxGekk.
I am not very familiar with UnivocityParser so I would appreciate your reviews, thanks.

@sadikovi changed the title [SPARK-39731] Fix issue in CSV data source when parsing date in "yyyyMMdd" format with CORRECTED time parser policy → [SPARK-39731][SQL] Fix issue in CSV data source when parsing date in "yyyyMMdd" format with CORRECTED time parser policy on Jul 11, 2022
@sadikovi changed the title [SPARK-39731][SQL] Fix issue in CSV data source when parsing date in "yyyyMMdd" format with CORRECTED time parser policy → [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy on Jul 11, 2022
@HyukjinKwon (Member) commented:

@Jonathancui123 since you took a look around here lately. Mind reviewing this when you find some time?

@Jonathancui123 (Contributor) left a comment:

LGTM - we don't want to use the legacy parser because it will allow bad data when a custom format is used. However, there are some failing tests that should be addressed:

[info] - SPARK-36536: use casting when datetime pattern is not set *** FAILED *** (289 milliseconds)
[info] - SPARK-30960: parse date/timestamp string with legacy format *** FAILED *** (73 milliseconds)

Comment on lines +2815 to +2828
check(
  "legacy",
  Seq(
    Row(1, Date.valueOf("2020-01-01"), Timestamp.valueOf("2020-01-01 00:00:00")),
    Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
  )
)

check(
  "corrected",
  Seq(
    Row(1, null, null),
    Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
  )
)
A contributor commented:

For completeness, would you consider adding a check for LEGACY_TIME_PARSER_POLICY = EXCEPTION? Similar to the following?

val msg = intercept[SparkException] {
  csv.collect()
}.getCause.getMessage
assert(msg.contains("Fail to parse"))

@sadikovi (Contributor, Author) replied:

Done!

@sadikovi (Contributor, Author) commented:

Thanks for the reviews. I will address the comments and failing tests and update the PR.
My question was whether there are any concerns with this change and whether users might experience compatibility issues. I would appreciate some thoughts on this.

@cloud-fan (Contributor) commented:

If the legacy behavior is unreasonable, I think we don't have to keep it. If a datetime pattern is specified, we should not fall back to the legacy code path, even if it only supports 4 digits like Spark 2.x.

@AmplabJenkins commented:

Can one of the admins verify this patch?

@sadikovi (Contributor, Author) commented Jul 12, 2022

I actually found a similar issue in the JSON data source. I will also address it in this PR and update the title and description:

test.json:

{"date": "2020011"}
{"date": "20201203"}

val df = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .json("file:/tmp/test.json")
df.show(false)

returns

+--------------+
|date          |
+--------------+
|+2020011-01-01|
|2020-12-03    |
+--------------+

but before the patch linked in the description it used to show:

+----------+
|date      |
+----------+
|7500-08-09|
|2020-12-03|
+----------+

which is strange either way.

@sadikovi changed the title [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy → [SPARK-39731][SQL] Fix issue in CSV and JSON data sources when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy on Jul 14, 2022
@sadikovi (Contributor, Author) commented:

@Jonathancui123 @kamcheungting-db Could you review this PR again? Thanks.

@sadikovi (Contributor, Author) commented:

Thanks @cloud-fan!

@Jonathancui123 (Contributor) left a comment:

Changes look good to me! I have one question about whether this is a breaking change. Thanks for working on the fix :))

Comment on lines 366 to 369
val err = intercept[IllegalArgumentException] {
  check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
}
assert(err.getMessage.contains("Illegal pattern character: n"))
@Jonathancui123 (Contributor) commented Jul 14, 2022:

Is this technically a breaking change for users who could previously specify an invalid pattern without LEGACY mode?

Before -- ignore the invalid pattern and parse with DateTimeUtils.stringToTimestamp
Now -- it throws an error

We don't support invalid patterns but as a user I would be unhappy to see my code break. I'm unsure if this is actually considered a breaking change because this is such an edge case and the user is already doing something invalid. I'm curious to hear your thoughts.

@sadikovi (Contributor, Author) replied:

This is a good point. It would be a breaking change for users if they were relying on the compatibility fallback.
There could be an alternative fix: maybe we can look into updating DateTimeUtils.stringToDate, but I am not sure.

I can also add a feature flag to control this behaviour in the JSON and CSV connectors so users can always opt in to the legacy behaviour. For example, I could add a data source option "useLegacyParsing" or something similar. The option would be disabled by default, and the exception would contain a message saying that you can enable the option to maintain the previous behaviour. Maybe this could be a good solution.

Let me know if something like that could work, thanks.

@Jonathancui123 (Contributor) replied:

> I can also add a feature flag to control this behaviour in the JSON and CSV connectors so users can always opt in to the legacy behaviour.

I think this should work. It feels weird that users have to opt-in to the correct behavior but hopefully this is a small percentage of users. Maybe @kamcheungting-db or @cloud-fan can weigh in.

> There could be an alternative fix: maybe we can look into updating DateTimeUtils.stringToDate, but I am not sure.

I personally wouldn't be confident updating DateTimeUtils.stringToDate because there are so many usages elsewhere. But if you are familiar with the other use cases of DateTimeUtils.stringToDate then this could work.

I'll loop back if I think of an alternative.

A contributor commented:

I think the safest option is to copy-paste the old code of stringToDate before #32959 and use it here, but that's really ugly and hard to maintain.

I'd like to understand more about the invalid pattern behavior. Will we trigger the fallback for every input row? That sounds like a big perf problem...

@sadikovi (Contributor, Author) replied:

With the invalid pattern and before this PR, yes, the fallback code would be triggered on every pattern mismatch. With the change, we will instead throw an exception or parse those values as null, depending on the parser policy. Yes, it does sound like a performance issue, but it has been there for some time.

I agree that copy-pasting stringToDate would be ugly; instead, I proposed adding a data source config to keep the old behaviour. What do you think?

github-actions bot added the DOCS label on Jul 18, 2022
@sadikovi requested a review from cloud-fan on July 18, 2022 at 00:43
@sadikovi (Contributor, Author) commented:

@Jonathancui123 @cloud-fan I decided to introduce a new config option for JSON and CSV to control this parsing behaviour and have updated the code accordingly.

Now users have a way to enable the backward-compatible behaviour if they rely on it. The default is the same as in the initial changes: we will not re-parse the date/timestamp if the time parser policy is not legacy and a custom pattern is set.
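As a self-contained sketch of that default decision (the names below are illustrative, not the exact internals of UnivocityParser/JacksonParser):

// Sketch: the fallback is taken from the data source option if set;
// otherwise it is enabled only under the legacy policy or when no
// custom pattern is given.
def enableParsingFallback(
    userOption: Option[Boolean],   // enableDateTimeParsingFallback, if set
    policyIsLegacy: Boolean,       // spark.sql.legacy.timeParserPolicy == LEGACY
    customPatternSet: Boolean): Boolean =
  userOption.getOrElse(policyIsLegacy || !customPatternSet)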

@sadikovi (Contributor, Author) commented:

cc @MaxGekk for a review as you are very familiar with JSON and CSV and date/time parsing 🙂.

@MaxGekk (Member) commented Jul 21, 2022

@sadikovi Could you re-trigger the tests, please?

@sadikovi (Contributor, Author) replied:

Yes, sure. Let me do that.

@sadikovi (Contributor, Author) commented:

@Jonathancui123 Can you review a447b08? The tests were failing so I disabled the flag but maybe we need to revisit the test and/or behaviour of "inferDate" (or confirm it). I am happy to sync offline on this. Thanks.

@Jonathancui123 (Contributor) commented Jul 22, 2022

> @Jonathancui123 Can you review a447b08?

I've reviewed the changes and written my thoughts here. Depending on our target behavior, we will need to handle these tests differently.

@sadikovi (Contributor, Author) commented Jul 24, 2022

@Jonathancui123 I can fix the inference order in my PR if you like; otherwise, a separate PR may be required to unblock mine.


checkAnswer(
  output(enableFallback = false),
  Seq(Row(null, null))
)
A contributor commented:

Sorry, I'm a bit confused. Why does date parsing fail? 2020-01-01 is a valid date string.

The same contributor followed up:

Ah, because the format pattern is given but invalid.

@sadikovi (Contributor, Author) commented Jul 26, 2022

I have re-triggered the failed jobs: Hive - slow tests and TPC-DS with SF=1.

@sadikovi (Contributor, Author) commented Jul 26, 2022

@HyukjinKwon Do you know what could be going wrong with the test failures? I don't quite understand how to figure out why the tests failed. For example, I get the following error for the TPC-DS job (https://github.com/sadikovi/spark/runs/7514664578?check_suite_focus=true):

[info] - q69 (6 seconds, 699 milliseconds)
[info] - q70 (7 seconds, 678 milliseconds)
[info] - q71 (5 seconds, 279 milliseconds)
/home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  2375 Killed                  "$@"
Error: Process completed with exit code 137.

@HyukjinKwon (Member) replied:

The TPC-DS build can be ignored, I believe.

@sadikovi (Contributor, Author) commented:

@cloud-fan @HyukjinKwon The tests passed. If there are no further review comments, what else is needed to get this PR merged?

@HyukjinKwon (Member) replied:

I'll defer to @cloud-fan.

@cloud-fan (Contributor) commented:

Thanks, merging to master!

@cloud-fan closed this in a930445 on Jul 27, 2022
MaxGekk pushed a commit that referenced this pull request Aug 26, 2022
…tamp parsing behavior

### What changes were proposed in this pull request?

This is a follow-up for [SPARK-39731](https://issues.apache.org/jira/browse/SPARK-39731) and PR #37147.

I found that it could be problematic to change `spark.sql.legacy.timeParserPolicy` to LEGACY when inferring dates and timestamps in CSV and JSON. Sometimes it is beneficial to have the time parser policy as CORRECTED but still use a more lenient date and timestamp inference (or when migrating to a newer Spark version).

I added two separate configs that control this behavior:
- `spark.sql.legacy.csv.enableDateTimeParsingFallback`
- `spark.sql.legacy.json.enableDateTimeParsingFallback`

When the configs are set to `true`, the legacy time parsing behaviour is enabled (pre Spark 3.0).

With this PR, the precedence order is as follows for CSV (similar for JSON):
- data source option `enableDateTimeParsingFallback`
- if that is not set, check `spark.sql.legacy.{csv,json}.enableDateTimeParsingFallback`
- if that is not set, check `spark.sql.legacy.timeParserPolicy` and whether or not a custom format is used.
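For example, legacy parsing can be enabled for JSON reads only, without touching the global policy (a sketch assuming an active SparkSession):

// Enable the pre-Spark-3.0 fallback for JSON only; CSV keeps the new behaviour.
spark.conf.set("spark.sql.legacy.json.enableDateTimeParsingFallback", "true")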

### Why are the changes needed?

The change makes it easier for users to migrate to a newer Spark version without changing the global config `spark.sql.legacy.timeParserPolicy`. It also allows enabling legacy parsing for CSV and JSON separately, without changing the code or the global time parser config.

### Does this PR introduce _any_ user-facing change?

No, it simply adds the ability to change the behaviour specifically for CSV or JSON.

### How was this patch tested?

I added a unit test for CSV and JSON to verify the flag.

Closes #37653 from sadikovi/SPARK-40215.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Max Gekk <[email protected]>