[SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps #37933
Conversation
…CSV schema inference (force-pushed from ac5bd1d to 0394030)
HyukjinKwon left a comment:
I'm okay w/ this change.
sadikovi left a comment:
Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead.
Is it correct that the date inference is still controlled by "prefersDate"?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
Sure!
Let me re-review the change to use ISO8601 parsing only.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
} else {
  parameters.get("timestampFormat")
// Use Iso8601TimestampFormatter (with strict timestamp parsing) to
// avoid parsing dates in timestamp columns as timestamp type
Why do we need this change? When inferring the schema, we always try to infer as date first, then timestamp, right?
There are two types of timestampFormatter: Default and Iso8601.
- The default formatter uses the same parsing logic as CAST, which can parse a date value as a timestamp.
- The Iso8601 formatter has stricter parsing.
- If no timestampFormat is given, we will use the default one.

In the case of a column containing timestamp values first and dates following, if we use the default formatter, it would still parse the dates as timestamps and infer the column as timestamp type, which is inconsistent with what we are proposing now. It's pretty similar to the legacy parser problem.
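As a rough illustration of this difference (a standalone sketch using plain java.time, not Spark's actual formatter classes; the pattern and the midnight fallback are assumptions), a strict pattern-based parser rejects a bare date string, while a CAST-like lenient parser accepts it:

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object LenientVsStrict {
  // Strict timestamp parsing: a bare date string fails the full pattern.
  private val strictFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  def parseStrict(s: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(s, strictFmt)).toOption

  // Lenient, CAST-like parsing: if the timestamp pattern fails,
  // accept a bare date and read it as midnight of that day.
  def parseLenient(s: String): Option[LocalDateTime] =
    parseStrict(s).orElse(Try(LocalDate.parse(s).atStartOfDay()).toOption)
}
```

Here `parseStrict("2022-01-01")` yields None while `parseLenient("2022-01-01")` yields a timestamp at midnight, which is why a timestamp column that later contains dates can still be inferred as timestamp under the lenient formatter.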
I see, but the CAST code is supposed to be more efficient than the formatter. Do we have some perf numbers? Is there a slowdown in schema inference for CSV files containing only timestamp values?
Let me think about it as well, as there are different fallbacks with regard to date/timestamp parsing. IMHO, we would want some kind of flag to revert to the original behaviour, but that would be a lot of flags to configure.
It is possible that some users are relying on more lenient parsing of timestamps, so if we switch to ISO8601 only, some jobs might need to be updated. Let me think a bit more on this.
Totally agree with your concerns @cloud-fan @sadikovi.
After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
- If the user provides a timestampFormat/timestampNTZFormat, we will strictly parse fields as timestamps according to the format. Thus, columns with mixed dates and timestamps will always be inferred as StringType.
- If no timestampFormat/timestampNTZFormat is specified by the user, for a column with mixed dates and timestamps:
  - If date values come before timestamp values:
    - If prefersDate=true, the column will be inferred as StringType.
    - Otherwise, the column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestampFormatter.
  - If timestamp values come before date values:
    - The column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestampFormatter.

There is no behavior change when prefersDate=false.
Does this make sense to you? @sadikovi @cloud-fan
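The behavior listed above can be condensed into a small decision function. This is an illustrative sketch only; the names and the string-valued result are assumptions, not Spark's actual API:

```scala
// Sketch of the inference outcome for a column mixing dates and timestamps.
// All names here are illustrative, not Spark's actual API.
object MixedColumnInference {
  def inferredType(
      userGaveTimestampFormat: Boolean,  // timestampFormat/timestampNTZFormat set?
      prefersDate: Boolean,
      datesFirst: Boolean,               // do date values precede timestamp values?
      lenientParsesDateFormat: Boolean): String =
    if (userGaveTimestampFormat) "StringType"       // strict parsing: mixed column -> string
    else if (datesFirst && prefersDate) "StringType"
    else if (lenientParsesDateFormat) "TimestampType"
    else "StringType"
}
```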
Made some follow-up changes, please check the updated description for the behavior after changes and semantics.
…ce properly when lenient timestamp formatter is used
sadikovi left a comment:
Looks good, thanks for making the changes.
I left a few nit comments and questions, would appreciate it if you could take a look. Thanks.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 */
private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
  TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
  (t1, t2) match {
Should this match be in findCompatibleTypeForCSV? Or does findTightestCommonType merge DateType and TimestampType in a way that is not applicable here?
Yes, findTightestCommonType merges DateType and TimestampType in a way that is not applicable here.
What result does findTightestCommonType return for DateType and TimestampType?
(d1, d2) match {
case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
TimestampType
case (_: TimestampType, _: TimestampNTZType) | (_: TimestampNTZType, _: TimestampType) =>
TimestampType
case (_: TimestampNTZType, _: DateType) | (_: DateType, _: TimestampNTZType) =>
TimestampNTZType
}
Never mind, I checked the code, the resulting type will be TimestampType or TimestampNTZ.
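A toy model of the widening quoted above (a sketch over stand-in case objects, not TypeCoercion itself): the default merge widens Date with Timestamp to Timestamp, which is exactly the result that CSV inference now intercepts with its own match before falling back.

```scala
// Toy model of the date/timestamp widening in findTightestCommonType.
// DT and the case objects are stand-ins, not Spark's DataType hierarchy.
sealed trait DT
case object DateT extends DT
case object TimestampT extends DT
case object TimestampNTZT extends DT

def tightestCommon(t1: DT, t2: DT): DT = (t1, t2) match {
  case (TimestampT, DateT) | (DateT, TimestampT)                 => TimestampT
  case (TimestampT, TimestampNTZT) | (TimestampNTZT, TimestampT) => TimestampT
  case (TimestampNTZT, DateT) | (DateT, TimestampNTZT)           => TimestampNTZT
  case (a, b) if a == b                                          => a
  case (a, _)                                                    => a // unreachable for these three types
}
```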
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
// Date formats that could be parsed in DefaultTimestampFormatter
// Reference: DateTimeUtils.parseTimestampString
private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
Sorry, I don't quite get this part, would you be able to elaborate on why we need to keep a set of such formats?
More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.
Also, will this work? My understanding is that it will not; only yyyy-MM-dd is supported.
dateFormat = "yyyy/MM/dd"
timestampFormat = "yyyy/MM/dd HH:mm:ss"
We need this set to determine whether to infer a column with a mixture of dates and timestamps as TimestampType or StringType when no timestamp format is specified (in which case the lenient timestamp formatter is used).
dateFormat = "yyyy/MM/dd"
timestampFormat = "yyyy/MM/dd HH:mm:ss"
I don't quite understand your question about this case.
But in the context of this PR, because timestampFormat is specified, a column with a mix of dates and timestamps will be inferred as StringType.
More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.
Those are some interesting formats; I am not sure we need to take care of them here.
Can you elaborate on why we need LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS? I understand how it is used. Also, I am not a supporter of hardcoding date/time formats here.
When the timestamp format is not specified, the desired behavior is that a column with a mix of dates and timestamps can be inferred as timestamp type if the lenient timestamp formatter can parse the date strings in the column as well.
To achieve that without introducing other performance concerns, we simply check whether the date format is supported by the lenient timestamp formatter. Does that make sense?
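A sketch of that check as described here (the set entries and the Option-based signature are assumptions for illustration; the real constant is the PR's LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS):

```scala
// Sketch of the format check discussed above; illustrative only.
object DateAsTimestampCheck {
  // Assumed entries; stands in for LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.
  private val lenientSupportedDateFormats = Set("yyyy-MM-dd", "yyyy-M-d")

  // A date column can fold into a timestamp column only when no timestamp
  // format was supplied (so the lenient parser is in effect) and the date
  // format is one the lenient parser is known to handle.
  def canParseDateAsTimestamp(dateFormat: String, timestampFormat: Option[String]): Boolean =
    timestampFormat.isEmpty && lenientSupportedDateFormats.contains(dateFormat)
}
```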
Thanks @cloud-fan. Totally agree with those behaviors, and this PR is exactly making that happen.
Thanks @sadikovi. The logic of this PR as you described it is not accurate. The actual logic is that
Understand your concern; it's a trade-off to avoid introducing performance degradation.
Can one of the admins verify this patch?
@cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR; it seems all of the questions have been addressed. Thanks.
(TimestampNTZType, DateType) | (TimestampType, DateType) =>
  // For a column containing a mixture of dates and timestamps
  // infer it as timestamp type if its dates can be inferred as timestamp type
  // otherwise infer it as StringType
I don't quite understand the rationale here. Why can't we directly return Some(StringType)?
We want consistent behavior when the timestamp format is not specified.
When prefersDate=false, a column with mixed dates and timestamps could be inferred as timestamp if possible.
Thus, we added the additional handling here so the behavior is consistent with the above when prefersDate=true.
Does this make sense?
nvm, I got it
(TimestampNTZType, DateType) | (TimestampType, DateType) =>
  // For a column containing a mixture of dates and timestamps
  // infer it as timestamp type if its dates can be inferred as timestamp type
  // otherwise infer it as StringType
Let's enrich the comment a bit more:
// This only happens when the timestamp pattern is not specified, as the default
// timestamp parser is very lenient and can parse date strings as well.
private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
  if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
    LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)
Do we really need to cover these corner cases? We can just say that we can only parse date as timestamp if neither timestamp pattern nor date pattern is specified.
That would be a behavior change relative to the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as timestamp type when no timestamp pattern is specified.
 * timestamp values as dates if the values do not conform to the timestamp formatter before
 * falling back to the backward compatible parsing - the parsed values will be cast to timestamp
 * afterwards.
 * Infer columns with all valid date entries as date type (otherwise inferred as string type)
Suggested change:
- * Infer columns with all valid date entries as date type (otherwise inferred as string type)
+ * Infer columns with all valid date entries as date type (otherwise inferred as string or timestamp type)
Can we update the PR description to mention that
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
check(
  "legacy",
  "corrected",
To reduce the diff, can we still test legacy first?
Done
Thanks, merging to master!
What changes were proposed in this pull request?
This PR corrects the behavior of how columns with mixed dates and timestamps are handled in CSV schema inference and data parsing. Such a column will be inferred as TimestampType if possible, otherwise inferred as StringType. Here are the semantics of the changes:
In CSVInferSchema:
- Try to infer a field as DateType when prefersDate=true and typeSoFar=TimestampType/TimestampNTZType.
- When merging DateType and TimestampType/TimestampNTZType:
  - If timestampFormat/timestampNTZFormat is given, merge the two types into StringType.
  - Otherwise, if dateFormat could be parsed by the lenient timestamp formatter, merge the two types into TimestampType/TimestampNTZType.
  - Otherwise, merge the two types into StringType.

In UnivocityParser, remove the attempts to parse a field as Date if it failed to be parsed as Timestamp.

As an additional change, this PR also turns the default value of prefersDate to true.

Why are the changes needed?
Simplify CSV dateTime inference logic.
Does this PR introduce any user-facing change?
No, compared to the previous PR.
How was this patch tested?