@xiaonanyang-db
Contributor

@xiaonanyang-db xiaonanyang-db commented Sep 19, 2022

What changes were proposed in this pull request?

This PR corrects how columns containing a mix of dates and timestamps are handled in CSV schema inference and data parsing.

  • If the user specifies a timestamp format, such columns are always inferred as StringType.
  • If the user does not specify a timestamp format, such columns are inferred as TimestampType when possible, and as StringType otherwise.

Here are the semantics of the changes:

  • In CSVInferSchema

    • Remove the attempts to infer field as DateType when prefersDate=true and typeSoFar=TimestampType/TimestampNTZType
    • Change the dataType merging behavior between DateType and TimestampType/TimestampNTZType:
      • If the timestampFormat/timestampNTZFormat is given, merge the two types into StringType
      • Otherwise
        • if the dateFormat is supported by the lenient timestamp formatter, merge the two types into TimestampType/TimestampNTZType
        • otherwise, merge the two types into StringType
  • In UnivocityParser, remove the attempts to parse field as Date if it failed to be parsed as Timestamp.

As an additional change, this PR also changes the default value of prefersDate to true.
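The merging rule described above can be sketched in a few lines of plain Scala. This is only a model of the description, not the code in CSVInferSchema: the ColType hierarchy and the name mergeDateWithTimestamp are hypothetical stand-ins for Spark's DataType objects and the PR's actual helper.

```scala
object CsvDateTimestampMerge {
  // Hypothetical stand-ins for Spark's DateType / TimestampType / StringType.
  sealed trait ColType
  case object DateT extends ColType
  case object TimestampT extends ColType
  case object StringT extends ColType

  /** Type chosen for a column in which both dates and timestamps were seen.
    *
    * @param timestampFormatGiven      the user set timestampFormat/timestampNTZFormat
    * @param lenientSupportsDateFormat the lenient timestamp formatter accepts the dateFormat
    */
  def mergeDateWithTimestamp(
      timestampFormatGiven: Boolean,
      lenientSupportsDateFormat: Boolean): ColType =
    if (timestampFormatGiven) StringT            // strict parsing: dates cannot be timestamps
    else if (lenientSupportsDateFormat) TimestampT
    else StringT
}
```

Under this model, a user-specified timestamp format always wins: the column becomes a string column no matter what the date format looks like.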

Why are the changes needed?

Simplify CSV dateTime inference logic.

Does this PR introduce any user-facing change?

No, compared to the previous PR.

How was this patch tested?

@github-actions github-actions bot added the SQL label Sep 19, 2022
@xiaonanyang-db xiaonanyang-db changed the title SPARK-40474 Infer columns with mixed date and timestamp as String in CSV schema inference [SPARK-40474] Infer columns with mixed date and timestamp as String in CSV schema inference Sep 19, 2022
@xiaonanyang-db xiaonanyang-db changed the title [SPARK-40474] Infer columns with mixed date and timestamp as String in CSV schema inference [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference Sep 19, 2022
@github-actions github-actions bot added the DOCS label Sep 19, 2022
Member

@HyukjinKwon HyukjinKwon left a comment

I'm okay w/ this change.

Contributor

@sadikovi sadikovi left a comment

Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead.

Is it correct that the date inference is still controlled by "prefersDate"?

@xiaonanyang-db
Contributor Author

> Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead.
>
> Is it correct that the date inference is still controlled by "prefersDate"?

Sure!
Yes, it's still controlled by "prefersDate".

@sadikovi
Contributor

Let me re-review the change to use ISO8601 parsing only.

} else {
parameters.get("timestampFormat")
// Use Iso8601TimestampFormatter (with strict timestamp parsing) to
// avoid parsing dates in timestamp columns as timestamp type
Contributor

why do we need this change? when inferring schema, we always try to infer as date then timestamp, right?

Contributor Author

@xiaonanyang-db xiaonanyang-db Sep 20, 2022

  • There are two types of timestampFormatter: Default and Iso8601.

    • The default formatter uses the same parsing logic as CAST, which can parse a date value as a timestamp.
    • The Iso8601 formatter parses more strictly.
    • If no timestampFormat is given, we use the default one.
  • If a column contains a timestamp value first and dates after it, the default formatter would still parse the dates as timestamps and infer the column as timestamp type, which is inconsistent with what we are proposing now.

It’s pretty similar to the legacy parser problem.
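The difference between a strict and a lenient timestamp parser can be illustrated with java.time. This is not Spark's Iso8601TimestampFormatter or its default formatter; parseStrict, parseLenient and inferredAsTimestamp are hypothetical names, and the ISO-like pattern is an assumption chosen for the example.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object FormatterSketch {
  // Assumed pattern for the example; a real job would use its own timestampFormat.
  private val strictPattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

  // Strict: the whole string must match the timestamp pattern.
  def parseStrict(s: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(s, strictPattern)).toOption

  // Lenient: additionally accepts a date-only string, read as midnight of that day.
  def parseLenient(s: String): Option[LocalDateTime] =
    parseStrict(s).orElse(Try(LocalDate.parse(s).atStartOfDay()).toOption)

  // A column is inferred as timestamp only if every value parses.
  def inferredAsTimestamp(
      column: Seq[String],
      parse: String => Option[LocalDateTime]): Boolean =
    column.forall(v => parse(v).isDefined)
}
```

With a column such as Seq("2022-09-19T10:15:30", "2022-09-20"), the lenient parser accepts every value while the strict one rejects the second, which is the inconsistency discussed in this thread.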

Contributor

I see, but the CAST code is supposed to be more efficient than the formatter. Do we have some perf numbers? Is there a slowdown for schema inference with timestamp value only CSV files?

Contributor

Let me think about it as well, as there are different fallbacks with regard to date/timestamp parsing. IMHO, we would want some kind of flag to revert to the original behaviour, but that would be a lot of flags to configure.

It is possible that some users might be relying on a more lenient parsing of timestamps so if we switch to ISO8601 only, some of the jobs might need to be updated. Let me think a bit more on this.

Contributor Author

@xiaonanyang-db xiaonanyang-db Sep 20, 2022

Totally agree with your concerns @cloud-fan @sadikovi.

After some quick discussion within my team, we agreed not to change these lines, to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR becomes:

  • If the user provides a timestampFormat/timestampNTZFormat, we strictly parse fields as timestamps according to the format. Thus, columns with mixed dates and timestamps will always be inferred as StringType.
  • If the user specifies no timestampFormat/timestampNTZFormat, then for a column with mixed dates and timestamps:
    • If date values are before timestamp values
      • If prefersDate=true, the column will be inferred as StringType
      • otherwise the column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestampFormatter
    • If timestamp values are before date values
      • the column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestampFormatter

There is no behavior change when prefersDate=false.
Does this make sense to you? @sadikovi @cloud-fan
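The cases listed in this comment can be written as one decision function. This is a sketch of the comment only, not of the merged code (the PR description at the top gives the final semantics); the object, the Inferred type and the parameter names are hypothetical.

```scala
object MixedColumnInference {
  sealed trait Inferred
  case object AsString extends Inferred
  case object AsTimestamp extends Inferred

  /** Inferred type for a column mixing dates and timestamps.
    *
    * @param tsFormatGiven             user set timestampFormat/timestampNTZFormat
    * @param datesFirst                date values appear before timestamp values
    * @param prefersDate               value of the prefersDate option
    * @param lenientSupportsDateFormat lenient timestamp formatter accepts the dateFormat
    */
  def infer(
      tsFormatGiven: Boolean,
      datesFirst: Boolean,
      prefersDate: Boolean,
      lenientSupportsDateFormat: Boolean): Inferred =
    if (tsFormatGiven) AsString                  // strict parsing rejects the dates
    else if (datesFirst && prefersDate) AsString // column started out as DateType
    else if (lenientSupportsDateFormat) AsTimestamp
    else AsString
}
```

Note that the result depends on row order only when prefersDate=true and no timestamp format is given.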

cc @brkyvz @Yaohua628

Contributor Author

Made some follow-up changes, please check the updated description for the behavior after changes and semantics.

Contributor

@sadikovi sadikovi left a comment

Looks good, thanks for making the changes.

I left a few nit comments and questions, would appreciate it if you could take a look. Thanks.

*/
private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
(t1, t2) match {
Contributor

Should this match be in findCompatibleTypeForCSV? Or does findTightestCommonType merge DateType and TimestampType in a way that is not applicable here?

Contributor Author

@xiaonanyang-db xiaonanyang-db Sep 21, 2022

findTightestCommonType merges DateType and TimestampType in a way that is not applicable here.

Contributor

What result does findTightestCommonType return for DateType and TimestampType?

Contributor Author

(d1, d2) match {
      case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
        TimestampType

      case (_: TimestampType, _: TimestampNTZType) | (_: TimestampNTZType, _: TimestampType) =>
        TimestampType

      case (_: TimestampNTZType, _: DateType) | (_: DateType, _: TimestampNTZType) =>
        TimestampNTZType
    }

Contributor

Never mind, I checked the code, the resulting type will be TimestampType or TimestampNTZ.


// Date formats that could be parsed in DefaultTimestampFormatter
// Reference: DateTimeUtils.parseTimestampString
private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
Contributor

Sorry, I don't quite get this part, would you be able to elaborate on why we need to keep a set of such formats?

More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.

Also, will this work? My understanding is that it will not; only yyyy-MM-dd is supported.

dateFormat = "yyyy/MM/dd"
timestampFormat = "yyyy/MM/dd HH:mm:ss" 

Contributor Author

We need this set to decide whether a column with a mixture of dates and timestamps should be inferred as TimestampType or StringType when no timestamp format is specified (in which case the lenient timestamp formatter is used).

Contributor Author

@xiaonanyang-db xiaonanyang-db Sep 21, 2022

> dateFormat = "yyyy/MM/dd"
> timestampFormat = "yyyy/MM/dd HH:mm:ss"

I don't quite understand your question on this case.
But speaking in the context of this PR, because timestampFormat is specified, a column with a mix of dates and timestamps will be inferred as StringType.

Contributor Author

> More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.

Those are some interesting formats; I am not sure we need to take care of them here.

Contributor

Can you elaborate on why we need LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS? I understand how it is used. Also, I am not a supporter of hardcoding date/time formats here.

Contributor Author

@xiaonanyang-db xiaonanyang-db Sep 21, 2022

When no timestamp format is specified, the desired behavior is that a column with a mix of dates and timestamps can be inferred as timestamp type if the lenient timestamp formatter can also parse the date strings in that column.

To achieve that without introducing a performance concern, we simply check whether the date format is supported by the lenient timestamp formatter. Does that make sense?
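The trade-off described here, checking the date format once instead of re-parsing every value, can be sketched as a set lookup. The object and parameter names are hypothetical, and the set below contains a single illustrative entry; the formats the PR actually lists are not reproduced in this thread.

```scala
object LenientDateFormatCheck {
  // Illustrative content only (assumption): date formats that the lenient
  // timestamp formatter is taken to understand.
  val supportedDateFormats: Set[String] = Set("yyyy-MM-dd")

  /** True when dates written with `dateFormat` may be merged into a timestamp column.
    * The check is a constant-time lookup on the format string, so no value
    * in the column is parsed a second time.
    */
  def canParseDateAsTimestamp(dateFormat: String, timestampFormat: Option[String]): Boolean =
    timestampFormat.isEmpty && supportedDateFormats.contains(dateFormat)
}
```

For example, canParseDateAsTimestamp("yyyy/MM/dd", None) is false under this assumed set, so such a mixed column would fall back to StringType.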

@xiaonanyang-db
Contributor Author

xiaonanyang-db commented Sep 21, 2022

> There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies a datetime pattern or not.
>
>   1. pure date + no datetime pattern: infer as date type
>   2. pure timestamp + no datetime pattern: infer as timestamp type
>   3. mixture + no datetime pattern: infer as timestamp type
>   4. pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
>   5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
>   6. mixture + datetime pattern: I think this is where the problem occurs. We first parse the data as a date and, if that fails, try to parse it as a timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support a mixture of dates and timestamps in this case. If prefersDate is true, only try to infer as date type, otherwise only try to infer as timestamp.

Thanks @cloud-fan. Totally agree with those behaviors, and this PR is exactly making that happen.
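The six cases can be condensed into a small lookup, sketched below in plain Scala. The names are hypothetical, and case 6 is modelled as "string" on the assumption that a single date-only or timestamp-only attempt cannot match every value of a mixed column.

```scala
object DatetimeInferenceCases {
  sealed trait Content
  case object PureDate extends Content
  case object PureTimestamp extends Content
  case object Mixture extends Content

  /** Inferred type name for a column.
    *
    * @param patternMatches None if the user gave no datetime pattern;
    *                       Some(m) if a pattern was given, m telling whether the data matches it
    */
  def infer(content: Content, patternMatches: Option[Boolean]): String =
    (content, patternMatches) match {
      case (PureDate, None)            => "date"      // case 1
      case (PureTimestamp, None)       => "timestamp" // case 2
      case (Mixture, None)             => "timestamp" // case 3
      case (PureDate, Some(true))      => "date"      // case 4, pattern matches
      case (PureTimestamp, Some(true)) => "timestamp" // case 5, pattern matches
      case (Mixture, Some(_))          => "string"    // case 6, mixture unsupported
      case (_, Some(false))            => "string"    // cases 4 and 5, no match
    }
}
```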

@xiaonanyang-db
Contributor Author

xiaonanyang-db commented Sep 21, 2022

> I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible.
>
> I am just not sure about hardcoding date formats in the CSVInferSchema parser.

Thanks @sadikovi.

The logic of this PR you described is not accurate. The actual logic is that

  • If the user specifies a timestamp format, columns with mixed dates and timestamps will be inferred as String type.
  • If the user does not specify any timestamp format, columns with mixed dates and timestamps could be inferred as timestamp if possible, regardless of whether prefersDate is set to true.

I understand your concern; hardcoding the formats is a trade-off to avoid introducing a performance degradation.

@AmplabJenkins

Can one of the admins verify this patch?

@xiaonanyang-db xiaonanyang-db changed the title [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference when timestamp format is specified Sep 21, 2022
@xiaonanyang-db xiaonanyang-db changed the title [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference when timestamp format is specified [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps Sep 21, 2022
@sadikovi
Contributor

@cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR, seems that all of the questions have been addressed. Thanks.

(TimestampNTZType, DateType) | (TimestampType, DateType) =>
// For a column containing a mixture of dates and timestamps
// infer it as timestamp type if its dates can be inferred as timestamp type
// otherwise infer it as StringType
Contributor

I don't quite understand the rationale here. why can't we directly return Some(StringType)?

Contributor Author

@xiaonanyang-db xiaonanyang-db Sep 22, 2022

We want to have consistent behavior when timestamp format is not specified.

When prefersDate=false, a column with mixed date and timestamp could be inferred as timestamp if possible.
Thus, we added the additional handling here for a consistent behavior as above when prefersDate=true.

Does this make sense?

Contributor

nvm, I got it

(TimestampNTZType, DateType) | (TimestampType, DateType) =>
// For a column containing a mixture of dates and timestamps
// infer it as timestamp type if its dates can be inferred as timestamp type
// otherwise infer it as StringType
Contributor

let's enrich the comment a bit more

This only happens when the timestamp pattern is not specified, as the default
timestamp parser is very lenient and can parse date string as well.

private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
(tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)
Contributor

@cloud-fan cloud-fan Sep 22, 2022

Do we really need to cover these corner cases? We can just say that we can only parse date as timestamp if neither timestamp pattern nor date pattern is specified.

Contributor Author

@xiaonanyang-db xiaonanyang-db Sep 22, 2022

That would be a behavior change relative to the Spark 3.3 branch, where a column with mixed dates and timestamps can be inferred as timestamp type, if possible, when no timestamp pattern is specified.

* timestamp values as dates if the values do not conform to the timestamp formatter before
* falling back to the backward compatible parsing - the parsed values will be cast to timestamp
* afterwards.
* Infer columns with all valid date entries as date type (otherwise inferred as string type)
Contributor

Suggested change
- * Infer columns with all valid date entries as date type (otherwise inferred as string type)
+ * Infer columns with all valid date entries as date type (otherwise inferred as string or timestamp type)

@cloud-fan
Contributor

Can we update PR description to mention that prefersDate is by default true now?


check(
"legacy",
"corrected",
Contributor

to reduce the diff, can we still test legacy first?

Contributor Author

Done

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 541b830 Sep 23, 2022
@xiaonanyang-db xiaonanyang-db deleted the SPARK-40474 branch September 23, 2022 16:29

6 participants