[SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps #37933
Conversation
…CSV schema inference (force-pushed from ac5bd1d to 0394030)
HyukjinKwon left a comment:
I'm okay w/ this change.
sadikovi left a comment:
Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead.
Is it correct that the date inference is still controlled by "prefersDate"?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
Sure!
Let me re-review the change to use ISO8601 parsing only.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
} else {
  parameters.get("timestampFormat")
// Use Iso8601TimestampFormatter (with strict timestamp parsing) to
// avoid parsing dates in timestamp columns as timestamp type
Why do we need this change? When inferring the schema, we always try to infer as date first, then timestamp, right?
There are two types of timestampFormatter: Default and Iso8601.
- The default formatter uses the same parsing logic as CAST, which can parse a date value as a timestamp.
- The Iso8601 formatter has stricter parsing.
- If no timestampFormat is given, we will use the default one.

In the case of a column containing timestamp values first and dates following, if we use the default formatter, it would still parse the dates as timestamps and infer the column as timestamp type, which is inconsistent with what we are proposing now. It's pretty similar to the legacy parser problem.
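As a rough illustration of this difference (a standalone sketch using plain java.time, not Spark's actual formatter classes; the pattern and the midnight fallback are assumptions), a strict pattern-based parser rejects a bare date string, while a CAST-like lenient parser accepts it:

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object LenientVsStrict {
  // Strict timestamp parsing: a bare date string fails the full pattern.
  private val strictFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  def parseStrict(s: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(s, strictFmt)).toOption

  // Lenient, CAST-like parsing: if the timestamp pattern fails,
  // accept a bare date and read it as midnight of that day.
  def parseLenient(s: String): Option[LocalDateTime] =
    parseStrict(s).orElse(Try(LocalDate.parse(s).atStartOfDay()).toOption)
}
```

Here `parseStrict("2022-01-01")` yields None while `parseLenient("2022-01-01")` yields a timestamp at midnight, which is why a timestamp column that later contains dates can still be inferred as timestamp under the lenient formatter.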
I see, but the CAST code is supposed to be more efficient than the formatter. Do we have some perf numbers? Is there a slowdown in schema inference for CSV files containing only timestamp values?
Let me think about it as well, as there are different fallbacks with regard to date/timestamp parsing. IMHO, we would want some kind of flag to revert to the original behaviour, but that would be a lot of flags to configure.
It is possible that some users are relying on more lenient parsing of timestamps, so if we switch to ISO8601 only, some jobs might need to be updated. Let me think a bit more on this.
Totally agree with your concerns @cloud-fan @sadikovi.
After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and other behavior changes. Thus, the behavior after this PR becomes:
- If the user provides a timestampFormat/timestampNTZFormat, we will strictly parse fields as timestamps according to the format. Thus, columns with mixed dates and timestamps will always be inferred as StringType.
- If no timestampFormat/timestampNTZFormat is specified by the user, for a column with mixed dates and timestamps:
  - If date values come before timestamp values:
    - If prefersDate=true, the column will be inferred as StringType.
    - Otherwise, the column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestampFormatter.
  - If timestamp values come before date values:
    - The column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestampFormatter.

There is no behavior change when prefersDate=false.
Does this make sense to you? @sadikovi @cloud-fan
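The behavior listed above can be condensed into a small decision function. This is an illustrative sketch only; the names and the string-valued result are assumptions, not Spark's actual API:

```scala
// Sketch of the inference outcome for a column mixing dates and timestamps.
// All names here are illustrative, not Spark's actual API.
object MixedColumnInference {
  def inferredType(
      userGaveTimestampFormat: Boolean,  // timestampFormat/timestampNTZFormat set?
      prefersDate: Boolean,
      datesFirst: Boolean,               // do date values precede timestamp values?
      lenientParsesDateFormat: Boolean): String =
    if (userGaveTimestampFormat) "StringType"       // strict parsing: mixed column -> string
    else if (datesFirst && prefersDate) "StringType"
    else if (lenientParsesDateFormat) "TimestampType"
    else "StringType"
}
```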
Made some follow-up changes, please check the updated description for the behavior after changes and semantics.
…ce properly when lenient timestamp formatter is used
sadikovi left a comment:
Looks good, thanks for making the changes.
I left a few nit comments and questions, would appreciate it if you could take a look. Thanks.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 */
private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
  TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
  (t1, t2) match {
Should this match be in findCompatibleTypeForCSV? Or does findTightestCommonType merge DateType and TimestampType in a way that is not applicable here?
Yes, findTightestCommonType merges DateType and TimestampType in a way that is not applicable here.
What result does findTightestCommonType return for DateType and TimestampType?
(d1, d2) match {
case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
TimestampType
case (_: TimestampType, _: TimestampNTZType) | (_: TimestampNTZType, _: TimestampType) =>
TimestampType
case (_: TimestampNTZType, _: DateType) | (_: DateType, _: TimestampNTZType) =>
TimestampNTZType
}
Never mind, I checked the code, the resulting type will be TimestampType or TimestampNTZ.
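A toy model of the widening quoted above (a sketch over stand-in case objects, not TypeCoercion itself): the default merge widens Date with Timestamp to Timestamp, which is exactly the result that CSV inference now intercepts with its own match before falling back.

```scala
// Toy model of the date/timestamp widening in findTightestCommonType.
// DT and the case objects are stand-ins, not Spark's DataType hierarchy.
sealed trait DT
case object DateT extends DT
case object TimestampT extends DT
case object TimestampNTZT extends DT

def tightestCommon(t1: DT, t2: DT): DT = (t1, t2) match {
  case (TimestampT, DateT) | (DateT, TimestampT)                 => TimestampT
  case (TimestampT, TimestampNTZT) | (TimestampNTZT, TimestampT) => TimestampT
  case (TimestampNTZT, DateT) | (DateT, TimestampNTZT)           => TimestampNTZT
  case (a, b) if a == b                                          => a
  case (a, _)                                                    => a // unreachable for these three types
}
```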
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
// Date formats that could be parsed in DefaultTimestampFormatter
// Reference: DateTimeUtils.parseTimestampString
private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
Sorry, I don't quite get this part, would you be able to elaborate on why we need to keep a set of such formats?
More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.
Also, will this work? My understanding is that it will not; only yyyy-MM-dd is supported.
dateFormat = "yyyy/MM/dd"
timestampFormat = "yyyy/MM/dd HH:mm:ss"
We need this set to determine whether to infer a column with a mixture of dates and timestamps as TimestampType or StringType when no timestamp format is specified (in which case the lenient timestamp formatter is used).
dateFormat = "yyyy/MM/dd"
timestampFormat = "yyyy/MM/dd HH:mm:ss"
I don't quite understand your question about this case.
But in the context of this PR, because timestampFormat is specified, a column with a mix of dates and timestamps will be inferred as StringType.
More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.
Those are some interesting formats; I am not sure we need to take care of them here.
Can you elaborate on why we need LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS? I understand how it is used. Also, I am not a supporter of hardcoding date/time formats here.
When the timestamp format is not specified, the desired behavior is that a column with a mix of dates and timestamps can be inferred as timestamp type if the lenient timestamp formatter can parse the date strings in the column as well.
To achieve that without introducing other performance concerns, we simply check whether the date format is supported by the lenient timestamp formatter. Does that make sense?
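A sketch of that check as described here (the set entries and the Option-based signature are assumptions for illustration; the real constant is the PR's LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS):

```scala
// Sketch of the format check discussed above; illustrative only.
object DateAsTimestampCheck {
  // Assumed entries; stands in for LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.
  private val lenientSupportedDateFormats = Set("yyyy-MM-dd", "yyyy-M-d")

  // A date column can fold into a timestamp column only when no timestamp
  // format was supplied (so the lenient parser is in effect) and the date
  // format is one the lenient parser is known to handle.
  def canParseDateAsTimestamp(dateFormat: String, timestampFormat: Option[String]): Boolean =
    timestampFormat.isEmpty && lenientSupportedDateFormats.contains(dateFormat)
}
```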
Thanks @cloud-fan. Totally agree with those behaviors, and this PR is exactly making that happen.
Thanks @sadikovi. The logic of this PR as you described it is not accurate. The actual logic is that
Understand your concern; it's a trade-off to avoid introducing performance degradation.
Can one of the admins verify this patch?
@cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR; it seems all of the questions have been addressed. Thanks.
(TimestampNTZType, DateType) | (TimestampType, DateType) =>
  // For a column containing a mixture of dates and timestamps
  // infer it as timestamp type if its dates can be inferred as timestamp type
  // otherwise infer it as StringType
I don't quite understand the rationale here. Why can't we directly return Some(StringType)?
We want consistent behavior when the timestamp format is not specified.
When prefersDate=false, a column with mixed dates and timestamps could be inferred as timestamp if possible.
Thus, we added the additional handling here so the behavior is consistent with the above when prefersDate=true.
Does this make sense?
nvm, I got it
(TimestampNTZType, DateType) | (TimestampType, DateType) =>
  // For a column containing a mixture of dates and timestamps
  // infer it as timestamp type if its dates can be inferred as timestamp type
  // otherwise infer it as StringType
Let's enrich the comment a bit more:
// This only happens when the timestamp pattern is not specified, as the default
// timestamp parser is very lenient and can parse date strings as well.
private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = {
  if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) ||
      (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) {
    LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat)
Do we really need to cover these corner cases? We can just say that we can only parse date as timestamp if neither timestamp pattern nor date pattern is specified.
That would be a behavior change relative to the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as timestamp type when no timestamp pattern is specified.
 * timestamp values as dates if the values do not conform to the timestamp formatter before
 * falling back to the backward compatible parsing - the parsed values will be cast to timestamp
 * afterwards.
 * Infer columns with all valid date entries as date type (otherwise inferred as string type)
Suggested change:
- * Infer columns with all valid date entries as date type (otherwise inferred as string type)
+ * Infer columns with all valid date entries as date type (otherwise inferred as string or timestamp type)
Can we update the PR description to mention that
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
check(
  "legacy",
  "corrected",
To reduce the diff, can we still test legacy first?
Done
Thanks, merging to master!
What changes were proposed in this pull request?
This PR corrects the behavior of how columns with mixed dates and timestamps are handled in CSV schema inference and data parsing. Such a column will be inferred as TimestampType if possible, otherwise inferred as StringType. Here are the semantics of the changes:
In CSVInferSchema:
- Try to infer a field as DateType when prefersDate=true and typeSoFar=TimestampType/TimestampNTZType.
- When merging DateType and TimestampType/TimestampNTZType:
  - If timestampFormat/timestampNTZFormat is given, merge the two types into StringType.
  - Otherwise, if dateFormat could be parsed by the lenient timestamp formatter, merge the two types into TimestampType/TimestampNTZType.
  - Otherwise, merge the two types into StringType.

In UnivocityParser, remove the attempts to parse a field as Date if it failed to be parsed as Timestamp.

As an additional change, this PR also turns the default value of prefersDate to true.

Why are the changes needed?
Simplify CSV dateTime inference logic.
Does this PR introduce any user-facing change?
No, compared to the previous PR.
How was this patch tested?