[SPARK-39469][SQL] Infer date type for CSV schema inference #36871
Conversation
Can one of the admins verify this patch?
(force-pushed from 6120782 to 1630370)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala (outdated review thread, resolved)
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala (outdated review thread, resolved)
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala (outdated review thread, resolved)
One test we might need would be: "timestampFormat" -> "dd/MM/yyyy HH:mm" and "dateFormat" -> "dd/MM/yyyy", to make sure timestamps are not parsed as date types without conflicting.
to make sure timestamps are not parsed as date types without conflicting.
That's actually what happens:
Before this PR:
scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: timestamp]
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: timestamp (nullable = true)
scala>
After this PR:
scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: date]
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: date (nullable = true)
scala>
It looks like some tests fail too, like CSVInferSchemaSuite and CSVv1Suite, and possibly others (I ran these two suites on my laptop; for some reason, GitHub Actions didn't run tests for this PR. Maybe @Jonathancui123 needs to turn them on in his fork?).
We should probably (1) add either a SQL configuration or an option, e.g., infersDate.
I think you would need something like this: when set, the date formatter could use the slower, stricter method of parsing (so "2012-01-01 12:00:00" wouldn't parse as a date).
Edit: To do strict parsing, one might need to use ParsePosition and check that the whole date/time string was consumed. Even after setting lenient=false, SimpleDateFormat.parse didn't complain about extra characters that weren't consumed.
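A minimal illustration of that lenient-suffix behavior:

import java.text.SimpleDateFormat

val sdf = new SimpleDateFormat("yyyy-MM-dd")
sdf.setLenient(false)
// Does NOT throw: parsing stops after consuming "2012-01-01" and the
// trailing " 12:00:00" is silently ignored.
val d = sdf.parse("2012-01-01 12:00:00")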
I addressed inference mistakes in the following code snippet and comment
Hm, I don't get this case. If the schema is TimestampType, the output here should always be timestamps.
Consider a column with a date entry followed by a timestamp entry. We would expect this column to be inferred as a TimestampType column.
Thus, when parsing the column, the timestamp converter will fail on the date entry, so we will need to try to convert it with the date converter (see the sketch below). If both converters fail, we throw an error.
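A hedged, self-contained sketch of that fallback (illustrative names, not the actual UnivocityParser code): parse as a timestamp first; if that fails, try the date parser and promote the date to midnight.

import java.time.{LocalDate, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

val tsFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

// Returns microseconds since the epoch (UTC assumed for simplicity).
def toTimestampMicros(datum: String): Long =
  Try(LocalDateTime.parse(datum, tsFmt))
    .orElse(Try(LocalDate.parse(datum, dateFmt).atStartOfDay())) // date fallback
    .map(_.toInstant(ZoneOffset.UTC).toEpochMilli * 1000L)
    .getOrElse(throw new IllegalArgumentException(s"Cannot parse: $datum"))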
Took a cursory look. @MaxGekk do you remember the context here? I remember we didn't merge this change because the legacy fast format parser (Java 8 time libraries) did not support the exact matching (e.g., "yyyy" parses "2000-10-12" as "2000").
cc @bersprockets too if you find some time to review.
@Jonathancui123 You probably want to turn on GitHub Actions so tests will run. From https://spark.apache.org/contributing.html:
(force-pushed from dcbe9e8 to 1cd55c7)
I've added a
The change to parsing behavior is necessary because:
One test we might need would be "timestampFormat" -> "dd/MM/yyyy HH:mm" and "dateFormat" -> "dd/MM/yyyy" to make sure timestamps are not parsed as date types without conflicting.
This test uses:
"timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
"dateFormat" -> "yyyy-MM-dd",
This e2e test ensures that our DateFormatter uses strict parsing: we will not infer timestamp columns as date columns even if the dateFormat is a prefix of the timestampFormat.
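A hedged sketch of that end-to-end check (assumes spark.implicits._ is in scope; the inferDate option name is from this PR):

val ds = Seq("2010-10-05T01:23", "2011-11-06T02:24").toDS()
val schema = spark.read
  .options(Map(
    "inferSchema" -> "true",
    "inferDate" -> "true",
    "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
    "dateFormat" -> "yyyy-MM-dd"))
  .csv(ds)
  .schema
// Strict date parsing leaves "T01:23" unconsumed, so the values are not
// misread as dates and the column is inferred as TimestampType.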
Thank you for the review! @HyukjinKwon @bersprockets
We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes. We want to throw an error when invalid timestamps are given.
e.g. The legacy DateFormatter will parse the following string without throwing an error:
dateFormat: yyyy-MM-dd
string: 2001-09-08-randomtext
Yeah, I think it makes sense to throw an exception or to disallow it when the legacy parser is used (which doesn't care about suffixes).
We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes.
I think you could still make it work, but you would need a new extension of LegacySimpleDateFormatter (maybe LegacyStrictSimpleDateFormatter), with an override like this:
// Assumes sdf is the formatter's underlying java.text.SimpleDateFormat,
// ParsePosition is java.text.ParsePosition, and Date is java.util.Date.
def parseToDate(s: String): Date = {
  val pp = new ParsePosition(0)
  val res = sdf.parse(s, pp)
  // Strict check: reject the input unless the whole string was consumed.
  if (s.length != pp.getIndex) {
    throw new RuntimeException(s"$s is not a date")
  }
  res
}
2001-09-08-randomtext would not parse, and neither would 2022-01-02 12:56:33, but 2022-01-02 would (assuming a format of yyyy-MM-dd).
I assume it would be slow (but I have not tested it).
Maybe not worth the extra code.
@bersprockets Thanks for the suggestion! Do you know what the advantage of allowing the legacy formatter is? That is, what is a date format that the legacy formatter can handle but the current formatter cannot?
I'm wondering whether there will be a sufficient population of users who want to infer dates in the schema and also use legacy date formats.
cc: @Yaohua628
Do you know what is the advantage of allowing Legacy Formatter?
One benefit of the legacy formatter is that it recognizes some pre-Gregorian leap years (like 1500-02-29) that exist only in the hybrid Julian calendar. Note how schema inference chooses string until you set the legacy parser.
scala> val csvInput = Seq("1425-03-22T00:00:00", "2022-01-01T00:00:00", "1500-02-29T00:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
root
|-- _c0: string (nullable = true)
scala> sql("set spark.sql.legacy.timeParserPolicy=legacy")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
root
|-- _c0: timestamp (nullable = true)
scala>
That, of course, matters only if the application's input comes from legacy systems that still use hybrid Julian, and the input contains pre-Gregorian dates (e.g., for date encoding, which is the only real-world use case I have come across). I would imagine that audience is small and probably getting smaller.
I think you could still make it work, but you would need a new extension of LegacySimpleDateFormatter
By the way, to avoid confusion, I meant the above in the context of inferring dates when using the legacy parser (I realize now that this discussion is happening in reference to code changes in UnivocityParser).
Thanks Bruce! This is great context! This will definitely be necessary if we want to support inference along with legacy date formats. Users on legacy dates will be unaffected by this change. How about we open another ticket for date inference with legacy formats if the demand exists (and merge this PR without legacy date inference support)?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala (outdated review thread, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala (outdated review thread, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala (outdated review thread, resolved)
The logic makes sense in general, but it would be best to have a second look from @bersprockets @MaxGekk @gengliangwang.
One test failed at
I added a new
(force-pushed from df7146e to e1170d0)
This reverts commit e1170d0.
Hi folks, @HyukjinKwon @bersprockets @cloud-fan, thanks for reviewing and for the great suggestions. Is this PR good to go? Thanks!
thanks, merging to master!
@cloud-fan @Jonathancui123 Wouldn't this patch cause correctness issues? This is what I found when working on #37147: the "SPARK-39469: Infer schema for date type" test in CSVSuite highlights the issue when run together with my patch, which attempts to forbid users from falling back to the default parser when the timestamp format is provided, as that could lead to correctness issues. Because "1765-03-28" does not match the timestamp pattern and the column is inferred as TimestampType, it should be returned as
Technically my patch corrects this, but I fixed the tests by enabling the incorrect behaviour with the flag here: a447b08. Can someone take a look and clarify this one?
Maybe it is just the test, so I can update that in my PR, but I would like to clarify the expected behaviour here.
// compatibility.
val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
DateTimeUtils.stringToDate(str).getOrElse(throw e)
DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
I think the issue here is: if the timestamp parsing fails, maybe it's because this is a date, or maybe it's a legacy timestamp format. We need to define the priority here. Since inferDate is opt-in, I think it makes more sense to try parsing as a date first, then the legacy format.
cc @sadikovi
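A hedged sketch of that priority (helper names are illustrative, not Spark's actual API): once strict timestamp parsing fails, try the opt-in date parser first, then the legacy timestamp fallback.

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def fallback(datum: String,
             legacyTimestamp: String => Option[Long]): Option[Any] =
  Try(LocalDate.parse(datum, dateFmt)).toOption // 1) date first (inferDate is opt-in)
    .orElse(legacyTimestamp(datum))             // 2) then the legacy timestamp format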
Just wondering: have all the issues mentioned by @HyukjinKwon in my PR #23202 (comment) been addressed by this PR?
Agreed. We should address the order. Otherwise, it is unclear how to handle the fallback. Fixed here: 10ca4a4.
@sadikovi I've verified on an older version of Spark, prior to this PR: "1765-03-28" in a timestamp column without a user-specified format is parsed as "1765-03-28 00:00:00.0". So the behavior of parsing default-format dates in timestamp columns is not due to this PR.
Prior to changes, in a timestamp column:
After the inferDate PR, in a timestamp column:
After the enableParsingFallbackForDateType PR (#37147), in a timestamp column with
Since default-format dates were previously parsed by the fallback, I thought it was desirable for dates to be parsed in a timestamp column, so I included support for custom-format dates in a timestamp column when
We have two options for target behavior:
OPTION A - Target behavior, in a timestamp column:
OPTION B - Target behavior, in a timestamp column:
@cloud-fan @HyukjinKwon should we go with Option A or Option B?
So am I right that Option A considers the type coercion between timestamps and dates, and Option B does not? I personally prefer Option B, so we can switch to Option A in the future. Once we pick Option A, it's difficult to go back to Option B.
I think my main confusion comes from inferDate=true turning an invalid timestamp value into a date and then returning it as a timestamp; the column should have been a DateType column. @Jonathancui123 Would it be possible to revisit this behaviour? I agree with Wenchen; we may need to decide whether to parse it as a legacy timestamp or infer a date.
@Jonathancui123 I fixed this issue in 10ca4a4 on my PR. Can you review? Thanks.
@sadikovi mind opening a PR?
It is a small change, so I fixed it in my PR #37147.
@sadikovi Thanks for the change! I agree with it and I've left a comment.
…ics of the option in CSV data source

### What changes were proposed in this pull request?

This is a follow-up for #36871. The PR renames `inferDate` to `prefersDate` to avoid confusion when date inference would change the column type and the user meant to infer timestamps. The patch also updates the semantics of the option: `prefersDate` is allowed to be used during schema inference (`inferSchema`) as well as with a user-provided schema, where it could be used as a fallback mechanism when parsing timestamps.

### Why are the changes needed?

Fixes ambiguity when setting `prefersDate` to true and clarifies the semantics of the option.

### Does this PR introduce _any_ user-facing change?

Although it is an option rename, the original PR was merged a few days ago and the config option has not been included in a Spark release.

### How was this patch tested?

I added a unit test for prefersDate = true with a user schema.

Closes #37327 from sadikovi/rename_config.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

Adds an `inferDate` option to CSV options. The option's description is: An error will be thrown if `inferDate` is true when the SQL configuration LegacyTimeParserPolicy is `LEGACY`. This is to avoid incorrect schema inference from legacy time parsers that do not do strict parsing. The `inferDate` option should prevent performance degradation for users who don't opt in.

If `typeSoFar` in `inferField` is Date, Timestamp, or TimestampNTZ, we will first attempt to parse Date and then parse Timestamp/TimestampNTZ. The reason we attempt to parse a date when `typeSoFar` is Timestamp/TimestampNTZ is the case where a column contains a timestamp entry and then a date entry: we should detect both data types and infer the column as a timestamp type.

Example:
Result:

Updates `makeConverter` in `UnivocityParser` to handle date entries in a timestamp-type column, so the above example is parsed properly.

Does this PR introduce any user-facing change?

The new behavior of schema inference when `inferDate = true`:
- If the date format and the timestamp format are identical (e.g. both are yyyy/mm/dd), entries will default to being interpreted as Date
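A hedged usage sketch of the option (assumes spark.implicits._ is in scope; the option was later renamed to prefersDate in the follow-up #37327):

val csvInput = Seq("0,2012-01-01", "1,2021-07-01 15:00:00").toDS()
val df = spark.read
  .options(Map("inferSchema" -> "true", "inferDate" -> "true"))
  .csv(csvInput)
df.printSchema()
// _c1 mixes a date and a timestamp, so it is inferred as TimestampType
// and the bare date is promoted to midnight of that day.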
How was this patch tested?

Unit tests were added to `CSVInferSchemaSuite` and `UnivocityParserSuite`. An end-to-end test was added to `CSVSuite`.

Benchmarks: `inferDate` increases parsing/inference time in general. The impact scales with the number of rows (not the number of columns). For columns of date type (which would be inferred as timestamp when `inferDate = false`), inference and parsing take 30% longer. The performance impact is much greater on columns of timestamp type (taking 30x longer than `inferDate = false`), due to trying each timestamp as a date (and throwing an error) during the inference step.

Number of seconds taken to parse each CSV file with `inferDate = true` and `inferDate = false`: results are the average of 3 trials on the same machine.

Over multiple runs, master branch benchmark times have also shown results slower than `inferDate = false` (although the average is slightly faster). Given the +/- 20% variance between trials, master branch benchmark results are roughly similar to the `inferDate = false` results.