[SPARK-39469][SQL] Infer date type for CSV schema inference #36871
Conversation
Can one of the admins verify this patch?
(force-pushed from 6120782 to 1630370)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala (outdated review thread, resolved)
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala (outdated review thread, resolved)
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala (outdated review thread, resolved)
One test we might need would be: "timestampFormat" -> "dd/MM/yyyy HH:mm" and "dateFormat" -> "dd/MM/yyyy", to make sure timestamps are not parsed as date types without conflicting.
to make sure timestamps are not parsed as date types without conflicting.
That's actually what happens:
Before this PR:
scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: timestamp]
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: timestamp (nullable = true)
scala>
After this PR:
scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: date]
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: date (nullable = true)
scala>
It looks like some tests fail too, like CSVInferSchemaSuite and CSVv1Suite, and possibly others (I ran these two suites on my laptop; for some reason, GitHub Actions didn't run tests for this PR. Maybe @Jonathancui123 needs to turn them on in his fork?).
We should probably (1) add either a SQL configuration or an option, e.g., infersDate.
I think you would need something like this: when set, the date formatter could use the slower, stricter method of parsing (so "2012-01-01 12:00:00" wouldn't parse as a date).
Edit: To do strict parsing, one might need to use ParsePosition and check that the whole date/time string was consumed. Even after setting lenient=false, SimpleDateFormat.parse didn't complain about extra characters that weren't consumed.
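A minimal illustration of that lenient-suffix behavior:

import java.text.SimpleDateFormat

val sdf = new SimpleDateFormat("yyyy-MM-dd")
sdf.setLenient(false)
// Does NOT throw: parsing stops after consuming "2012-01-01" and the
// trailing " 12:00:00" is silently ignored.
val d = sdf.parse("2012-01-01 12:00:00")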
I addressed inference mistakes in the following code snippet and comment
Hm, I don't get this case. If the schema is TimestampType, the output here should always be timestamps.
Consider a column with a date entry followed by a timestamp entry. We would expect this column to be inferred as a TimestampType column.
Thus, when parsing the column, the timestamp converter will fail on the date entry, so we will need to try to convert it with the date converter (see the sketch below). If both converters fail, we throw an error.
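A hedged, self-contained sketch of that fallback (illustrative names, not the actual UnivocityParser code): parse as a timestamp first; if that fails, try the date parser and promote the date to midnight.

import java.time.{LocalDate, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

val tsFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

// Returns microseconds since the epoch (UTC assumed for simplicity).
def toTimestampMicros(datum: String): Long =
  Try(LocalDateTime.parse(datum, tsFmt))
    .orElse(Try(LocalDate.parse(datum, dateFmt).atStartOfDay())) // date fallback
    .map(_.toInstant(ZoneOffset.UTC).toEpochMilli * 1000L)
    .getOrElse(throw new IllegalArgumentException(s"Cannot parse: $datum"))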
Took a cursory look. @MaxGekk do you remember the context here? I remember we didn't merge this change because the legacy fast format parser (Java 8 time libraries) did not support the exact matching (e.g., "yyyy" parses "2000-10-12" as "2000").
cc @bersprockets too if you find some time to review.
@Jonathancui123 You probably want to turn on GitHub Actions so tests will run. From https://spark.apache.org/contributing.html:
(force-pushed from dcbe9e8 to 1cd55c7)
I've added a
The change to parsing behavior is necessary because:
One test we might need would be "timestampFormat" -> "dd/MM/yyyy HH:mm" and "dateFormat" -> "dd/MM/yyyy" to make sure timestamps are not parsed as date types without conflicting.
This test uses:
"timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
"dateFormat" -> "yyyy-MM-dd",
This e2e test ensures that our DateFormatter uses strict parsing: we will not infer timestamp columns as date columns even if the dateFormat is a prefix of the timestampFormat.
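A hedged sketch of that end-to-end check (assumes spark.implicits._ is in scope; the inferDate option name is from this PR):

val ds = Seq("2010-10-05T01:23", "2011-11-06T02:24").toDS()
val schema = spark.read
  .options(Map(
    "inferSchema" -> "true",
    "inferDate" -> "true",
    "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
    "dateFormat" -> "yyyy-MM-dd"))
  .csv(ds)
  .schema
// Strict date parsing leaves "T01:23" unconsumed, so the values are not
// misread as dates and the column is inferred as TimestampType.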
Thank you for the review! @HyukjinKwon @bersprockets
We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes. We want to throw an error when invalid timestamps are given.
e.g. The legacy DateFormatter will parse the following string without throwing an error:
dateFormat: yyyy-MM-dd
string: 2001-09-08-randomtext
Yeah, I think it makes sense to throw an exception or to disallow it when the legacy parser is used (which doesn't care about suffixes).
We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes.
I think you could still make it work, but you would need a new extension of LegacySimpleDateFormatter (maybe LegacyStrictSimpleDateFormatter), with an override like this:
// Assumes sdf is the formatter's underlying java.text.SimpleDateFormat,
// ParsePosition is java.text.ParsePosition, and Date is java.util.Date.
def parseToDate(s: String): Date = {
  val pp = new ParsePosition(0)
  val res = sdf.parse(s, pp)
  // Strict check: reject the input unless the whole string was consumed.
  if (s.length != pp.getIndex) {
    throw new RuntimeException(s"$s is not a date")
  }
  res
}
2001-09-08-randomtext would not parse, and neither would 2022-01-02 12:56:33, but 2022-01-02 would (assuming a format of yyyy-MM-dd).
I assume it would be slow (but I have not tested it).
Maybe not worth the extra code.
@bersprockets Thanks for the suggestion! Do you know what the advantage of allowing the legacy formatter is? That is, what is a date format that the legacy formatter can handle but the current formatter cannot?
I'm wondering whether there will be a sufficient population of users who want to infer dates in the schema and also use legacy date formats.
cc: @Yaohua628
Do you know what is the advantage of allowing Legacy Formatter?
One benefit of the legacy formatter is that it recognizes some pre-Gregorian leap years (like 1500-02-29) that exist only in the hybrid Julian calendar. Note how schema inference chooses string until you set the legacy parser.
scala> val csvInput = Seq("1425-03-22T00:00:00", "2022-01-01T00:00:00", "1500-02-29T00:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
root
|-- _c0: string (nullable = true)
scala> sql("set spark.sql.legacy.timeParserPolicy=legacy")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
root
|-- _c0: timestamp (nullable = true)
scala>
That, of course, matters only if the application's input comes from legacy systems that still use hybrid Julian, and the input contains pre-Gregorian dates (e.g., for date encoding, which is the only real-world use case I have come across). I would imagine that audience is small and probably getting smaller.
I think you could still make it work, but you would need a new extension of LegacySimpleDateFormatter
By the way, to avoid confusion, I meant the above in the context of inferring dates when using the legacy parser (I realize now that this discussion is happening in reference to code changes in UnivocityParser).
Thanks Bruce! This is great context! This will definitely be necessary if we want to support inference along with legacy date formats. Users on legacy dates will be unaffected by this change. How about we open another ticket for date inference with legacy formats if the demand exists (and merge this PR without legacy date inference support)?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala (outdated review thread, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala (outdated review thread, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala (outdated review thread, resolved)
The logic makes sense in general, but it would be best to have a second look from @bersprockets @MaxGekk @gengliangwang.
One test failed at
I added a new
(force-pushed from df7146e to e1170d0)
This reverts commit e1170d0.
Hi folks, @HyukjinKwon @bersprockets @cloud-fan, thanks for reviewing and for the great suggestions. Is this PR good to go? Thanks!
thanks, merging to master!
@cloud-fan @Jonathancui123 Wouldn't this patch cause correctness issues? This is what I found when working on #37147: the "SPARK-39469: Infer schema for date type" test in CSVSuite highlights the issue when run together with my patch, which attempts to forbid users from falling back to the default parser when the timestamp format is provided, as that could lead to correctness issues. Because "1765-03-28" does not match the timestamp pattern and the column is inferred as TimestampType, it should be returned as
Technically my patch corrects this, but I fixed the tests by enabling the incorrect behaviour with the flag here: a447b08. Can someone take a look and clarify this one?
Maybe it is just the test, so I can update that in my PR, but I would like to clarify the expected behaviour here.
// compatibility.
val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
DateTimeUtils.stringToDate(str).getOrElse(throw e)
DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
I think the issue here is: if the timestamp parsing fails, maybe it's because this is a date, or maybe it's a legacy timestamp format. We need to define the priority here. Since inferDate is opt-in, I think it makes more sense to try parsing as a date first, then the legacy format.
cc @sadikovi
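A hedged sketch of that priority (helper names are illustrative, not Spark's actual API): once strict timestamp parsing fails, try the opt-in date parser first, then the legacy timestamp fallback.

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def fallback(datum: String,
             legacyTimestamp: String => Option[Long]): Option[Any] =
  Try(LocalDate.parse(datum, dateFmt)).toOption // 1) date first (inferDate is opt-in)
    .orElse(legacyTimestamp(datum))             // 2) then the legacy timestamp format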
Just wondering: have all the issues mentioned by @HyukjinKwon in my PR #23202 (comment) been addressed by this PR?
Agreed. We should address the order. Otherwise, it is unclear how to handle the fallback. Fixed here: 10ca4a4.
@sadikovi I've verified on an older version of Spark, prior to this PR: "1765-03-28" in a timestamp column without a user-specified format is parsed as "1765-03-28 00:00:00.0". So the behavior of parsing default-format dates in timestamp columns is not due to this PR.
Prior to changes, in a timestamp column:
After the inferDate PR, in a timestamp column:
After the enableParsingFallbackForDateType PR (#37147), in a timestamp column with
Since default-format dates were previously parsed by the fallback, I thought it was desirable for dates to be parsed in a timestamp column, so I included support for custom-format dates in a timestamp column when
We have two options for target behavior:
OPTION A - Target behavior, in a timestamp column:
OPTION B - Target behavior, in a timestamp column:
@cloud-fan @HyukjinKwon should we go with Option A or Option B?
So am I right that Option A considers the type coercion between timestamps and dates, and Option B does not? I personally prefer Option B, so we can switch to Option A in the future. Once we pick Option A, it's difficult to go back to Option B.
I think my main confusion comes from inferDate=true turning an invalid timestamp value into a date and then returning it as a timestamp; the column should have been a DateType column. @Jonathancui123 Would it be possible to revisit this behaviour? I agree with Wenchen; we may need to decide whether to parse it as a legacy timestamp or infer a date.
@Jonathancui123 I fixed this issue in 10ca4a4 on my PR. Can you review? Thanks.
@sadikovi mind opening a PR?
It is a small change, so I fixed it in my PR #37147.
@sadikovi Thanks for the change! I agree with it and I've left a comment.
…ics of the option in CSV data source

### What changes were proposed in this pull request?

This is a follow-up for #36871. The PR renames `inferDate` to `prefersDate` to avoid confusion when date inference would change the column type and the user meant to infer timestamps. The patch also updates the semantics of the option: `prefersDate` is allowed to be used during schema inference (`inferSchema`) as well as with a user-provided schema, where it could be used as a fallback mechanism when parsing timestamps.

### Why are the changes needed?

Fixes ambiguity when setting `prefersDate` to true and clarifies the semantics of the option.

### Does this PR introduce _any_ user-facing change?

Although it is an option rename, the original PR was merged a few days ago and the config option has not been included in a Spark release.

### How was this patch tested?

I added a unit test for prefersDate = true with a user schema.

Closes #37327 from sadikovi/rename_config.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

Adds an `inferDate` option to CSV options. The option's description is: An error will be thrown if `inferDate` is true when the SQL configuration LegacyTimeParserPolicy is `LEGACY`. This is to avoid incorrect schema inference from legacy time parsers that do not do strict parsing. The `inferDate` option should prevent performance degradation for users who don't opt in.

If `typeSoFar` in `inferField` is Date, Timestamp, or TimestampNTZ, we will first attempt to parse Date and then parse Timestamp/TimestampNTZ. The reason we attempt to parse a date when `typeSoFar` is Timestamp/TimestampNTZ is the case where a column contains a timestamp entry and then a date entry: we should detect both data types and infer the column as a timestamp type.

Example:
Result:

Updates `makeConverter` in `UnivocityParser` to handle date entries in a timestamp-type column, so the above example is parsed properly.

Does this PR introduce any user-facing change?

The new behavior of schema inference when `inferDate = true`:
- If the date format and the timestamp format are identical (e.g. both are yyyy/mm/dd), entries will default to being interpreted as Date
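A hedged usage sketch of the option (assumes spark.implicits._ is in scope; the option was later renamed to prefersDate in the follow-up #37327):

val csvInput = Seq("0,2012-01-01", "1,2021-07-01 15:00:00").toDS()
val df = spark.read
  .options(Map("inferSchema" -> "true", "inferDate" -> "true"))
  .csv(csvInput)
df.printSchema()
// _c1 mixes a date and a timestamp, so it is inferred as TimestampType
// and the bare date is promoted to midnight of that day.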
How was this patch tested?

Unit tests were added to `CSVInferSchemaSuite` and `UnivocityParserSuite`. An end-to-end test was added to `CSVSuite`.

Benchmarks: `inferDate` increases parsing/inference time in general. The impact scales with the number of rows (not the number of columns). For columns of date type (which would be inferred as timestamp when `inferDate = false`), inference and parsing take 30% longer. The performance impact is much greater on columns of timestamp type (taking 30x longer than `inferDate = false`), due to trying each timestamp as a date (and throwing an error) during the inference step.

Number of seconds taken to parse each CSV file with `inferDate = true` and `inferDate = false`: results are the average of 3 trials on the same machine.

Over multiple runs, master branch benchmark times have also shown results slower than `inferDate = false` (although the average is slightly faster). Given the +/- 20% variance between trials, master branch benchmark results are roughly similar to the `inferDate = false` results.