
Conversation

Contributor

@Jonathancui123 Jonathancui123 commented Jun 15, 2022

What changes were proposed in this pull request?

  1. Add a new inferDate option to CSV options. The description is:

Whether or not to infer columns that satisfy the dateFormat option as Date. Requires inferSchema to be true. When false, columns with dates will be inferred as String (or as Timestamp if they fit the timestampFormat). Legacy date formats in Timestamp columns cannot be parsed with this option.

An error will be thrown if inferDate is true while the SQL configuration spark.sql.legacy.timeParserPolicy is set to LEGACY. This avoids incorrect schema inference caused by legacy time parsers, which do not do strict parsing.

The inferDate option should prevent performance degradation for users who don't opt-in.

  2. Modify inferField in CSVInferSchema.scala to include the Date type.

If typeSoFar in inferField is Date, Timestamp, or TimestampNTZ, we will first attempt to parse the entry as Date and then as Timestamp/TimestampNTZ. We attempt date parsing even when typeSoFar is Timestamp/TimestampNTZ because a column may contain a timestamp entry followed by a date entry; we should detect both data types and infer the column as timestamp type.

Example:

Seq("2010|10|10", "2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("inferDate", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd")
  .csv("/tmp/foo")
  .printSchema()

Result:

root
 |-- _c0: timestamp (nullable = true)

  3. Also modify makeConverter in UnivocityParser to handle Date type entries in a timestamp type column, so that the above example parses properly.

Does this PR introduce any user-facing change?

The new behavior of schema inference when inferDate = true:

  1. If a column contains only dates, it should be of "date" type in the inferred schema.
     If the date format and the timestamp format are identical (e.g. both are yyyy/MM/dd), entries will default to being interpreted as Date.
  2. If a column contains both dates and timestamps, it should be of "timestamp" type in the inferred schema.
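The two rules above amount to a small type-merge relation. A minimal, self-contained sketch of the idea (the names here are illustrative only, not Spark's internal API):

```scala
// Illustrative sketch of the inference merge rule described above.
// These names are hypothetical, not Spark's internal types.
sealed trait InferredType
case object DateT      extends InferredType
case object TimestampT extends InferredType
case object StringT    extends InferredType

// Merging the types seen so far in a column: a column of only dates
// stays Date; a mix of dates and timestamps widens to Timestamp;
// anything else falls back to String.
def mergeTypes(a: InferredType, b: InferredType): InferredType = (a, b) match {
  case (DateT, DateT)                            => DateT
  case (TimestampT, TimestampT)                  => TimestampT
  case (DateT, TimestampT) | (TimestampT, DateT) => TimestampT
  case _                                         => StringT
}
```

For example, mergeTypes(DateT, TimestampT) widens to TimestampT, which is why a column with one date and one timestamp entry is inferred as timestamp.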

How was this patch tested?

Unit tests were added to CSVInferSchemaSuite and UnivocityParserSuite. An end-to-end test was added to CSVSuite.

Benchmarks:

inferDate increases parsing/inference time in general. The impact scales with the number of rows (not the number of columns). For columns of date type (which would be inferred as timestamp when inferDate=false), inference and parsing take about 30% longer. The performance impact is much greater on columns of timestamp type (about 30x slower than inferDate=false), due to attempting to parse each timestamp as a date (and throwing an error) during the inference step.

Number of seconds taken to parse each CSV file with inferDate=true and inferDate=false:

| Input | inferDate=false | inferDate=true | master branch |
|---|---|---|---|
| Small file (<100 rows/cols), mixed data types | 0.32 | 0.33 | – |
| 100K rows, 4 columns, mixed data types | 0.70 | 2.80 | 0.70 |
| 20K columns, 4 rows, mixed data types | 16.32 | 15.90 | 13.5 |
| Large file, only date type | 2.15 | 3.70 | 2.10 |
| Large file, only timestamp type | 2.60 | 77.00 | 2.30 |

Results are the average of 3 trials with the same machine.

Over multiple runs, master branch benchmarks have also occasionally been slower than inferDate=false (although on average they are slightly faster). Given the ±20% variance between trials, master branch results are roughly comparable to inferDate=false results.

@github-actions github-actions bot added the SQL label Jun 15, 2022
@Jonathancui123 Jonathancui123 changed the title infer date type in csv [WIP] infer date type in csv Jun 15, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@Jonathancui123 Jonathancui123 changed the title [WIP] infer date type in csv [WIP] Infer date type for CSV schema inference Jun 15, 2022
@Jonathancui123 Jonathancui123 changed the title [WIP] Infer date type for CSV schema inference [WIP][SPARK-39469] Infer date type for CSV schema inference Jun 15, 2022
@Jonathancui123 Jonathancui123 changed the title [WIP][SPARK-39469] Infer date type for CSV schema inference [WIP][SPARK-39469][SQL] Infer date type for CSV schema inference Jun 15, 2022
@Jonathancui123 Jonathancui123 force-pushed the SPARK-39469-date-infer branch from 6120782 to 1630370 on June 15, 2022 at 22:59
@Jonathancui123 Jonathancui123 changed the title [WIP][SPARK-39469][SQL] Infer date type for CSV schema inference [SPARK-39469][SQL] Infer date type for CSV schema inference Jun 16, 2022
@Jonathancui123 Jonathancui123 marked this pull request as ready for review June 16, 2022 03:53
Member

One test we might need would be:

timestampFormat -> "dd/MM/yyyy HH:mm" and dateFormat -> "dd/MM/yyyy", to make sure timestamps are not parsed as date types without conflicting.

Contributor

@bersprockets bersprockets Jun 16, 2022

to make sure timestamps are not parsed as date types without conflicting.

That's actually what happens:

Before this PR:

scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: timestamp]

scala> df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- _c1: timestamp (nullable = true)

scala> 

After this PR:

scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: date]

scala> df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- _c1: date (nullable = true)

scala>

It looks like some tests fail too, like CSVInferSchemaSuite and CSVv1Suite, and possibly others (I ran these two suites on my laptop). For some reason, the GitHub Actions didn't run tests for this PR. Maybe @Jonathancui123 needs to turn them on in his fork?

We should probably 1. add either SQL configuration or an option e.g., infersDate

I think you would need something like that: when set, the date formatter could use the slower, more strict method of parsing (so "2012-01-01 12:00:00" wouldn't parse as a date).

Edit: To do strict parsing, one might need to use ParsePosition and check that the whole date/time string was consumed. Even after setting lenient=false, SimpleDateFormat.parse didn't complain about extra characters that weren't consumed.
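A minimal stand-alone sketch of that ParsePosition check, using a plain JDK SimpleDateFormat with an assumed yyyy-MM-dd pattern (this is not the PR's actual code):

```scala
import java.text.{ParsePosition, SimpleDateFormat}

val sdf = new SimpleDateFormat("yyyy-MM-dd")
sdf.setLenient(false)

// Strict parse: succeed only if the pattern consumed the entire input.
// setLenient(false) alone does not reject trailing characters, so we
// compare the final parse position against the string length.
def strictParseDate(s: String): Option[java.util.Date] = {
  val pp = new ParsePosition(0)
  val res = sdf.parse(s, pp)
  if (res == null || pp.getIndex != s.length) None else Some(res)
}
```

With this check, "2012-01-01" parses, but "2012-01-01 12:00:00" and "2001-09-08-randomtext" are rejected because characters remain after the date portion.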

Contributor Author

I addressed inference mistakes in the following code snippet and comment

Member

Hm, I don't get this case. If the schema is TimestampType, the output here should always be timestamps.

Contributor Author

Consider a column with a DateType entry followed by a TimestampType entry. We would expect this column to be inferred as a TimestampType column.

Thus, when parsing the column, the timestamp converter will fail on the date entry, so we need to try converting it with the date converter. If both converters fail, we throw an error.

@HyukjinKwon
Member

HyukjinKwon commented Jun 16, 2022

Took a cursory look. @MaxGekk do you remember the context here? I remember we didn't merge this change because the legacy fast format parser (Java 8 time libraries) did not support the exact matching (e.g., "yyyy" parses "2000-10-12" as "2000")

@HyukjinKwon
Member

cc @bersprockets too if you find some time to review.

@bersprockets
Contributor

@Jonathancui123 You probably want to turn on github actions so tests will run.

From https://spark.apache.org/contributing.html:

Go to “Actions” tab on your forked repository and enable “Build and test” and “Report test results” workflows

@Jonathancui123 Jonathancui123 force-pushed the SPARK-39469-date-infer branch from dcbe9e8 to 1cd55c7 on June 22, 2022 at 00:28
@github-actions github-actions bot added the DOCS label Jun 22, 2022
@Jonathancui123
Contributor Author

Jonathancui123 commented Jun 22, 2022

I've added an inferDate flag that addresses the comments above:

  1. Performance: Logic for date inference during schema inference is guarded behind the flag.
  2. Parsing behavior with inferDate=false: Behavior is identical to before these changes.
  3. Parsing behavior with inferDate=true: In timestamp columns, we will attempt to parse as timestamp first, and then as date if timestamp parsing fails.

The change to parsing behavior is necessary because:

We can have columns of Date and Timestamp entries inferred as Timestamp. Consider a column with a DateType entry followed by a TimestampType entry; we would expect this column to be inferred as a TimestampType column. Thus, when parsing the column, the timestamp converter will fail on the date entry, so we need to try converting it with the date converter. If both converters fail, we throw an error.
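A stand-alone sketch of that converter fallback using plain java.time (the formats here are assumed for illustration; the actual PR uses Spark's internal formatters):

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical formats for illustration.
val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

// For a column inferred as timestamp: try the timestamp converter first,
// then fall back to the date converter, promoting a bare date to midnight
// so the column stays timestamp-typed. If both attempts fail, the
// DateTimeParseException from the date attempt propagates as the error.
def toTimestamp(s: String): LocalDateTime =
  try LocalDateTime.parse(s, tsFmt)
  catch {
    case _: Exception => LocalDate.parse(s, dateFmt).atStartOfDay()
  }
```

Under these assumed formats, "2012-01-01 12:00:00" parses via the timestamp path and "2012-01-01" via the date fallback.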

Comment on lines 2757 to 2836
Contributor Author

@Jonathancui123 Jonathancui123 Jun 22, 2022

One test we might need would be timestampFormat -> "dd/MM/yyyy HH:mm" and dateFormat -> "dd/MM/yyyy", to make sure timestamps are not parsed as date types without conflicting.

This test uses:

      "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
      "dateFormat" -> "yyyy-MM-dd",

This e2e test ensures that our DateFormatter uses strict parsing. We will not infer timestamp columns as date columns even when the dateFormat is a prefix of the timestampFormat.
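A tiny sketch of why a prefix date format does not swallow full timestamps under strict parsing, using plain java.time with assumed formats (not the suite's code): java.time rejects input that leaves unparsed trailing text.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Assumed for illustration: dateFormat is a prefix of timestampFormat.
val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

// LocalDate.parse requires the whole string to be consumed, so a full
// timestamp string is rejected even though its prefix matches dateFmt.
def looksLikeDate(s: String): Boolean =
  try { LocalDate.parse(s, dateFmt); true }
  catch { case _: Exception => false }
```

So "2022-06-15" is accepted as a date, while "2022-06-15T10:30" fails with unparsed text at index 10 and is left for the timestamp path.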

Thank you for the review! @HyukjinKwon @bersprockets

Comment on lines +226 to +227
Contributor Author

@Jonathancui123 Jonathancui123 Jun 22, 2022

We do not use the legacy DateFormatter here, to avoid parsing timestamps with invalid suffixes. We want to throw an error when invalid timestamps are given.

e.g. The legacy DateFormatter will parse the following string without throwing an error:
dateFormat: yyyy-mm-dd
string: 2001-09-08-randomtext

Member

Yeah, I think it makes sense to throw an exception, or to disallow this, when the legacy parser is used (since it doesn't care about suffixes).

Contributor

We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes.

I think you could still make it work, but you would need a new extension of LegacySimpleDateFormatter (maybe LegacyStrictSimpleDateFormatter), with an override like this:

  def parseToDate(s: String): Date = {
    val pp = new ParsePosition(0)
    val res = sdf.parse(s, pp)
    if (s.length != pp.getIndex) {
      throw new RuntimeException(s"$s is not a date")
    }
    res
  }

2001-09-08-randomtext would not parse, and neither would 2022-01-02 12:56:33, but 2022-01-02 would (assuming a format of yyyy-MM-dd).

I assume it would be slow (but I have not tested it).

Maybe not worth the extra code.

Contributor Author

@bersprockets Thanks for the suggestion! Do you know what the advantage of allowing the legacy formatter is? I.e., what is a date format that the legacy formatter can handle but the current formatter cannot?

I'm wondering if there will be a sufficient population of users who want to infer dates in the schema and also use legacy date formats.

cc: @Yaohua628

Contributor

Do you know what is the advantage of allowing Legacy Formatter?

One benefit of the legacy formatter is that it recognizes some pre-Gregorian leap years (like 1500-02-29) that exist only in the hybrid Julian calendar. Note how schema inference chooses string until you set the legacy parser.

scala> val csvInput = Seq("1425-03-22T00:00:00", "2022-01-01T00:00:00", "1500-02-29T00:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]

scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
root
 |-- _c0: string (nullable = true)

scala> sql("set spark.sql.legacy.timeParserPolicy=legacy")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.read.options(Map("inferSchema" -> "true", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss")).csv(csvInput).printSchema
root
 |-- _c0: timestamp (nullable = true)

scala> 

That, of course, matters only if the application's input comes from legacy systems that still use hybrid Julian, and the input contains pre-Gregorian dates (e.g., for date encoding, which is the only real-world use case I have come across). I would imagine that audience is small and probably getting smaller.

Contributor

I think you could still make it work, but you would need a new extension of LegacySimpleDateFormatter

By the way, to avoid confusion, I meant the above in the context of inferring dates when using the legacy parser (I realize now that this discussion is happening in reference to code changes in UnivocityParser).

Contributor Author

Thanks Bruce! This is great context! This will definitely be necessary if we want to support inference along with legacy date formats. Users on legacy dates will be unaffected by this change, so how about we open another ticket for date inference with legacy formats if the demand exists (and merge this PR without legacy date inference support)?

@HyukjinKwon
Member

The logic makes sense in general, but it would be best to have a second look from @bersprockets @MaxGekk @gengliangwang.

@HyukjinKwon
Member

One test failed at org.apache.spark.sql.execution.datasources.csv.CSVLegacyTimeParserSuite

[info] - SPARK-39469: Infer schema for date type *** FAILED *** (106 milliseconds)
[info]   StructType(StructField(date,DateType,true), StructField(timestamp-date,DateType,true), StructField(date-timestamp,DateType,true)) did not equal StructType(StructField(date,DateType,true), StructField(timestamp-date,TimestampType,true), StructField(date-timestamp,TimestampType,true)) (CSVSuite.scala:2773)
[info]   Analysis:
[info]   StructType(1: StructField(timestamp-date,DateType,true) -> StructField(timestamp-date,TimestampType,true), 2: StructField(date-timestamp,DateType,true) -> StructField(date-timestamp,TimestampType,true))
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.execution.datasources.csv.CSVSuite.$anonfun$new$379(CSVSuite.scala:2773)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)

@Jonathancui123
Contributor Author

I added a new QueryExecutionError called inferDateWithLegacyTimeParserError that is thrown when inferDate=true and the SQL configuration spark.sql.legacy.timeParserPolicy is LEGACY. This prevents the legacy parser from being used with date schema inference and fixes the failing test.

@Jonathancui123 Jonathancui123 force-pushed the SPARK-39469-date-infer branch from df7146e to e1170d0 on July 12, 2022 at 18:12
@Yaohua628
Contributor

Hi folks, @HyukjinKwon @bersprockets @cloud-fan, thanks for reviewing and some great suggestions, is this PR good to go? Thanks!

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c2536a7 Jul 21, 2022
@sadikovi
Contributor

sadikovi commented Jul 22, 2022

@cloud-fan @Jonathancui123 Wouldn't this patch cause correctness issues? This is what I found when working on #37147:

The "SPARK-39469: Infer schema for date type" test in CSVSuite highlights the issue when run together with my patch, which attempts to forbid users from falling back to the default parser when the timestamp format is provided, as that could lead to correctness issues.

Because "1765-03-28" does not match the timestamp pattern and the column is inferred as TimestampType, it should be returned as null. However, in the test it is returned as 1765-03-28 00:00:00.0. This is at the very least confusing: inferDate should only affect dates, not timestamp columns.

Technically my patch corrects this but I fixed the tests by enabling the incorrect behaviour with the flag here: a447b08.

Can someone take a look and clarify this one?

@sadikovi
Contributor

Maybe it is just the test, so I can update that in my PR, but I would like to clarify the expected behaviour here.

// compatibility.
val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
DateTimeUtils.stringToDate(str).getOrElse(throw e)
DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
Contributor

I think the issue here is, if the timestamp parsing fails, maybe it's because this is a date, or maybe it's a legacy timestamp format. We need to define the priority here. Since inferDate is opt-in, I think it makes more sense to try parsing as date first, then the legacy format.

cc @sadikovi

Member

@MaxGekk MaxGekk Jul 22, 2022

Just wondering whether all the issues mentioned by @HyukjinKwon in my PR #23202 (comment) have been addressed by this PR.

Contributor

@sadikovi sadikovi Jul 24, 2022

Agreed. We should address the order. Otherwise, it is unclear how to handle fallback. Fixed here: 10ca4a4.

@Jonathancui123
Contributor Author

Jonathancui123 commented Jul 22, 2022

Because "1765-03-28" does not match timestamp pattern and the column is inferred as TimestampType, it should be returned as null. However, in the test it is returned as 1765-03-28 00:00:00.0. This is at very least confusing - inferDate should only affect dates, not timestamp columns.

@sadikovi I've verified on an older version of Spark, prior to this PR: "1765-03-28" in a timestamp column without a user-specified format is parsed as "1765-03-28 00:00:00.0". So the behavior of parsing the default date format in timestamp columns is not due to this PR.

Prior to changes, in a timestamp column:

  • custom format date: null
  • default format date: Parsed by fallback

After inferDate PR, in a timestamp column:

  • custom format date: Parsed if inferDate is true, otherwise null
  • default format date: Parsed by fallback

After enableParsingFallbackForDateType PR (#37147), in a timestamp column with enableParsingFallbackForDateType=false:

  • custom format date: null
  • default format date: null

Since the default format date was previously parsed by fallback, I thought it was desirable for dates to be parsed in a timestamp column. So I included support for custom format dates in a timestamp column when inferDate=true.

We have two options for target behavior:

OPTION A - Target behavior, in a timestamp column:

OPTION B - Target behavior, in a timestamp column:
With enableParsingFallbackForDateType=false:

  • custom format date: null
  • default format date: null

With enableParsingFallbackForDateType=true:

  • custom format date: null
  • default format date: parsed!

(and remove the redundant dateFormatter.parse behavior for inferDate=true)

@cloud-fan @HyukjinKwon should we go with Option A or Option B?

@HyukjinKwon
Member

So am I right that Option A considers the type coercion between timestamps and dates, and Option B does not? I personally prefer Option B, so we can switch to Option A in the future. Once we pick Option A, it's difficult to go back to Option B.

@sadikovi
Contributor

sadikovi commented Jul 24, 2022

I think my main confusion comes from inferDate=true turning an invalid timestamp value into a date and then returning it as a timestamp; the column should have been a DateType column.

@Jonathancui123 Would it be possible to revisit this behaviour? I agree with Wenchen, we may need to decide whether to parse it as legacy timestamp or inferDate.

@sadikovi
Contributor

sadikovi commented Jul 24, 2022

@Jonathancui123 I fixed this issue in 10ca4a4 on my PR. Can you review? Thanks.

@HyukjinKwon
Member

@sadikovi mind opening a PR?

@sadikovi
Contributor

sadikovi commented Jul 25, 2022

It is a small change so I fixed it in my PR #37147.

@Jonathancui123
Contributor Author

Jonathancui123 commented Jul 25, 2022

@Jonathancui123 I fixed this issue in 10ca4a4 on my PR. Can you review? Thanks.

@sadikovi Thanks for the change! I agree with it, and I've left a comment.

HyukjinKwon pushed a commit that referenced this pull request Aug 3, 2022
…ics of the option in CSV data source

### What changes were proposed in this pull request?

This is a follow-up for #36871.

This PR renames `inferDate` to `prefersDate` to avoid the confusion that arises when date inference changes the column type even though the user meant to infer timestamps.

The patch also updates the semantics of the option: `prefersDate` is allowed during schema inference (`inferSchema`) as well as with a user-provided schema, where it can serve as a fallback mechanism when parsing timestamps.

### Why are the changes needed?

Fixes ambiguity when setting `prefersDate` to true and clarifies semantics of the option.

### Does this PR introduce _any_ user-facing change?

Although it is an option rename, the original PR was merged a few days ago and the config option has not been included in a Spark release.

### How was this patch tested?

I added a unit test for prefersDate = true with a user schema.

Closes #37327 from sadikovi/rename_config.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
