[SPARK-19228][SQL] Introduce tryParseDate method to process csv date,… #20140

sergey-rubtsov · 2018-01-03T11:23:44Z

… add a type-widening rule in findTightestCommonType between DateType and TimestampType, add java.time.format.DateTimeFormatter to more accurately infer the type of time, add an end-to-end test case and unit test

What changes were proposed in this pull request?

By design, 'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes).
However, when a date is parsed, the "dateFormat" option is ignored: the default date format ("yyyy-MM-dd") is used and the date is parsed as a timestamp.

This patch fixes that bug.

For further details, please see the ticket:
https://issues.apache.org/jira/browse/SPARK-19228

How was this patch tested?

Added an end-to-end test case and a unit test.

vanzin · 2018-01-03T18:02:50Z

ok to test

SparkQA · 2018-01-03T20:57:44Z

Test build #85639 has finished for PR 20140 at commit d2ed686.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sergey-rubtsov · 2018-01-15T10:47:02Z

@HyukjinKwon, @gatorsmile could you please help find someone to review this?

HyukjinKwon · 2018-01-18T03:48:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala


  val isCommentSet = this.comment != '\u0000'

+  def dateFormatter: DateTimeFormatter = {


Why is it def?

DateTimeFormatter has a disadvantage: unlike FastDateFormat, it does not implement Serializable. That is why I couldn't make it a val here.

I think you could do this via lazy val

HyukjinKwon · 2018-01-18T03:50:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala

+  }
+
+  def timestampFormatter: DateTimeFormatter = {
+    DateTimeFormatter.ofPattern(timestampFormat.getPattern)


Mind if I ask you to elaborate on DateTimeFormatter vs FastDateFormat?

DateTimeFormatter is part of the standard java.time library introduced in Java 8. FastDateFormat can't properly parse dates and timestamps.

I can create some test cases to demonstrate this, but that will take a fair amount of time.

Also, FastDateFormat does not conform to ISO 8601: https://en.wikipedia.org/wiki/ISO_8601
The current implementation of CSVInferSchema contains other bugs. For example, the test test("Timestamp field types are inferred correctly via custom date format") in class CSVInferSchemaSuite should not pass, because the timestampFormat "yyyy-mm" is the wrong pattern for year and month ("mm" means minute-of-hour). It should be "yyyy-MM".
It would be better to refactor the date handling and replace the deprecated types with the new ones across the whole project.
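To illustrate the pattern-letter point with java.time only: in `DateTimeFormatter` patterns, `MM` is month-of-year and `mm` is minute-of-hour. The following is a minimal sketch, not code from this PR; the object name `PatternLetters`, the helper `parseYearMonth`, and the sample value are assumptions made for illustration. It shows that a year-month parse accepts `yyyy-MM` and rejects `yyyy-mm`, because the latter never produces a month field.

```scala
import java.time.YearMonth
import java.time.format.DateTimeFormatter
import scala.util.Try

object PatternLetters {
  // Parse a year-month value with the given pattern. Returns None when the
  // pattern is invalid, the text does not match, or the parsed fields do not
  // contain both a year and a month.
  def parseYearMonth(value: String, pattern: String): Option[YearMonth] =
    Try(YearMonth.parse(value, DateTimeFormatter.ofPattern(pattern))).toOption
}
```

With this helper, `PatternLetters.parseYearMonth("2015-08", "yyyy-MM")` yields a value, while the same call with `"yyyy-mm"` yields `None`, since "08" is read as minutes. How FastDateFormat treats the same pattern is not shown here.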

HyukjinKwon · 2018-01-18T03:59:38Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

-        case TimestampType => tryParseTimestamp(field, options)
+        case DateType => tryParseDate(field, options)
+        case TimestampType =>
+          findTightestCommonType(typeSoFar, tryParseTimestamp(field, options)).getOrElse(


Mind elaborating why we should find the wider type here?

Sorry, your question is not really clear to me.
We have to try parsing the value as DateType first, because a date can always be parsed both as a date and as a timestamp (start of day).
The current Spark implementation ignores dates and always parses them as timestamps.

I mean, it wasn't clear why we need findTightestCommonType. I thought case TimestampType => tryParseTimestamp(field, options) would work.
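The ordering discussed in this thread (try the narrower date parse first, fall back to timestamp, then merge with the type inferred so far) can be sketched independently of Spark. Everything below is a simplified assumption for illustration: `FieldType`, `Formats`, `widen`, `inferField`, and `inferColumn` are hypothetical names, not Spark's `DataType` classes or the PR's actual code, and `LocalDate`/`LocalDateTime` stand in for whatever parsing Spark performs.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object InferenceSketch {
  // Hypothetical, simplified stand-ins for Spark's DataType hierarchy.
  sealed trait FieldType
  case object NullT extends FieldType
  case object DateT extends FieldType
  case object TimestampT extends FieldType
  case object StringT extends FieldType

  final case class Formats(date: DateTimeFormatter, timestamp: DateTimeFormatter)

  // Try the narrower type first; fall through to the next candidate on failure.
  def tryParseDate(field: String, f: Formats): FieldType =
    if (Try(LocalDate.parse(field, f.date)).isSuccess) DateT
    else tryParseTimestamp(field, f)

  def tryParseTimestamp(field: String, f: Formats): FieldType =
    if (Try(LocalDateTime.parse(field, f.timestamp)).isSuccess) TimestampT
    else StringT

  // Merge the running column type with the type of a new value.
  def widen(a: FieldType, b: FieldType): FieldType = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, x) => x
    case (x, NullT) => x
    case (DateT, TimestampT) | (TimestampT, DateT) => TimestampT
    case _ => StringT
  }

  def inferField(typeSoFar: FieldType, field: String, f: Formats): FieldType =
    typeSoFar match {
      case StringT => StringT // already the widest type in this sketch
      case _ => widen(typeSoFar, tryParseDate(field, f))
    }

  def inferColumn(values: Seq[String], f: Formats): FieldType =
    values.foldLeft(NullT: FieldType)((t, v) => inferField(t, v, f))
}
```

In this sketch the widening step is what keeps a column at `TimestampT` once any timestamp has been seen, even if later values are plain dates; without it, the result would depend on the order of the rows.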

HyukjinKwon · 2018-02-06T11:44:26Z

ok to test

SparkQA · 2018-02-06T14:02:11Z

Test build #87107 has finished for PR 20140 at commit d2ed686.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

[SPARK-19228][SQL] refactor tryParseDate method after code review, DateTimeFormatter made lazy val (84b236a)

SparkQA · 2018-04-19T16:47:59Z

Test build #89572 has finished for PR 20140 at commit 84b236a.

This patch fails from timeout after a configured wait of `300m`.
This patch merges cleanly.
This patch adds no public classes.

sergey-rubtsov · 2018-04-19T18:41:30Z

@HyukjinKwon changed as you suggested

HyukjinKwon · 2018-04-28T13:50:30Z

ok to test

SparkQA · 2018-04-28T17:27:46Z

Test build #89958 has finished for PR 20140 at commit 84b236a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-04-29T03:40:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

        Some(DecimalType(range + scale, scale))
      }
+    // By design 'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes).
+    case (t1: DateType, t2: TimestampType) => Some(TimestampType)


I think we should do the opposite case too

case (t1: TimestampType, t2: DateType) => Some(TimestampType)
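One way to see why both orderings can safely widen to the timestamp type: a day count that fits in 4 bytes always maps to a microsecond count that fits in 8 bytes, while the reverse direction drops the time of day. The sketch below assumes a days-since-epoch and microseconds-since-epoch representation with midnight UTC as the start of day; the UTC choice and the names `WideningSketch`, `toEpochDays`, `daysToMicros`, and `microsToDays` are assumptions for illustration, not Spark's conversion code.

```scala
import java.time.LocalDate

object WideningSketch {
  private val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // Days since 1970-01-01, which fits in a 4-byte Int for practical dates.
  def toEpochDays(date: LocalDate): Int = Math.toIntExact(date.toEpochDay)

  // Widen a day count to microseconds since the epoch (an 8-byte Long),
  // taking midnight UTC as the start of day. The UTC choice is an assumption.
  def daysToMicros(days: Int): Long = days * MicrosPerDay

  // The reverse direction discards the time of day, so it is lossy in general.
  def microsToDays(micros: Long): Int =
    Math.toIntExact(Math.floorDiv(micros, MicrosPerDay))
}
```

Converting a date to microseconds and back returns the same day, whereas a timestamp with a non-zero time of day cannot be recovered from its day count. That asymmetry is why the common type is the timestamp regardless of which side the date appears on.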

HyukjinKwon · 2018-04-29T03:42:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala


  val isCommentSet = this.comment != '\u0000'

+  lazy val dateFormatter: DateTimeFormatter = {


@transient lazy val
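The idea behind this suggestion can be sketched outside Spark. `DateTimeFormatter` does not implement `Serializable`, so a serializable options class can keep only the pattern string and rebuild the formatter on demand. The class `FormatOptions` and the helper `roundTrip` below are hypothetical stand-ins, not the real `CSVOptions`; plain Java serialization is used as an assumed approximation of what happens when options are shipped to executors.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object SerializationSketch {
  // Hypothetical stand-in for CSVOptions: only the pattern string is serialized.
  class FormatOptions(val datePattern: String) extends java.io.Serializable {
    // Excluded from the serialized form and rebuilt lazily on first access,
    // including the first access after deserialization.
    @transient lazy val dateFormatter: DateTimeFormatter =
      DateTimeFormatter.ofPattern(datePattern)
  }

  // Serialize a value to bytes and read it back.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val options = new FormatOptions("dd/MM/yyyy")
    options.dateFormatter // force initialization before serializing
    val copy = roundTrip(options)
    println(LocalDate.parse("03/01/2018", copy.dateFormatter))
  }
}
```

Compared with a `def`, this builds the formatter at most once per deserialized instance instead of on every call. A plain `lazy val` without `@transient` would try to serialize the formatter once it has been initialized.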

HyukjinKwon · 2018-04-29T03:44:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

-    // This case infers a custom `dataFormat` is set.
-    if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
+    // This case infers a custom `timestampFormat` is set.
+    if ((allCatch opt options.timestampFormatter.parse(field)).isDefined) {


Should we also replace it with timestampFormatter in the CSV parsing logic and document it in the migration guide? (e.g., the date format is now inferred correctly, plus the things you mentioned in #20140 (comment))

Adding a configuration to control this behaviour is probably preferable in this case.

HyukjinKwon · 2018-04-29T03:47:19Z

Not a big deal, but would you mind completing the PR title and fixing the PR description to follow the template?


sergey-rubtsov mentioned this pull request May 18, 2018

[SPARK-19228][SQL] Migrate on Java 8 time from FastDateFormat for meet the ISO8601 #21363

Closed

sergey-rubtsov closed this May 18, 2018

