Skip to content

Conversation

@sergey-rubtsov
Copy link

… add a type-widening rule in findTightestCommonType between DateType and TimestampType, add java.time.format.DateTimeFormatter to more accurately infer the type of time, add an end-to-end test case and unit test

What changes were proposed in this pull request?

By design 'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes).
But when a date is parsed, an option "dateFormat" is ignored and default date format ("yyyy-MM-dd") is using and the date is parsed as timestamp.

This patch fixes that bug.

For other details, please, read the ticket
https://issues.apache.org/jira/browse/SPARK-19228

How was this patch tested?

Add an end-to-end test case and unit test

… add a type-widening rule in findTightestCommonType between DateType and TimestampType, add java.time.format.DateTimeFormatter to more accurately infer the type of time, add an end-to-end test case and unit test
@vanzin
Copy link
Contributor

vanzin commented Jan 3, 2018

ok to test

@SparkQA
Copy link

SparkQA commented Jan 3, 2018

Test build #85639 has finished for PR 20140 at commit d2ed686.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sergey-rubtsov
Copy link
Author

@HyukjinKwon, @gatorsmile could you please help find someone to review this?


val isCommentSet = this.comment != '\u0000'

def dateFormatter: DateTimeFormatter = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it def?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DateTimeFormatter has the disadvantage. It does not implement Serializable in contrast to FastDateFormat. That is why I couldn't make it as a val here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could do this via lazy val

}

def timestampFormatter: DateTimeFormatter = {
DateTimeFormatter.ofPattern(timestampFormat.getPattern)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind if I ask to elaborate DateTimeFormatter vs FastDateFormat?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DateTimeFormatter is a standard time library from java 8. FastDateFormat can't properly parse date and timestamp.

I can create some test cases to prove it, but I need many time for that.

Also, FastDateFormat does not meet the ISO8601: https://en.wikipedia.org/wiki/ISO_8601
Current implementation of CSVInferSchema contains other bugs. For example, test test("Timestamp field types are inferred correctly via custom date format") in class CSVInferSchemaSuite must not pass, because timestampFormat "yyyy-mm" is wrong format for year and month. It should be "yyyy-MM".
It is better to make refactor of date types and change deprecated types on new ones for the whole project.

case TimestampType => tryParseTimestamp(field, options)
case DateType => tryParseDate(field, options)
case TimestampType =>
findTightestCommonType(typeSoFar, tryParseTimestamp(field, options)).getOrElse(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind elaborating why we should find the wider type here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, your question is not really clear for me.
We have to try parse object as DateType first, because date always can be parsed as date and as timestamp (begin of day).
Current implementation of spark ignores dates and it is always parsing them as timestamps

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it wasn't clear why we need findTightestCommonType. I thought case TimestampType => tryParseTimestamp(field, options) will work.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Feb 6, 2018

Test build #87107 has finished for PR 20140 at commit d2ed686.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89572 has finished for PR 20140 at commit 84b236a.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sergey-rubtsov
Copy link
Author

@HyukjinKwon changed as you suggested

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Apr 28, 2018

Test build #89958 has finished for PR 20140 at commit 84b236a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Some(DecimalType(range + scale, scale))
}
// By design 'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes).
case (t1: DateType, t2: TimestampType) => Some(TimestampType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do the opposite case too

case (t1: TimestampType, t2: DateType) => Some(TimestampType)


val isCommentSet = this.comment != '\u0000'

lazy val dateFormatter: DateTimeFormatter = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@transient lazy val

// This case infers a custom `dataFormat` is set.
if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
// This case infers a custom `timestampFormat` is set.
if ((allCatch opt options.timestampFormatter.parse(field)).isDefined) {
Copy link
Member

@HyukjinKwon HyukjinKwon Apr 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we replace it to timestampFormatter in CSV parsing logic too and document it in the migration guide? (e.g., date format is now inferred correctly and also things you mentioned in #20140 (comment))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, adding a configuration to control this behaviour looks preferred in this case.

@HyukjinKwon
Copy link
Member

Not a big deal but mind fixing the PR title to be complete and fix the PR description as the format indicates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants