-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-19228][SQL] Introduce tryParseDate method to process csv date,… #20140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… add a type-widening rule in findTightestCommonType between DateType and TimestampType, add java.time.format.DateTimeFormatter to more accurately infer the type of time, add an end-to-end test case and unit test
|
ok to test |
|
Test build #85639 has finished for PR 20140 at commit
|
|
@HyukjinKwon, @gatorsmile could you please help find someone to review this? |
|
|
||
| val isCommentSet = this.comment != '\u0000' | ||
|
|
||
| def dateFormatter: DateTimeFormatter = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it def?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DateTimeFormatter has the disadvantage. It does not implement Serializable in contrast to FastDateFormat. That is why I couldn't make it as a val here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could do this via lazy val
| } | ||
|
|
||
| def timestampFormatter: DateTimeFormatter = { | ||
| DateTimeFormatter.ofPattern(timestampFormat.getPattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind if I ask to elaborate DateTimeFormatter vs FastDateFormat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DateTimeFormatter is a standard time library from java 8. FastDateFormat can't properly parse date and timestamp.
I can create some test cases to prove it, but I need many time for that.
Also, FastDateFormat does not meet the ISO8601: https://en.wikipedia.org/wiki/ISO_8601
Current implementation of CSVInferSchema contains other bugs. For example, test test("Timestamp field types are inferred correctly via custom date format") in class CSVInferSchemaSuite must not pass, because timestampFormat "yyyy-mm" is wrong format for year and month. It should be "yyyy-MM".
It is better to make refactor of date types and change deprecated types on new ones for the whole project.
| case TimestampType => tryParseTimestamp(field, options) | ||
| case DateType => tryParseDate(field, options) | ||
| case TimestampType => | ||
| findTightestCommonType(typeSoFar, tryParseTimestamp(field, options)).getOrElse( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind elaborating why we should find the wider type here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, your question is not really clear for me.
We have to try parse object as DateType first, because date always can be parsed as date and as timestamp (begin of day).
Current implementation of spark ignores dates and it is always parsing them as timestamps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, it wasn't clear why we need findTightestCommonType. I thought case TimestampType => tryParseTimestamp(field, options) will work.
|
ok to test |
|
Test build #87107 has finished for PR 20140 at commit
|
…teTimeFormatter made lazy val
|
Test build #89572 has finished for PR 20140 at commit
|
|
@HyukjinKwon changed as you suggested |
|
ok to test |
|
Test build #89958 has finished for PR 20140 at commit
|
| Some(DecimalType(range + scale, scale)) | ||
| } | ||
| // By design 'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes). | ||
| case (t1: DateType, t2: TimestampType) => Some(TimestampType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do the opposite case too
case (t1: TimestampType, t2: DateType) => Some(TimestampType)
|
|
||
| val isCommentSet = this.comment != '\u0000' | ||
|
|
||
| lazy val dateFormatter: DateTimeFormatter = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@transient lazy val
| // This case infers a custom `dataFormat` is set. | ||
| if ((allCatch opt options.timestampFormat.parse(field)).isDefined) { | ||
| // This case infers a custom `timestampFormat` is set. | ||
| if ((allCatch opt options.timestampFormatter.parse(field)).isDefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we replace it to timestampFormatter in CSV parsing logic too and document it in the migration guide? (e.g., date format is now inferred correctly and also things you mentioned in #20140 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, adding a configuration to control this behaviour looks preferred in this case.
|
Not a big deal but mind fixing the PR title to be complete and fix the PR description as the format indicates? |
… add a type-widening rule in findTightestCommonType between DateType and TimestampType, add java.time.format.DateTimeFormatter to more accurately infer the type of time, add an end-to-end test case and unit test
What changes were proposed in this pull request?
By design 'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes).
But when a date is parsed, an option "dateFormat" is ignored and default date format ("yyyy-MM-dd") is using and the date is parsed as timestamp.
This patch fixes that bug.
For other details, please, read the ticket
https://issues.apache.org/jira/browse/SPARK-19228
How was this patch tested?
Add an end-to-end test case and unit test