-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date #20621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
2f05ab8
5d60f88
6b56408
6274537
8698f4d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -407,6 +407,34 @@ object PartitioningUtils { | |
| Literal(bigDecimal) | ||
| } | ||
|
|
||
| val dateTry = Try { | ||
| // try and parse the date, if no exception occurs this is a candidate to be resolved as | ||
| // DateType | ||
| DateTimeUtils.getThreadLocalDateFormat.parse(raw) | ||
| // SPARK-23436: Casting the string to date may still return null if a bad Date is provided. | ||
| // This can happen since DateFormat.parse may not use the entire text of the given string: | ||
| // so if there are extra-characters after the date, it returns correctly. | ||
| // We need to check that we can cast the raw string since we later can use Cast to get | ||
| // the partition values with the right DataType (see | ||
| // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning) | ||
| val dateValue = Cast(Literal(raw), DateType).eval() | ||
| // Disallow DateType if the cast returned null | ||
| require(dateValue != null) | ||
| Literal.create(dateValue, DateType) | ||
| } | ||
|
|
||
| val timestampTry = Try { | ||
| val unescapedRaw = unescapePathName(raw) | ||
| // try and parse the date, if no exception occurs this is a candidate to be resolved as | ||
| // TimestampType | ||
| DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we save the parsing here? I think to cast string to
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think so, because in the cast we tolerate various timestamp format which parse doesn't support (please check the comment to
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since this changes the behavior of
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sorry, I am not sure I got 100% your question, may you elaborate it a bit more please?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, you are right. the only change introduced is that some values which were previously wrongly inferred as dates, now will be inferred as strings. Everything else works as before. |
||
| // SPARK-23436: see comment for date | ||
| val timestampValue = Cast(Literal(unescapedRaw), TimestampType, Some(timeZone.getID)).eval() | ||
| // Disallow TimestampType if the cast returned null | ||
| require(timestampValue != null) | ||
| Literal.create(timestampValue, TimestampType) | ||
| } | ||
|
|
||
| if (typeInference) { | ||
| // First tries integral types | ||
| Try(Literal.create(Integer.parseInt(raw), IntegerType)) | ||
|
|
@@ -415,16 +443,8 @@ object PartitioningUtils { | |
| // Then falls back to fractional types | ||
| .orElse(Try(Literal.create(JDouble.parseDouble(raw), DoubleType))) | ||
| // Then falls back to date/timestamp types | ||
| .orElse(Try( | ||
| Literal.create( | ||
| DateTimeUtils.getThreadLocalTimestampFormat(timeZone) | ||
| .parse(unescapePathName(raw)).getTime * 1000L, | ||
| TimestampType))) | ||
| .orElse(Try( | ||
| Literal.create( | ||
| DateTimeUtils.millisToDays( | ||
| DateTimeUtils.getThreadLocalDateFormat.parse(raw).getTime), | ||
| DateType))) | ||
| .orElse(timestampTry) | ||
| .orElse(dateTry) | ||
| // Then falls back to string | ||
| .getOrElse { | ||
| if (raw == DEFAULT_PARTITION_NAME) { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so the root cause is more specific to
SimpleDateFormatbecause it allows invalid dates like2018-01-01-04to be parsed fine ..There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually all the
DateFormat'sparseallow extra-characters after a valid date: (https://docs.oracle.com/javase/7/docs/api/java/text/DateFormat.html#parse(java.lang.String)).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a short-cut? It seems OK to always go to the cast path,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is enough to go always with the cast path, since it allows many format/strings, not allowed by the parse method. Thus I think it not safe to avoid the parse method.