[SPARK-29904][SQL][2.4] Parse timestamps in microsecond precision by JSON/CSV datasources #26507
Conversation
@cloud-fan Please, take a look at this.
```scala
class MicrosCalendar(tz: TimeZone) extends GregorianCalendar(tz, Locale.US) {
  def getMicros(): SQLTimestamp = {
```
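The class above is the heart of the change. A minimal sketch of the idea under discussion, assuming a `digitsInFraction` constructor parameter (that name, and `MicrosCalendarSketch`, are illustrative and not necessarily the PR's actual code):

```scala
import java.util.{Calendar, GregorianCalendar, Locale, TimeZone}

// Sketch only: subclass GregorianCalendar to reach the protected `fields` array,
// where FastDateFormat leaves the raw, non-normalized seconds fraction as an int,
// and rescale it to microseconds using the number of digits in the pattern.
class MicrosCalendarSketch(tz: TimeZone, digitsInFraction: Int)
    extends GregorianCalendar(tz, Locale.US) {
  def getMicros(): Long = {
    // For input ".1234" parsed with pattern "SSSS", fields(MILLISECOND) holds 1234.
    val fraction = fields(Calendar.MILLISECOND).toLong
    // Rescale to microseconds: 1234 * 1000000 / 10^4 = 123400.
    fraction * 1000000L / math.pow(10, digitsInFraction).toLong
  }
}
```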
Can we have some comments to explain the behavior? Seems to me it's:
- for `.1`, it's 1000 microseconds
- for `.1234`, it's 1234 microseconds
- for `.1234567`, it's 123456 microseconds
Do we have a simple rule? The rule for intervals is pretty simple: add `0` at the end until the seconds fraction has 9 digits, then parse the 9 digits as nanoseconds.
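For reference, a minimal sketch of that interval rule, assuming the fraction arrives as a plain digit string (the helper name is made up for illustration):

```scala
// Interval rule sketch: right-pad the seconds fraction to 9 digits and read it as nanoseconds.
// "1"       -> "100000000" -> 100000000 ns (0.1 s)
// "1234567" -> "123456700" -> 123456700 ns
def fractionToNanos(fraction: String): Long = {
  require(fraction.length <= 9, "at most 9 fraction digits")
  (fraction + "0" * (9 - fraction.length)).toLong
}
```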
> Do we have a simple rule?

I haven't found a simpler approach so far. The difference between interval and timestamp is that the latter may have a time zone, or anything else, at the end. We cannot tell users not to use a pattern like `mm:ss.SSSSSSXXX yyyy/MM/dd`.
So what we get here is the milliseconds that FastDateFormat extracts from the string. I believe FastDateFormat can handle the part after seconds, e.g. for `12:12:12.1234Z` the milliseconds part should be `1234`.
> Can we have some comments to explain the behavior? Seems to me it's
> for `.1`, it's 1000 microseconds

This is 100 * 1000 microseconds, but SimpleDateFormat and FastDateFormat have weird behavior. The example below is on 2.4 without my changes:
```scala
scala> val df = Seq("""{"a":"2019-10-14T09:39:07.1Z"}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-------------------------+
|jsontostructs(value)     |
+-------------------------+
|[2019-10-14 12:39:07.001]|
+-------------------------+

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-------------------------+
|jsontostructs(value)     |
+-------------------------+
|[2019-10-14 12:39:07.001]|
+-------------------------+
```

So `.1` cannot be parsed correctly, only `0.100`:
```scala
scala> val df = Seq("""{"a":"2019-10-14T09:39:07.100Z"}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-----------------------+
|jsontostructs(value)   |
+-----------------------+
|[2019-10-14 12:39:07.1]|
+-----------------------+
```
From what I see in the source code of SimpleDateFormat, it just casts the fraction part to an int: `.001` and `.01` are the same and equal to 1.
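A small standalone demo of that behavior with the plain `java.text` API (not code from this PR):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// SimpleDateFormat reads the fraction digits as a plain integer of milliseconds,
// so ".01" and ".001" both parse to 1 millisecond.
val fmt = new SimpleDateFormat("HH:mm:ss.SSS")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val t1 = fmt.parse("09:39:07.01").getTime
val t2 = fmt.parse("09:39:07.001").getTime
assert(t1 == t2)  // both end up as 09:39:07.001
```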
Currently, the following check fails:

```scala
check("yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX",
  "2019-10-14T09:39:07.000010Z", "2019-10-14T09:39:07.000010Z")
```

Expected: 1571045947000010
Actual: 1571045947010000

because `.000010` is parsed to `10` inside of SimpleDateFormat.
The commit 86ac2b2 should fix that.
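Assuming the fix applies the digit-aware scaling described in the PR description, the expected value falls out of the arithmetic; a small sketch (not the literal commit):

```scala
// With pattern "SSSSSS" the declared fraction width is 6, so the raw parsed value 10
// (from ".000010") is scaled as 10 / 10^6 of a second, i.e. 10 microseconds.
val raw = 10L
val digitsInFraction = 6
val micros = raw * 1000000L / math.pow(10, digitsInFraction).toLong
assert(micros == 10L)  // matches the expected ...000010 tail
```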
```scala
    assert(actual === expected)
  }
  check("yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX",
    "2019-10-14T09:39:07.3220000Z", "2019-10-14T09:39:07.322Z")
```
Is this behavior the same as in the master branch?
Test build #113716 has finished for PR 26507 at commit

Test build #113727 has finished for PR 26507 at commit

Test build #113734 has finished for PR 26507 at commit
Is there some standard to define the behavior? For example,

I think the regular mathematical definition is applicable here: https://en.wikipedia.org/wiki/Fraction_(mathematics)#Decimal_fractions_and_percentages

Java 8 DateTimeFormatter says that

The doc of SimpleDateFormat says
@cloud-fan In general, are you ok with the changes? Should I continue?

I think this is a good idea. I wasn't aware that we can tweak
```scala
check("yyyy-MM-dd'T'HH:mm:ss.SX",
  "2019-10-14T09:39:07.1Z", "2019-10-14T09:39:07.1Z")
check("yyyy-MM-dd'T'HH:mm:ss.SSX",
  "2019-10-14T09:39:07.10Z", "2019-10-14T09:39:07.1Z")
```
Can we add some negative tests?
I'd like to see a test like `xxx.123` with format `.SS`.
It just returns an invalid result, xxx + 1.23. For example: `"2019-10-14T09:39:07.123Z"` -> `"2019-10-14T09:39:08.23Z"`. I can add such a test, but I don't know what it aims to validate.
Test build #113784 has finished for PR 26507 at commit

Test build #113857 has finished for PR 26507 at commit

jenkins, retest this, please

Test build #113861 has finished for PR 26507 at commit
cc @HyukjinKwon

Test build #113893 has finished for PR 26507 at commit
dongjoon-hyun left a comment
+1, LGTM. Merged to branch-2.4.
Thank you, @MaxGekk and @cloud-fan.
…JSON/CSV datasources

Closes #26507 from MaxGekk/fastdateformat-micros.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HyukjinKwon left a comment
LGTM too. Sorry for my late response.
@MaxGekk. According to your finding, we may revert this from

This PR fixes bugs as well, see #26507 (comment). @dongjoon-hyun Try to parse timestamps with fraction patterns
@MaxGekk. The existing bugs are not blockers for the release. But a new bug can be a blocker because it would be called a regression.

OK. As far as I know, this PR hasn't introduced any regressions yet, or have I missed something?
Sure, @MaxGekk. That's the reason why I said 'in the worst case'. We are investigating this area, aren't we? This is an early notice to the relevant peers on this PR.
The document of

I think we should apply it to all the places that promise to support the pattern string of

BTW, I don't think this is a serious bug that is worth failing the RC.
Thank you for the decision, @cloud-fan!
… legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582
- `LegacySimpleTimestampFormatter` uses `SimpleDateFormat`, and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in CSV/JSON datasource
- `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing.
- `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe the behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
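The `lenient` flag mentioned above is the standard `java.text` one; a small illustration of the difference (plain JDK code, not from the PR):

```scala
import java.text.{ParseException, SimpleDateFormat}

val fmt = new SimpleDateFormat("yyyy-MM-dd")
// Lenient (the default): out-of-range fields are silently rolled over.
fmt.setLenient(true)
println(fmt.parse("2019-02-30"))  // becomes March 2, 2019
// Strict: the same input is rejected with a ParseException.
fmt.setLenient(false)
try fmt.parse("2019-02-30")
catch { case e: ParseException => println(s"rejected: ${e.getMessage}") }
```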
What changes were proposed in this pull request?

In the PR, I propose parsing timestamp strings up to microsecond precision. To achieve that, I added a sub-class of `GregorianCalendar` to get access to the `protected` field `fields`, which contains non-normalized parsed fields immediately after parsing. In particular, I assume that the `MILLISECOND` field contains the entire seconds fraction converted to `int`. By knowing the expected digits in the fractional part, the parsed field is converted to a fraction up to microsecond precision.

This PR supports additional patterns for seconds fractions from `S` to `SSSSSS` in JSON/CSV options.

Why are the changes needed?

To improve user experience with JSON and CSV datasources, and to allow parsing timestamp strings up to microsecond precision.

Does this PR introduce any user-facing change?

No, the PR extends the set of supported timestamp patterns for the seconds fraction by `S`, `SS`, `SSSS`, `SSSSS`, and `SSSSSS`.

How was this patch tested?

By existing test suites `JsonExpressionSuite`, `JsonFunctionsSuite`, `JsonSuite`, `CsvSuite`, `UnivocityParserSuite`, and by new tests added to `DateTimeUtilsSuite` and `JsonFunctionsSuite` for `from_json()`, and to `CSVSuite`.
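A usage sketch of the extended fraction patterns in a spark-shell session (it mirrors the `from_json` examples earlier in the thread and is not taken from the PR's test suites; the displayed value depends on the session time zone):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, TimestampType}

val schema = new StructType().add("a", TimestampType)
val df = Seq("""{"a":"2019-10-14T09:39:07.123456Z"}""").toDF
// With this change, up to six 'S' letters are honored, so the microseconds
// .123456 should survive parsing instead of being truncated or mangled.
val res = df.select(
  from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX")))
res.show(false)
```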