[SPARK-29904][SQL][2.4] Parse timestamps in microsecond precision by JSON/CSV datasources #26507
Conversation
@cloud-fan Please, take a look at this.
```scala
class MicrosCalendar(tz: TimeZone) extends GregorianCalendar(tz, Locale.US) {
  def getMicros(): SQLTimestamp = {
```
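The class above is the heart of the change. A minimal sketch of the idea under discussion, assuming a `digitsInFraction` constructor parameter (that name, and `MicrosCalendarSketch`, are illustrative and not necessarily the PR's actual code):

```scala
import java.util.{Calendar, GregorianCalendar, Locale, TimeZone}

// Sketch only: subclass GregorianCalendar to reach the protected `fields` array,
// where FastDateFormat leaves the raw, non-normalized seconds fraction as an int,
// and rescale it to microseconds using the number of digits in the pattern.
class MicrosCalendarSketch(tz: TimeZone, digitsInFraction: Int)
    extends GregorianCalendar(tz, Locale.US) {
  def getMicros(): Long = {
    // For input ".1234" parsed with pattern "SSSS", fields(MILLISECOND) holds 1234.
    val fraction = fields(Calendar.MILLISECOND).toLong
    // Rescale to microseconds: 1234 * 1000000 / 10^4 = 123400.
    fraction * 1000000L / math.pow(10, digitsInFraction).toLong
  }
}
```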
Can we have some comments to explain the behavior? Seems to me it's:
- for `.1`, it's 1000 microseconds
- for `.1234`, it's 1234 microseconds
- for `.1234567`, it's 123456 microseconds
Do we have a simple rule? The rule for intervals is pretty simple: add `0` at the end until the seconds fraction has 9 digits, then parse the 9 digits as nanoseconds.
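For reference, a minimal sketch of that interval rule, assuming the fraction arrives as a plain digit string (the helper name is made up for illustration):

```scala
// Interval rule sketch: right-pad the seconds fraction to 9 digits and read it as nanoseconds.
// "1"       -> "100000000" -> 100000000 ns (0.1 s)
// "1234567" -> "123456700" -> 123456700 ns
def fractionToNanos(fraction: String): Long = {
  require(fraction.length <= 9, "at most 9 fraction digits")
  (fraction + "0" * (9 - fraction.length)).toLong
}
```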
> Do we have a simple rule?

I haven't found a simpler approach so far. The difference between interval and timestamp is that the latter may have a time zone, or anything else, at the end. We cannot tell users not to use a pattern like `mm:ss.SSSSSSXXX yyyy/MM/dd`.
So what we get here is the milliseconds that FastDateFormat extracts from the string. I believe FastDateFormat can handle the part after seconds, e.g. for `12:12:12.1234Z` the milliseconds part should be `1234`.
> Can we have some comments to explain the behavior? Seems to me it's
> for `.1`, it's 1000 microseconds

This is 100 * 1000 microseconds, but SimpleDateFormat and FastDateFormat have weird behavior. The example below is on 2.4 without my changes:
```scala
scala> val df = Seq("""{"a":"2019-10-14T09:39:07.1Z"}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-------------------------+
|jsontostructs(value)     |
+-------------------------+
|[2019-10-14 12:39:07.001]|
+-------------------------+

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-------------------------+
|jsontostructs(value)     |
+-------------------------+
|[2019-10-14 12:39:07.001]|
+-------------------------+
```

So `.1` cannot be parsed correctly, only `0.100`:
```scala
scala> val df = Seq("""{"a":"2019-10-14T09:39:07.100Z"}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-----------------------+
|jsontostructs(value)   |
+-----------------------+
|[2019-10-14 12:39:07.1]|
+-----------------------+
```
From what I see in the source code of SimpleDateFormat, it just casts the fraction part to an int: `.001` and `.01` are the same and equal to 1.
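A small standalone demo of that behavior with the plain `java.text` API (not code from this PR):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// SimpleDateFormat reads the fraction digits as a plain integer of milliseconds,
// so ".01" and ".001" both parse to 1 millisecond.
val fmt = new SimpleDateFormat("HH:mm:ss.SSS")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val t1 = fmt.parse("09:39:07.01").getTime
val t2 = fmt.parse("09:39:07.001").getTime
assert(t1 == t2)  // both end up as 09:39:07.001
```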
Currently, the following check fails:

```scala
check("yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX",
  "2019-10-14T09:39:07.000010Z", "2019-10-14T09:39:07.000010Z")
```

Expected: 1571045947000010
Actual: 1571045947010000

because `.000010` is parsed to `10` inside of SimpleDateFormat.
The commit 86ac2b2 should fix that.
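Assuming the fix applies the digit-aware scaling described in the PR description, the expected value falls out of the arithmetic; a small sketch (not the literal commit):

```scala
// With pattern "SSSSSS" the declared fraction width is 6, so the raw parsed value 10
// (from ".000010") is scaled as 10 / 10^6 of a second, i.e. 10 microseconds.
val raw = 10L
val digitsInFraction = 6
val micros = raw * 1000000L / math.pow(10, digitsInFraction).toLong
assert(micros == 10L)  // matches the expected ...000010 tail
```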
```scala
    assert(actual === expected)
  }
  check("yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX",
    "2019-10-14T09:39:07.3220000Z", "2019-10-14T09:39:07.322Z")
```
Is this behavior the same as in the master branch?
Test build #113716 has finished for PR 26507 at commit

Test build #113727 has finished for PR 26507 at commit

Test build #113734 has finished for PR 26507 at commit
Is there some standard to define the behavior? For example,

I think the regular mathematical definition is applicable here: https://en.wikipedia.org/wiki/Fraction_(mathematics)#Decimal_fractions_and_percentages

Java 8 DateTimeFormatter says that

The doc of SimpleDateFormat says
@cloud-fan In general, are you ok with the changes? Should I continue?

I think this is a good idea. I wasn't aware that we can tweak
```scala
check("yyyy-MM-dd'T'HH:mm:ss.SX",
  "2019-10-14T09:39:07.1Z", "2019-10-14T09:39:07.1Z")
check("yyyy-MM-dd'T'HH:mm:ss.SSX",
  "2019-10-14T09:39:07.10Z", "2019-10-14T09:39:07.1Z")
```
Can we add some negative tests?
I'd like to see a test like `xxx.123` with format `.SS`.
It just returns an invalid result, xxx + 1.23. For example: `"2019-10-14T09:39:07.123Z"` -> `"2019-10-14T09:39:08.23Z"`. I can add such a test, but I don't know what it aims to validate.
Test build #113784 has finished for PR 26507 at commit

Test build #113857 has finished for PR 26507 at commit

jenkins, retest this, please

Test build #113861 has finished for PR 26507 at commit
cc @HyukjinKwon

Test build #113893 has finished for PR 26507 at commit
dongjoon-hyun left a comment
+1, LGTM. Merged to branch-2.4.
Thank you, @MaxGekk and @cloud-fan.
…JSON/CSV datasources

Closes #26507 from MaxGekk/fastdateformat-micros.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HyukjinKwon left a comment
LGTM too. Sorry for my late response.
@MaxGekk. According to your finding, we may revert this from

This PR fixes bugs as well, see #26507 (comment). @dongjoon-hyun Try to parse timestamps with fraction patterns
@MaxGekk. The existing bugs are not blockers for the release. But a new bug can be a blocker because it would be called a regression.

OK. As far as I know, this PR hasn't introduced any regressions yet, or have I missed something?
Sure, @MaxGekk. That's the reason why I said 'in the worst case'. We are investigating this area, aren't we? This is an early notice to the relevant peers on this PR.
The document of

I think we should apply it to all the places that promise to support the pattern string of

BTW, I don't think this is a serious bug that is worth failing the RC.
Thank you for the decision, @cloud-fan!
… legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582
- `LegacySimpleTimestampFormatter` uses `SimpleDateFormat`, and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in CSV/JSON datasource
- `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing.
- `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe the behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
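The `lenient` flag mentioned above is the standard `java.text` one; a small illustration of the difference (plain JDK code, not from the PR):

```scala
import java.text.{ParseException, SimpleDateFormat}

val fmt = new SimpleDateFormat("yyyy-MM-dd")
// Lenient (the default): out-of-range fields are silently rolled over.
fmt.setLenient(true)
println(fmt.parse("2019-02-30"))  // becomes March 2, 2019
// Strict: the same input is rejected with a ParseException.
fmt.setLenient(false)
try fmt.parse("2019-02-30")
catch { case e: ParseException => println(s"rejected: ${e.getMessage}") }
```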
What changes were proposed in this pull request?

In the PR, I propose parsing timestamp strings up to microsecond precision. To achieve that, I added a sub-class of `GregorianCalendar` to get access to the `protected` field `fields`, which contains non-normalized parsed fields immediately after parsing. In particular, I assume that the `MILLISECOND` field contains the entire seconds fraction converted to `int`. By knowing the expected digits in the fractional part, the parsed field is converted to a fraction up to microsecond precision.

This PR supports additional patterns for seconds fractions from `S` to `SSSSSS` in JSON/CSV options.

Why are the changes needed?

To improve user experience with JSON and CSV datasources, and to allow parsing timestamp strings up to microsecond precision.

Does this PR introduce any user-facing change?

No, the PR extends the set of supported timestamp patterns for the seconds fraction by `S`, `SS`, `SSSS`, `SSSSS`, and `SSSSSS`.

How was this patch tested?

By existing test suites `JsonExpressionSuite`, `JsonFunctionsSuite`, `JsonSuite`, `CsvSuite`, `UnivocityParserSuite`, and by new tests added to `DateTimeUtilsSuite` and `JsonFunctionsSuite` for `from_json()`, and to `CSVSuite`.
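A usage sketch of the extended fraction patterns in a spark-shell session (it mirrors the `from_json` examples earlier in the thread and is not taken from the PR's test suites; the displayed value depends on the session time zone):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, TimestampType}

val schema = new StructType().add("a", TimestampType)
val df = Seq("""{"a":"2019-10-14T09:39:07.123456Z"}""").toDF
// With this change, up to six 'S' letters are honored, so the microseconds
// .123456 should survive parsing instead of being truncated or mangled.
val res = df.select(
  from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX")))
res.show(false)
```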