Conversation

Member

@MaxGekk MaxGekk commented Nov 13, 2019

What changes were proposed in this pull request?

In the PR, I propose parsing timestamp strings up to microsecond precision. To achieve that, I added a subclass of `GregorianCalendar` to get access to the protected field `fields`, which contains the non-normalized parsed fields immediately after parsing. In particular, I assume that the `MILLISECOND` field contains the entire seconds fraction converted to an `int`. Knowing the expected number of digits in the fractional part, the parsed field is converted to a fraction up to microsecond precision.

This PR supports additional patterns for the seconds fraction, from `S` to `SSSSSS`, in JSON/CSV options.
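A minimal sketch of the idea (assuming a `digitsInFraction` parameter for the number of `S` letters in the pattern; the actual PR code, shown in the diff below, may wire this differently):

```scala
import java.util.{Calendar, GregorianCalendar, Locale, TimeZone}

// `fields` is a protected member of java.util.Calendar that holds the raw,
// non-normalized values right after parsing.
class MicrosCalendar(tz: TimeZone, digitsInFraction: Int)
    extends GregorianCalendar(tz, Locale.US) {
  def getMicros(): Long = {
    // SimpleDateFormat parses the fraction as a plain int into MILLISECOND,
    // e.g. ".1" with pattern S becomes 1. Rescale it so that
    // digitsInFraction digits map to microseconds: 1 -> 100000.
    val d = fields(Calendar.MILLISECOND).toLong
    d * 1000000L / math.pow(10, digitsInFraction).toLong
  }
}
```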

Why are the changes needed?

To improve user experience with JSON and CSV datasources, and to allow parsing timestamp strings up to microsecond precision.

Does this PR introduce any user-facing change?

No. The PR only extends the set of supported timestamp patterns for the seconds fraction with `S`, `SS`, `SSSS`, `SSSSS`, and `SSSSSS`.

How was this patch tested?

By the existing test suites `JsonExpressionSuite`, `JsonFunctionsSuite`, `JsonSuite`, `CsvSuite`, and `UnivocityParserSuite`, and by new tests added to `DateTimeUtilsSuite`, to `JsonFunctionsSuite` (for `from_json()`), and to `CSVSuite`.

Member Author

MaxGekk commented Nov 13, 2019

@cloud-fan Please take a look at this.

@MaxGekk MaxGekk changed the title [WIP][SQL] Parse timestamps in microsecond precision in JSON datasource [WIP][SQL][2.4] Parse timestamps in microsecond precision in JSON datasource Nov 13, 2019
class MicrosCalendar(tz: TimeZone) extends GregorianCalendar(tz, Locale.US) {
  def getMicros(): SQLTimestamp = {
Contributor

Can we have some comments to explain the behavior? Seems to me it's

  1. for .1, it's 1000 microseconds
  2. for .1234, it's 1234 microseconds
  3. for .1234567, it's 123456 microseconds

Contributor

Do we have a simple rule? The rule for intervals is pretty simple: append zeros at the end until the seconds fraction has 9 digits, then parse the 9 digits as nanoseconds.
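For illustration, that interval rule amounts to something like this (a sketch, not the actual interval-parser code):

```scala
// ".1234" -> "123400000" -> 123400000 nanoseconds
def fractionToNanos(fraction: String): Int =
  fraction.stripPrefix(".").padTo(9, '0').toInt
```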

Member Author

Do we have a simple rule?

I haven't found a simpler approach so far. The difference between intervals and timestamps is that a timestamp may have a time zone (or anything else) after the seconds fraction. We cannot tell users not to use a pattern like `mm:ss.SSSSSSXXX yyyy/MM/dd`.

Contributor

So what we get here is the milliseconds that FastDateFormat extracts from the string. I believe FastDateFormat can handle the part after the seconds, e.g. for 12:12:12.1234Z, the milliseconds part should be 1234.

Member Author

Can we have some comments to explain the behavior? Seems to me it's
for .1, it's 1000 microseconds

This is 100 * 1000 microseconds, but SimpleDateFormat and FastDateFormat have weird behavior. The example below runs on 2.4 without my changes:

scala> val df = Seq("""{"a":"2019-10-14T09:39:07.1Z"}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-------------------------+
|jsontostructs(value)     |
+-------------------------+
|[2019-10-14 12:39:07.001]|
+-------------------------+
scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-------------------------+
|jsontostructs(value)     |
+-------------------------+
|[2019-10-14 12:39:07.001]|
+-------------------------+

So .1 cannot be parsed correctly, only .100:

scala> val df = Seq("""{"a":"2019-10-14T09:39:07.100Z"}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val res = df.select(from_json('value, schema, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")))
res: org.apache.spark.sql.DataFrame = [jsontostructs(value): struct<a: timestamp>]

scala> res.show(false)
+-----------------------+
|jsontostructs(value)   |
+-----------------------+
|[2019-10-14 12:39:07.1]|
+-----------------------+

Member Author

@MaxGekk MaxGekk Nov 13, 2019

From what I see in the source code of SimpleDateFormat, it just parses the fraction part as an int, so .001 and .01 are the same and equal to 1.
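A quick standalone snippet (not from the PR) that shows this:

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val fmt = new SimpleDateFormat("ss.SSS")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
// Both fractions are parsed as the int 1 and stored as 1 millisecond:
println(fmt.parse("07.001").getTime) // 7001
println(fmt.parse("07.01").getTime)  // 7001
```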

Member Author

@MaxGekk MaxGekk Nov 13, 2019

Currently, the following check fails:

    check("yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX",
      "2019-10-14T09:39:07.000010Z", "2019-10-14T09:39:07.000010Z")
Expected :1571045947000010
Actual   :1571045947010000

because .000010 is parsed to 10 inside SimpleDateFormat.
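Assuming the rescaling sketched in the PR description above, the fix would handle this case as:

```scala
val d = 10L              // what SimpleDateFormat parses from ".000010"
val digitsInFraction = 6 // number of 'S' letters in the pattern
val micros = d * 1000000L / math.pow(10, digitsInFraction).toLong
assert(micros == 10L)    // 10 microseconds, not 10000
```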

Member Author

Commit 86ac2b2 should fix that.

assert(actual === expected)
}
check("yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX",
"2019-10-14T09:39:07.3220000Z", "2019-10-14T09:39:07.322Z")
Contributor

Is this behavior the same as on the master branch?


SparkQA commented Nov 13, 2019

Test build #113716 has finished for PR 26507 at commit 3d7a611.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 14, 2019

Test build #113727 has finished for PR 26507 at commit 453ee5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 14, 2019

Test build #113734 has finished for PR 26507 at commit 86ac2b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Is there some standard that defines the behavior? For example, for 11:11:11.123, how shall we interpret the .123 here? Or is it related to the timestamp format?

Member Author

MaxGekk commented Nov 14, 2019

Is there some standard that defines the behavior?

I think the regular mathematical definition is applicable here: https://en.wikipedia.org/wiki/Fraction_(mathematics)#Decimal_fractions_and_percentages

For example, 11:11:11.123, how shall we interpret the .123 here? or it's related to the timestamp format?

Java 8's DateTimeFormatter says that S is fraction-of-second: https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html. So it means 11.123 = 11123/1000 seconds, and 11.123456 = 11123456/1000000 seconds = 11 seconds 123 milliseconds 456 microseconds for the pattern ss.SSSSSS.

The doc of SimpleDateFormat says that S is a millisecond: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. It means 11.123456 = 11 seconds 123456 milliseconds = 134 seconds 456 milliseconds for the pattern ss.SSSSSS.
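The difference is easy to demonstrate with a standalone snippet (results noted in comments):

```scala
import java.text.SimpleDateFormat
import java.time.LocalTime
import java.time.format.DateTimeFormatter

// Java 8 time API: S is fraction-of-second.
val t = LocalTime.parse("11:11:11.123456",
  DateTimeFormatter.ofPattern("HH:mm:ss.SSSSSS"))
println(t) // 11:11:11.123456

// SimpleDateFormat: S is milliseconds, so the lenient parser adds 123456 ms.
val d = new SimpleDateFormat("HH:mm:ss.SSSSSS").parse("11:11:11.123456")
println(new SimpleDateFormat("HH:mm:ss.SSS").format(d)) // 11:13:14.456
```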

Member Author

MaxGekk commented Nov 14, 2019

@cloud-fan In general, are you ok with the changes? Should I continue?

@cloud-fan
Contributor

I think this is a good idea. I wasn't aware that we could tweak SimpleDateFormat in this way.

check("yyyy-MM-dd'T'HH:mm:ss.SX",
"2019-10-14T09:39:07.1Z", "2019-10-14T09:39:07.1Z")
check("yyyy-MM-dd'T'HH:mm:ss.SSX",
"2019-10-14T09:39:07.10Z", "2019-10-14T09:39:07.1Z")
Contributor

can we add some negative tests?

Contributor

I'd like to see a test like xxx.123 with format .SS

Member Author

It just returns the invalid result xxx+1.23. For example: "2019-10-14T09:39:07.123Z" -> "2019-10-14T09:39:08.23Z". I can add such a test, but I don't know what it would aim to validate.

@MaxGekk MaxGekk changed the title [WIP][SQL][2.4] Parse timestamps in microsecond precision in JSON datasource [WIP][SQL][2.4] Parse timestamps in microsecond precision in JSON/CSV datasources Nov 14, 2019

SparkQA commented Nov 14, 2019

Test build #113784 has finished for PR 26507 at commit 638c640.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 15, 2019

Test build #113857 has finished for PR 26507 at commit 1852511.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TimestampParser(format: FastDateFormat)

Member Author

MaxGekk commented Nov 15, 2019

jenkins, retest this, please


SparkQA commented Nov 15, 2019

Test build #113861 has finished for PR 26507 at commit 1852511.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TimestampParser(format: FastDateFormat)

@dongjoon-hyun
Member

cc @HyukjinKwon


SparkQA commented Nov 15, 2019

Test build #113893 has finished for PR 26507 at commit 1cbd35b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to branch-2.4.
Thank you, @MaxGekk and @cloud-fan.

dongjoon-hyun pushed a commit that referenced this pull request Nov 15, 2019
…JSON/CSV datasources

(The commit message mirrors the PR description above.)

Closes #26507 from MaxGekk/fastdateformat-micros.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Member

@HyukjinKwon HyukjinKwon left a comment

LGTM too. Sorry for my late response.

@dongjoon-hyun
Member

@MaxGekk. According to your finding, we may revert this from branch-2.4 completely in the worst case. Thank you for working on this area. This is reported as a BUG fix, but it seems to be more of an Improvement.

Member Author

MaxGekk commented Feb 3, 2020

This is reported as a BUG fix, but it seems to be more of an Improvement.

This PR fixes bugs as well, see #26507 (comment). @dongjoon-hyun Try to parse timestamps with the fraction patterns S or SS.

Member

dongjoon-hyun commented Feb 3, 2020

@MaxGekk. The existing bugs are not blockers for the release. But a new bug can be a blocker because it counts as a regression.

Member Author

MaxGekk commented Feb 3, 2020

But a new bug can be a blocker because it counts as a regression.

OK. As far as I know, this PR hasn't introduced any regressions yet, or have I missed something?

@dongjoon-hyun
Member

Sure, @MaxGekk. That's why I said 'in the worst case'. We are investigating this area, aren't we? This is an early notice to the relevant peers on this PR.

Contributor

cloud-fan commented Feb 3, 2020

The documentation of DataFrameReader.json says that the timestampFormat option supports the pattern strings of java.text.SimpleDateFormat. That is, microsecond precision should be supported if the pattern string contains SSS. So this is a bug fix.

I think we should apply it to all the places that promise to support the pattern strings of java.text.SimpleDateFormat, such as to_timestamp. We may also need to fix the master branch when we add back the legacy formatter.

BTW, I don't think this is a serious bug worth failing the RC over.

@dongjoon-hyun
Member

Thank you for the decision, @cloud-fan !

cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
… legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582
- `LegacySimpleTimestampFormatter` uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in CSV/JSON datasource
- `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing.
- `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe the behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
… legacy date/timestamp formatters

(Same commit message as above; cherry picked from commit c198620, Signed-off-by: Wenchen Fan <[email protected]>.)
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… legacy date/timestamp formatters

(Same commit message as above, with apache#-prefixed issue references.)
@MaxGekk MaxGekk deleted the fastdateformat-micros branch June 5, 2020 19:41