Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

In Spark 2.4 and earlier, datetime parsing, formatting, and conversion are performed using the hybrid calendar (Julian + Gregorian).
Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well as the one chosen by the ANSI SQL standard, Spark 3.0 switches to it by using the Java 8 time API (the java.time packages, which are based on ISO chronology). The switch was completed in SPARK-26651.
However, after the switch, some patterns are incompatible between the old and new APIs, so Spark needs its own definition of the patterns rather than depending on the Java API.
In this PR, we achieve this by documenting the patterns and shadowing the incompatible letters. See more details in SPARK-31030.
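To illustrate the kind of incompatibility being shadowed, here is a minimal sketch using the plain Java time APIs (the Spark wiring is not shown): in SimpleDateFormat the letter 'u' means day number of week, while in DateTimeFormatter it means year.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Legacy API (Spark 2.4): 'u' is the day number of the week (1 = Monday).
val legacy = new SimpleDateFormat("u")
val day = new SimpleDateFormat("yyyy-MM-dd").parse("2020-03-11")
println(legacy.format(day)) // "3" -- 2020-03-11 is a Wednesday

// New API (Spark 3.0): 'u' is the year.
val modern = DateTimeFormatter.ofPattern("uuuu-MM-dd")
println(LocalDate.parse("2020-03-11", modern)) // 2020-03-11
```

The same pattern string thus means different things before and after the switch, which is why Spark documents its own pattern letters instead of deferring to either Java API.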

Why are the changes needed?

For backward compatibility.

Does this PR introduce any user-facing change?

No.
After we define our own datetime parsing and formatting patterns, the behavior is the same as in older Spark versions.

How was this patch tested?

Existing and newly added unit tests.
Local documentation test:
![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119451 has finished for PR 27830 at commit 462c63c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


- Number/Text: If the count of pattern letters is 3 or greater, use the Text rules above. Otherwise use the Number rules above.

- Fraction: Outputs the nano-of-second field as a fraction-of-second. The nano-of-second value has nine digits, thus the count of pattern letters is from 1 to 9. If it is less than 9, then the nano-of-second value is truncated, with only the most significant digits being output.
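The Number/Text rule above can be seen with the month letter 'M' (a small java.time sketch, US locale assumed for stable output; not Spark code):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val d = LocalDate.of(2020, 3, 11)
// Fewer than 3 letters: Number rules apply.
println(d.format(DateTimeFormatter.ofPattern("MM", Locale.US)))  // "03"
// 3 or more letters: Text rules apply.
println(d.format(DateTimeFormatter.ofPattern("MMM", Locale.US))) // "Mar"
```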
Member

Currently, Spark doesn't support fractions at nanosecond precision. This can mislead users.

Member Author

Thanks for the comment; updated the fraction section in 621a00e.

// parse. When it is successfully parsed, throw an exception and ask users to change
// the pattern strings or turn on the legacy mode; otherwise, return NULL as Spark
// 2.4 does.
.replace("u", "e")
Member

'u' can be escaped in the pattern, as in 'update time' uuuu-MM-dd. Replacing every 'u' produces a wrong pattern that nothing matches.

Member Author

Actually, the quoted text has already been considered; let me add comments to emphasize this.
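A minimal sketch of how quoting can be respected (hypothetical helper name, not the exact Spark implementation): in the pattern syntax, text between single quotes is a literal, so after splitting on ' only the even-indexed parts contain real pattern letters and are safe to rewrite.

```scala
// Hypothetical helper: rewrite the legacy 'u' (day number of week) to
// the DateTimeFormatter equivalent 'e', but only outside quoted literals.
def convertIncompatiblePattern(pattern: String): String = {
  pattern.split("'").zipWithIndex.map {
    case (part, index) =>
      if (index % 2 == 0) part.replace("u", "e") else part
  }.mkString("'")
}

println(convertIncompatiblePattern("'update time' uuuu-MM-dd"))
// 'update time' eeee-MM-dd -- the quoted literal keeps its u's
```

(Edge cases such as a trailing quote would need extra care; this only sketches the even/odd-index idea.)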

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119477 has finished for PR 27830 at commit b3b5ee4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

cc @MaxGekk @cloud-fan


/**
* Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen
* one in ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes. However, the
Contributor

It's a bit confusing to say Java 7 & 8, as the old APIs are also available in Java 8.

How about SimpleDateFormat and DateTimeFormatter?

Member Author

Thanks, done in e846fbb.


The count of pattern letters determines the format.

- Text: The text style is determined based on the number of pattern letters used. Fewer than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form.
Contributor

how about more than 5 letters?

Member Author

We'll get an IllegalArgumentException.
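A quick sketch of the behavior with plain java.time (US locale assumed for stable output):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val d = LocalDate.of(2020, 3, 11)
println(d.format(DateTimeFormatter.ofPattern("EEE", Locale.US)))  // short form: "Wed"
println(d.format(DateTimeFormatter.ofPattern("EEEE", Locale.US))) // full form: "Wednesday"

// More than five letters is rejected when the pattern is compiled:
try {
  DateTimeFormatter.ofPattern("EEEEEE", Locale.US)
} catch {
  case e: IllegalArgumentException => println(s"rejected: ${e.getMessage}")
}
```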

Contributor

Let's document it.

Member Author

Sure, done in e846fbb.


- Fraction: Outputs the micro-of-second field as a fraction-of-second. The micro-of-second value has six digits, thus the count of pattern letters is from 1 to 6. If it is less than 6, then the micro-of-second value is truncated, with only the most significant digits being output.

- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded.
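The Fraction and Year rules above can be sketched with plain java.time (note: java.time's 'S' works on nano-of-second, whereas Spark's documented pattern caps the fraction at micro-of-second precision):

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

// Fraction: fewer letters than available digits truncates the value,
// keeping only the most significant digits.
val ts = LocalDateTime.of(2020, 3, 11, 12, 0, 0, 123456000)
println(ts.format(DateTimeFormatter.ofPattern("HH:mm:ss.SSS"))) // 12:00:00.123

// Year: two letters use the reduced two-digit form with base year 2000,
// so parsed years land in the range 2000 to 2099.
val fmt = DateTimeFormatter.ofPattern("yy-MM-dd")
println(LocalDate.parse("99-03-11", fmt)) // 2099-03-11
```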
Contributor

Otherwise, the sign is output if the pad width is exceeded.

This is not true when G is present, right?

Member Author

Right, emphasized in e846fbb.

@SparkQA

SparkQA commented Mar 9, 2020

Test build #119557 has finished for PR 27830 at commit 621a00e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2020

Test build #119558 has finished for PR 27830 at commit 82aa515.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen
Contributor

In Spark 3.0, we switch to the Proleptic Gregorian calendar and use DateTimeFormatter for parsing/formatting datetime values. The pattern string is incompatible with the one defined by SimpleDateFormat in Spark 2.4 and earlier. This function ...

Member Author

Thanks, done in 5382508

pattern.split("'").zipWithIndex.map {
  case (patternPart, index) =>
    if (index % 2 == 0) {
      // The meaning of 'u' was day number of week in Java 7; it was changed to year in Java 8.
Contributor

Java 8 -> DateTimeFormatter

Member Author

Thanks, also rephrased the whole comment.

@SparkQA

SparkQA commented Mar 10, 2020

Test build #119613 has finished for PR 27830 at commit e846fbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 10, 2020

Test build #119620 has finished for PR 27830 at commit 5382508.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 3493162 Mar 11, 2020
cloud-fan pushed a commit that referenced this pull request Mar 11, 2020
…Datetime

Closes #27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 3493162)
Signed-off-by: Wenchen Fan <[email protected]>
@xuanyuanking
Member Author

Thanks for the review!

@xuanyuanking xuanyuanking deleted the SPARK-31030 branch March 11, 2020 12:35
@tgravescs
Contributor

So I only skimmed this, but I ran into the config val LEGACY_TIME_PARSER_ENABLED = buildConf("spark.sql.legacy.timeParser.enabled") in SQLConf.

I assume that can be removed with this change?

@cloud-fan
Contributor

Yes, it has been removed in #27889.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…Datetime

Closes apache#27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
5 participants