Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

In Spark 2.4 and earlier, datetime parsing, formatting, and conversion are performed using the hybrid calendar (Julian + Gregorian).
Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well as the one chosen by the ANSI SQL standard, Spark 3.0 switches to it by using the Java 8 time API (the java.time packages, which are based on ISO chronology). The switch was completed in SPARK-26651.
However, after the switch, some patterns are incompatible between the old and new APIs, so Spark needs its own definition of the patterns rather than depending on the Java API.
In this PR, we achieve this by documenting the patterns and shadowing the incompatible letters. See more details in SPARK-31030.
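To illustrate the kind of incompatibility being shadowed, here is a minimal sketch using the plain Java time APIs (the Spark wiring is not shown): in SimpleDateFormat the letter 'u' means day number of week, while in DateTimeFormatter it means year.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Legacy API (Spark 2.4): 'u' is the day number of the week (1 = Monday).
val legacy = new SimpleDateFormat("u")
val day = new SimpleDateFormat("yyyy-MM-dd").parse("2020-03-11")
println(legacy.format(day)) // "3" -- 2020-03-11 is a Wednesday

// New API (Spark 3.0): 'u' is the year.
val modern = DateTimeFormatter.ofPattern("uuuu-MM-dd")
println(LocalDate.parse("2020-03-11", modern)) // 2020-03-11
```

The same pattern string thus means different things before and after the switch, which is why Spark documents its own pattern letters instead of deferring to either Java API.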

Why are the changes needed?

For backward compatibility.

Does this PR introduce any user-facing change?

No.
After we define our own datetime parsing and formatting patterns, the behavior is the same as in older Spark versions.

How was this patch tested?

Existing and newly added unit tests.
Local documentation test:
![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119451 has finished for PR 27830 at commit 462c63c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


- Number/Text: If the count of pattern letters is 3 or greater, use the Text rules above. Otherwise use the Number rules above.

- Fraction: Outputs the nano-of-second field as a fraction-of-second. The nano-of-second value has nine digits, thus the count of pattern letters is from 1 to 9. If it is less than 9, then the nano-of-second value is truncated, with only the most significant digits being output.
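The Number/Text rule above can be seen with the month letter 'M' (a small java.time sketch, US locale assumed for stable output; not Spark code):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val d = LocalDate.of(2020, 3, 11)
// Fewer than 3 letters: Number rules apply.
println(d.format(DateTimeFormatter.ofPattern("MM", Locale.US)))  // "03"
// 3 or more letters: Text rules apply.
println(d.format(DateTimeFormatter.ofPattern("MMM", Locale.US))) // "Mar"
```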
Member

Currently, Spark doesn't support fractions at nanosecond precision. This can mislead users.

Member Author

Thanks for the comment; updated the fraction section in 621a00e.

// parse. When it is successfully parsed, throw an exception and ask users to change
// the pattern strings or turn on the legacy mode; otherwise, return NULL as Spark
// 2.4 does.
.replace("u", "e")
Member

'u' can be escaped in the pattern, as in 'update time' uuuu-MM-dd. Replacing every 'u' produces a wrong pattern that nothing matches.

Member Author

Actually, the quoted text has already been considered; let me add comments to emphasize this.
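A minimal sketch of how quoting can be respected (hypothetical helper name, not the exact Spark implementation): in the pattern syntax, text between single quotes is a literal, so after splitting on ' only the even-indexed parts contain real pattern letters and are safe to rewrite.

```scala
// Hypothetical helper: rewrite the legacy 'u' (day number of week) to
// the DateTimeFormatter equivalent 'e', but only outside quoted literals.
def convertIncompatiblePattern(pattern: String): String = {
  pattern.split("'").zipWithIndex.map {
    case (part, index) =>
      if (index % 2 == 0) part.replace("u", "e") else part
  }.mkString("'")
}

println(convertIncompatiblePattern("'update time' uuuu-MM-dd"))
// 'update time' eeee-MM-dd -- the quoted literal keeps its u's
```

(Edge cases such as a trailing quote would need extra care; this only sketches the even/odd-index idea.)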

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119477 has finished for PR 27830 at commit b3b5ee4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

cc @MaxGekk @cloud-fan


/**
* Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen
* one in ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes. However, the
Contributor

It's a bit confusing to say Java 7 & 8, as the old APIs are also available in Java 8.

How about SimpleDateFormat and DateTimeFormatter?

Member Author

Thanks, done in e846fbb.


The count of pattern letters determines the format.

- Text: The text style is determined based on the number of pattern letters used. Fewer than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form.
Contributor

how about more than 5 letters?

Member Author

We'll get an IllegalArgumentException.
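A quick sketch of the behavior with plain java.time (US locale assumed for stable output):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val d = LocalDate.of(2020, 3, 11)
println(d.format(DateTimeFormatter.ofPattern("EEE", Locale.US)))  // short form: "Wed"
println(d.format(DateTimeFormatter.ofPattern("EEEE", Locale.US))) // full form: "Wednesday"

// More than five letters is rejected when the pattern is compiled:
try {
  DateTimeFormatter.ofPattern("EEEEEE", Locale.US)
} catch {
  case e: IllegalArgumentException => println(s"rejected: ${e.getMessage}")
}
```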

Contributor

Let's document it.

Member Author

Sure, done in e846fbb.


- Fraction: Outputs the micro-of-second field as a fraction-of-second. The micro-of-second value has six digits, thus the count of pattern letters is from 1 to 6. If it is less than 6, then the micro-of-second value is truncated, with only the most significant digits being output.

- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded.
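The Fraction and Year rules above can be sketched with plain java.time (note: java.time's 'S' works on nano-of-second, whereas Spark's documented pattern caps the fraction at micro-of-second precision):

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

// Fraction: fewer letters than available digits truncates the value,
// keeping only the most significant digits.
val ts = LocalDateTime.of(2020, 3, 11, 12, 0, 0, 123456000)
println(ts.format(DateTimeFormatter.ofPattern("HH:mm:ss.SSS"))) // 12:00:00.123

// Year: two letters use the reduced two-digit form with base year 2000,
// so parsed years land in the range 2000 to 2099.
val fmt = DateTimeFormatter.ofPattern("yy-MM-dd")
println(LocalDate.parse("99-03-11", fmt)) // 2099-03-11
```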
Contributor

Otherwise, the sign is output if the pad width is exceeded.

This is not true when G is present, right?

Member Author

Right, emphasized in e846fbb.

@SparkQA

SparkQA commented Mar 9, 2020

Test build #119557 has finished for PR 27830 at commit 621a00e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2020

Test build #119558 has finished for PR 27830 at commit 82aa515.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen
Contributor

In Spark 3.0, we switch to the Proleptic Gregorian calendar and use DateTimeFormatter for parsing/formatting datetime values. The pattern string is incompatible with the one defined by SimpleDateFormat in Spark 2.4 and earlier. This function ...

Member Author

Thanks, done in 5382508

pattern.split("'").zipWithIndex.map {
  case (patternPart, index) =>
    if (index % 2 == 0) {
      // The meaning of 'u' was day number of week in Java 7; it was changed to year in Java 8.
Contributor

Java 8 -> DateTimeFormatter

Member Author

Thanks, also rephrased the whole comment.

@SparkQA

SparkQA commented Mar 10, 2020

Test build #119613 has finished for PR 27830 at commit e846fbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 10, 2020

Test build #119620 has finished for PR 27830 at commit 5382508.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 3493162 Mar 11, 2020
cloud-fan pushed a commit that referenced this pull request Mar 11, 2020
…Datetime

Closes #27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 3493162)
Signed-off-by: Wenchen Fan <[email protected]>
@xuanyuanking
Member Author

Thanks for the review!

@xuanyuanking xuanyuanking deleted the SPARK-31030 branch March 11, 2020 12:35
@tgravescs
Contributor

So I only skimmed this, but I ran into the config val LEGACY_TIME_PARSER_ENABLED = buildConf("spark.sql.legacy.timeParser.enabled") in SQLConf.

I assume that can be removed with this change?

@cloud-fan
Contributor

Yes, it has been removed in #27889.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…Datetime

Closes apache#27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
5 participants