Conversation

@yaooqinn (Member) commented Jun 1, 2020

What changes were proposed in this pull request?

This PR switches the default Locale from `US` to `GB` to change the first day of the week from Sunday-start back to Monday-start, the same as v2.4.

Why are the changes needed?

Cases:

```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```
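For reference, the same first-day shift is reproducible outside Spark with plain java.time (a minimal sketch, not from the PR; in the new API the legacy `u` is expressed with the localized letter `e`):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object WeekParseDemo extends App {
  // 'Y' = week-based-year, 'w' = week-of-week-based-year, 'e' = localized day-of-week
  val us = DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.US)
  val gb = DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.UK) // Locale.UK is Java's en-GB constant
  println(LocalDate.parse("2020-01-01", us)) // 2019-12-29 (week 1 starts on Sunday)
  println(LocalDate.parse("2020-01-01", gb)) // 2019-12-30 (week 1 starts on Monday)
}
```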

Reasons:

These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.

From the Java doc of WeekFields

```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is US, so the result moves a day backward.

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671).

With this change, the first-day-of-week calculation for these functions is restored when using the default locale.
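The locale dependence itself is easy to verify with `WeekFields` (a minimal sketch, assuming plain java.time outside Spark):

```scala
import java.time.temporal.WeekFields
import java.util.Locale

object FirstDayDemo extends App {
  // The first day-of-week is derived from the locale's week definition
  println(WeekFields.of(Locale.US).getFirstDayOfWeek) // SUNDAY
  println(WeekFields.of(Locale.UK).getFirstDayOfWeek) // MONDAY, matching ISO-8601
}
```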

Does this PR introduce any user-facing change?

Yes, but the behavior change restores the old behavior of v2.4.

How was this patch tested?

add unit tests

```scala
/**
* This is changed from Locale.US to GB, because:
* The first day-of-week varies by culture.
* For example, the US uses Sunday, while the United Kingdom and the ISO-8601 standard use Monday.
```
Contributor:

is this the only difference between US and en-GB?

Member Author:

It seems that we don't need to care about the other differences, e.g. the currency symbol.
FYI:
http://www.localeplanet.com/java/en-GB/index.html and http://www.localeplanet.com/java/en-US/index.html

The timeZone is not the same, but that's a separate field; we don't get it from the Locale, right?

Contributor:

Is there any localized timezone-related pattern letter?

Member Author:

We have O, OOOO and ZZZZ that are localized, but they are decided by the zone ID (spark.sql.session.timeZone), not by the Locale here.
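For example (a sketch, not from the PR), the localized offset text follows the zone ID rather than the locale:

```scala
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter
import java.util.Locale

object OffsetDemo extends App {
  val ts = ZonedDateTime.of(2020, 6, 1, 0, 0, 0, 0, ZoneId.of("Asia/Shanghai"))
  // Both locales print the same text, because the offset comes from the zone ID
  println(ts.format(DateTimeFormatter.ofPattern("OOOO", Locale.US))) // GMT+08:00
  println(ts.format(DateTimeFormatter.ofPattern("OOOO", Locale.UK))) // GMT+08:00
}
```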

@SparkQA commented Jun 1, 2020

Test build #123369 has finished for PR 28692 at commit ab0517d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```scala
val defaultLocale: Locale = Locale.US
/**
* This is changed from Locale.US to GB, because:
```
Contributor:

Let's make the doc shorter:

Before Spark 3.0, the first day-of-week is always Monday. Since Spark 3.0, it depends on the locale.
We pick GB as the default locale instead of US, to be compatible with Spark 2.x, as US locale uses
Sunday as the first day-of-week. See SPARK-31879.

@srowen (Member) commented Jun 1, 2020

Wow, really? Monday is considered the first day of week in the US locale?
I'm a little uneasy about defaulting to GB, as it might have other implications.
What does the root locale do here?

@SparkQA commented Jun 1, 2020

Test build #123378 has finished for PR 28692 at commit 3339274.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Jun 2, 2020

retest this please

@SparkQA commented Jun 2, 2020

Test build #123410 has finished for PR 28692 at commit 3339274.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

Good idea using the ROOT locale! @yaooqinn can you try it?

@yaooqinn (Member Author) commented Jun 2, 2020

Locale.ROOT was my first choice, but it didn't work. However, even if it worked, I don't think it's a good idea for a distributed system.

@yaooqinn (Member Author) commented Jun 2, 2020

```diff
+  val defaultLocale: Locale = Locale.ROOT

   def defaultPattern(): String = s"${DateFormatter.defaultPattern} HH:mm:ss"

diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
index a9a3bccadc..94fcc3b4ad 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
@@ -1032,4 +1032,4 @@ select to_timestamp('2020-01-01', 'YYYY-ww-uu')
 -- !query schema
 struct<to_timestamp(2020-01-01, YYYY-ww-uu):timestamp>
 -- !query output
-2019-12-30 00:00:00
+2019-12-29 00:00:00
```

@cloud-fan (Contributor) commented:

It's a bit unfortunate that the ROOT locale also uses Sunday as the first day. I checked the pattern string doc (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html); there are only 2 localized pattern letters: e (localized day-of-week) and O (localized zone-offset). US and GB have no difference for these two.
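(For reference, a quick way to check this with plain java.time; a sketch, and note that locale data can differ across JDK versions and providers:)

```scala
import java.time.temporal.WeekFields
import java.util.Locale

object RootLocaleCheck extends App {
  // On the JDK 8 setup discussed here, ROOT also reports a Sunday-start week
  println(WeekFields.of(Locale.ROOT).getFirstDayOfWeek)
}
```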

@cloud-fan (Contributor) commented:

retest this please

@SparkQA commented Jun 2, 2020

Test build #123413 has finished for PR 28692 at commit 3339274.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Jun 2, 2020

retest this please

@yaooqinn (Member Author) commented Jun 2, 2020

retest this please

@SparkQA commented Jun 2, 2020

Test build #123421 has finished for PR 28692 at commit 3339274.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```sql
select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd');

-- SPARK-31879: the first day of week
select to_timestamp('2020-01-01', 'YYYY-ww-uu');
```
Contributor:

Can we use formatting in the test, so this doesn't conflict with #28706?

@srowen (Member) commented Jun 2, 2020

Just for my info, why was Monday the first day of week in 2.4 then? We didn't use TZ to determine it?
Also, is that the right answer from a SQL standard or Hive perspective?

@yaooqinn (Member Author) commented Jun 2, 2020

> why was Monday the first day of week in 2.4 then?

In Spark 2.4 and earlier, datetime parsing and formatting were performed by the old Java 7 SimpleDateFormat API, and the Monday-as-first-day-of-week behavior was decided by SimpleDateFormat. Since Spark 3.0, we have switched to the new Java 8 DateTimeFormatter to use the Proleptic Gregorian calendar, which is required by the ISO and SQL standards.

> we didn't use TZ to determine it?

The first day of week is a date field, not a time field; TZ is short for time zone.

> Also is that the right answer from a SQL standard

The first day of week is not directly decided by the ANSI SQL framework but by its normative reference, the ISO standard, where the week starts on Monday. The day-of-week in DateTimeFormatter has become localized, and by default we use Locale.US, where a week starts on Sunday. We have to restore this for backward compatibility. There may be more solution options at SPARK-31879.

> or Hive perspective?

I think the backward compatibility between 3.0 and 2.4 shall go first.

@srowen (Member) commented Jun 2, 2020

Oops, right, I meant Locale, not TZ.
OK, makes sense.

@SparkQA commented Jun 2, 2020

Test build #123433 has finished for PR 28692 at commit f94c72b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

@yaooqinn please also create a PR to explain what changed in the meaning of `u` in the migration guide. I'm merging this PR to master/3.0 first, thanks!

@cloud-fan closed this in c59f51b on Jun 3, 2020
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jun 3, 2020
This PR switches the default Locale from `US` to `GB` to change the first day of the week from Sunday-start back to Monday-start, the same as v2.4.

Closes apache#28692 from yaooqinn/SPARK-31879.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c59f51b)
Signed-off-by: Wenchen Fan <[email protected]>
@bart-samwel commented:

@yaooqinn @cloud-fan Does this locale setting change anything else in the date parsing?

@cloud-fan (Contributor) commented:

As I mentioned in: #28692 (comment)

There are only 2 localized pattern letters: e (localized day-of-week) and O (localized zone-offset). @yaooqinn tested it locally; there is no difference between US and GB for these 2 pattern letters, except that GB treats Monday as the first day of week.

@cloud-fan (Contributor) commented Jun 3, 2020

> please also create a PR to explain what gets changed in the meaning of u in the migration guide.

@bart-samwel maybe you were confused by this. The meaning was already changed when we switched to the Java 8 formatter API. In the legacy formatter, u means "Day number of week (1 = Monday, ..., 7 = Sunday)", which is a number; in the new formatter, we replace u with e, but e is the localized day-of-week, which is number/text.

There is no pattern letter in the Java 8 formatter API with the same meaning as the legacy u; e is the closest. Parsing is OK, as we ban all week-based pattern letters in parsing, see #28706.
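The contrast between the legacy u and the new e is easy to see side by side (a minimal sketch with plain Java APIs, outside Spark):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object DayOfWeekPatternDemo extends App {
  // Legacy Java 7 API: 'u' is a fixed day number (1 = Monday ... 7 = Sunday), locale-independent
  val legacy = new SimpleDateFormat("u")
  val sunday = new SimpleDateFormat("yyyy-MM-dd").parse("2019-12-29")
  println(legacy.format(sunday)) // 7

  // Java 8 API: 'e' is the localized day-of-week; the number depends on the locale's week start
  val d = LocalDate.of(2019, 12, 29) // a Sunday
  println(d.format(DateTimeFormatter.ofPattern("e", Locale.US))) // 1 (US weeks start on Sunday)
  println(d.format(DateTimeFormatter.ofPattern("e", Locale.UK))) // 7 (GB weeks start on Monday)
}
```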

@yaooqinn (Member Author) commented Jun 3, 2020

Hi @bart-samwel, we are improving test coverage for the datetime patterns. Can you help review here https://github.com/apache/spark/pull/28718/files#diff-e342ba37e6a036a13fa8373de2ea9470R1047 ?

@cloud-fan (Contributor) commented Jun 3, 2020

This breaks the Jenkins Java 11 job, because Java 11 changes the behavior of the GB locale when formatting AM/PM by lower-casing it (not sure about other locales, but the US locale doesn't change); see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/539/testReport/org.apache.spark.sql/SQLQueryTestSuite/datetime_legacy_sql/. I can't believe it, but that's the reality...

Note: users can still hit this Java 11 behavior change if they set locale to GB manually. It's true for Spark 2.4 as well, but Spark 2.4 doesn't support Java 11.

I'm reverting it from master/3.0, we need to find another way to fix the reported behavior change (or accept it).
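A minimal reproduction of the JDK-version dependence (a sketch; the output depends on the JDK's locale data, since JDK 9+ defaults to CLDR):

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import java.util.Locale

object AmPmDemo extends App {
  val t = LocalTime.of(9, 0)
  // On JDK 8 both print "AM"; on JDK 11 the en-GB CLDR data prints lower-case "am"
  println(t.format(DateTimeFormatter.ofPattern("a", Locale.US)))
  println(t.format(DateTimeFormatter.ofPattern("a", Locale.UK)))
}
```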

@yaooqinn (Member Author) commented Jun 3, 2020

Thanks for taking care of this. It looks like the culture in GB has changed a lot between JDK 8 and 11, lol...

Since we have banned 'u' (localized day-of-week) for parsing, maybe 'E' is better than 'e' for substitution.

@dongjoon-hyun (Member) commented:

Thank you for recovering Java 11!

@bart-samwel commented:

> Thanks for taking care of this. It looks like the culture in GB has changed a lot between JDK 8 and 11, lol...
>
> Since we have banned 'u' (localized day-of-week) for parsing, maybe 'E' is better than 'e' for substitution.

I'm thinking that since we can't properly support the old behavior of u, we're better off forbidding it, because otherwise we're doing a silent behavior change. The error can suggest using e or E instead, with a warning about the potential differences.

@yaooqinn (Member Author) commented Jun 4, 2020

> > Thanks for taking care of this. It looks like the culture in GB has changed a lot between JDK 8 and 11, lol...
> > Since we have banned 'u' (localized day-of-week) for parsing, maybe 'E' is better than 'e' for substitution.
>
> I'm thinking that since we can't properly support the old behavior of u, we're better off forbidding it, because otherwise we're doing a silent behavior change. The error can suggest using e or E instead, with a warning about the potential differences.

FYI, #28719 (comment)

cloud-fan pushed a commit that referenced this pull request Jun 5, 2020
…ormatting too

### What changes were proposed in this pull request?

After all these attempts, #28692, #28719 and #28727, they all have limitations, as mentioned in their discussions.

Maybe the only way is to forbid them all.

### Why are the changes needed?

These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is US, so the result moves a day, a week, or even a year backward.

e.g.

For the date `2019-12-29` (Sunday): in a Sunday-start system (e.g. en-US) it belongs to week-based-year 2020, while in a Monday-start system (en-GB) it belongs to 2019. The week-of-week-based-year (w) is affected too:

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?
With this change, users can no longer use 'Y', 'w', 'u' or 'W', but can use 'e' instead of 'u'. This at least turns the issue into an explicit error rather than a silent data change.

### How was this patch tested?

add unit tests

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jun 5, 2020
…ormatting too

Closes apache#28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 9d5b5d0)
Signed-off-by: Wenchen Fan <[email protected]>
