[SPARK-31879][SQL] Using GB as default Locale for datetime formatters #28692
Conversation
```scala
/**
 * This is change from Locale.US to GB, because:
 * The first day-of-week varies by culture.
 * For example, the US uses Sunday, while the United Kingdom and the ISO-8601 standard use Monday.
```
is this the only difference between US and en-GB?
It seems that we don't need to care about the other differences, e.g. the currency symbol.
FYI,
http://www.localeplanet.com/java/en-GB/index.html and http://www.localeplanet.com/java/en-US/index.html
the timeZone is not the same, but it's a separate field; we don't get it from Locale, right?
is there any localized timezone-related pattern letter?
We have O, OOOO and ZZZZ that are localized, but they are decided by the zoneId (spark.sql.session.timeZone), not related to Locale here.
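For reference, a minimal Scala sketch (mine, not from the PR) of that point: the localized-offset letter `OOOO` follows the zone offset, so swapping the locale doesn't change the result for these English locales.

```scala
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter
import java.util.Locale

object OffsetLetterDemo extends App {
  // Fixed instant in a fixed zone; only the Locale varies below.
  val ts = ZonedDateTime.of(2020, 1, 1, 0, 0, 0, 0, ZoneId.of("Asia/Shanghai"))
  // 'OOOO' is the full localized zone offset; both lines print "GMT+08:00".
  println(ts.format(DateTimeFormatter.ofPattern("OOOO", Locale.US)))
  println(ts.format(DateTimeFormatter.ofPattern("OOOO", Locale.UK)))
}
```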
Test build #123369 has finished for PR 28692 at commit
```scala
val defaultLocale: Locale = Locale.US

/**
 * This is change from Locale.US to GB, because:
```
Let's make the doc shorter
Before Spark 3.0, the first day-of-week is always Monday. Since Spark 3.0, it depends on the locale.
We pick GB as the default locale instead of US, to be compatible with Spark 2.x, as US locale uses
Sunday as the first day-of-week. See SPARK-31879.
Wow, really? Monday is considered the first day of week in the US locale?
Test build #123378 has finished for PR 28692 at commit
retest this please
Test build #123410 has finished for PR 28692 at commit
Good idea using the ROOT locale! @yaooqinn can you try it?
Locale.ROOT was my first choice, but it didn't work. However, even if it works, I don't think it's a good idea for a distributed system.
```diff
+ val defaultLocale: Locale = Locale.ROOT

def defaultPattern(): String = s"${DateFormatter.defaultPattern} HH:mm:ss"

diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
index a9a3bccadc..94fcc3b4ad 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
@@ -1032,4 +1032,4 @@ select to_timestamp('2020-01-01', 'YYYY-ww-uu')
 -- !query schema
 struct<to_timestamp(2020-01-01, YYYY-ww-uu):timestamp>
 -- !query output
-2019-12-30 00:00:00
+2019-12-29 00:00:00
```
It's a bit unfortunate that the ROOT locale also uses Sunday as the first day. I checked the pattern string doc: https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html , there are only 2 localized pattern letters:
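That claim about ROOT is easy to verify with a short sketch (not from the PR) against `WeekFields`, which is what the week-based pattern letters resolve to:

```scala
import java.time.temporal.WeekFields
import java.util.Locale

object RootLocaleDemo extends App {
  // The first day-of-week each locale's week-based fields are anchored to.
  println(WeekFields.of(Locale.ROOT).getFirstDayOfWeek) // SUNDAY
  println(WeekFields.of(Locale.UK).getFirstDayOfWeek)   // MONDAY
}
```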
retest this please
Test build #123413 has finished for PR 28692 at commit
retest this please |
retest this please
Test build #123421 has finished for PR 28692 at commit
```sql
select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd');

-- SPARK-31879: the first day of week
select to_timestamp('2020-01-01', 'YYYY-ww-uu');
```
Can we use formatting in the test, so this doesn't conflict with #28706?
Just for my info, why was Monday the first day of week in 2.4 then? We didn't use TZ to determine it?
In Spark version 2.4 and earlier, datetime parsing and formatting are performed by the old Java 7 `SimpleDateFormat` API.
The legacy `u` letter means the day number of week (1 = Monday, ..., 7 = Sunday) and is not localized.
The Java 8 formatter API has no pattern letter with that meaning; its `u` means year instead.
I think the backward compatibility between 3.0 and 2.4 should go first.
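To illustrate the difference, a sketch using plain JDK APIs (not Spark code):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object PatternUDemo extends App {
  val sunday = new SimpleDateFormat("yyyy-MM-dd").parse("2019-12-29")
  // Legacy Java 7 API: 'u' is the day number of week, 1 = Monday ... 7 = Sunday.
  println(new SimpleDateFormat("u", Locale.US).format(sunday)) // 7
  // Java 8 API: 'u' means year, so the legacy semantics have no direct equivalent.
  println(LocalDate.parse("2019-12-29").format(DateTimeFormatter.ofPattern("uuuu"))) // 2019
}
```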
Oops, right, I meant Locale, not TZ.
Test build #123433 has finished for PR 28692 at commit
@yaooqinn please also create a PR to explain what gets changed in the meaning of `u`.
This PR switches the default Locale from `US` to `GB` to change the behavior of the first day of the week from Sunday-started to Monday-started, the same as v2.4.
```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```
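The same divergence is visible directly at the JVM level (a sketch, not Spark code); here 'e' stands in for the legacy day-of-week number:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object WeekPatternDemo extends App {
  val d = LocalDate.of(2019, 12, 30) // a Monday
  // 'Y', 'w' and 'e' are all week-based fields, so the output depends on the locale.
  println(d.format(DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.US))) // 2020-01-02
  println(d.format(DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.UK))) // 2020-01-01
}
```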
These week-based fields need a Locale to express their semantics, as the first day of the week varies from country to country.
From the Java doc of WeekFields
```java
/**
* Gets the first day-of-week.
* <p>
* The first day-of-week varies by culture.
* For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
* This method returns the first day using the standard {@code DayOfWeek} enum.
*
* @return the first day-of-week, not null
*/
public DayOfWeek getFirstDayOfWeek() {
return firstDayOfWeek;
}
```
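As a quick check of that javadoc (a sketch, not part of the patch), the same date falls into different week-based years depending on the locale's first day-of-week:

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields
import java.util.Locale

object WeekFieldsDemo extends App {
  val d = LocalDate.of(2019, 12, 29) // a Sunday
  println(WeekFields.of(Locale.US).getFirstDayOfWeek)    // SUNDAY
  println(WeekFields.of(Locale.UK).getFirstDayOfWeek)    // MONDAY
  // The same date lands in different week-based years as a consequence.
  println(d.get(WeekFields.of(Locale.US).weekBasedYear)) // 2020
  println(d.get(WeekFields.of(Locale.UK).weekBasedYear)) // 2019
}
```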
But for the SimpleDateFormat, the day-of-week is not localized
```
u Day number of week (1 = Monday, ..., 7 = Sunday) Number 1
```
Currently, the default locale we use is the US, so the result moved a day backward.
For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)
With this change, the first-day-of-week calculation is restored for these functions when using the default locale.
Yes, but the behavior change restores the old behavior of v2.4.
add unit tests
Closes apache#28692 from yaooqinn/SPARK-31879.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c59f51b)
Signed-off-by: Wenchen Fan <[email protected]>
@yaooqinn @cloud-fan Does this locale setting change anything else in the date parsing?
As I mentioned in #28692 (comment), there are only 2 localized pattern letters:
@bart-samwel maybe you were confused by this. The meaning of `u` was already changed when we switched to the Java 8 formatter API. In the legacy formatter, `u` means the day number of week (1 = Monday, ..., 7 = Sunday). There is no pattern letter in the Java 8 formatter API with the same meaning as the legacy `u`.
Hi @bart-samwel, we are improving test coverage for the datetime patterns. Can you help review here https://github.com/apache/spark/pull/28718/files#diff-e342ba37e6a036a13fa8373de2ea9470R1047 ?
This breaks the Jenkins Java 11 job, because Java 11 changes the behavior of the GB locale (not sure about other locales, but the US locale doesn't change) when formatting AM/PM by lower-casing it, see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/539/testReport/org.apache.spark.sql/SQLQueryTestSuite/datetime_legacy_sql/ . I can't believe it, but that's the reality... Note: users can still hit this Java 11 behavior change if they set the locale to GB manually. This is true for Spark 2.4 as well, but Spark 2.4 doesn't support Java 11. I'm reverting it from master/3.0; we need to find another way to fix the reported behavior change (or accept it).
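The symptom can be reproduced outside Spark with a sketch like this (assuming stock JDK locale data):

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import java.util.Locale

object AmPmDemo extends App {
  val f = DateTimeFormatter.ofPattern("a", Locale.UK)
  // Prints "AM" on JDK 8 but "am" on JDK 11, which defaults to CLDR locale data.
  println(LocalTime.of(9, 0).format(f))
}
```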
Thanks for taking care of this. It looks like the culture in GB has changed a lot between JDK 8 and 11, lol... Since we have banned 'u' (localized day-of-week) for parsing, maybe 'E' is better than 'e' for substitution.
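For context on that substitution, a small sketch (not from the PR) of how 'E' and 'e' behave:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object DayOfWeekLettersDemo extends App {
  val d = LocalDate.parse("2019-12-29") // a Sunday
  // 'E' is the textual day-of-week and ignores the locale's first day-of-week...
  println(d.format(DateTimeFormatter.ofPattern("E", Locale.US))) // Sun
  // ...while 'e' is the localized day-of-week number, which depends on it.
  println(d.format(DateTimeFormatter.ofPattern("e", Locale.US))) // 1
  println(d.format(DateTimeFormatter.ofPattern("e", Locale.UK))) // 7
}
```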
Thank you for recovering Java 11!
I'm thinking that since we can't properly support the old behavior of `u`, maybe we should just forbid it.
FYI, #28719 (comment)
…ormatting too

# What changes were proposed in this pull request?

After all these attempts, #28692, #28719 and #28727, they all have limitations as mentioned in their discussions. Maybe the only way is to forbid them all.

### Why are the changes needed?

These week-based fields need a Locale to express their semantics, as the first day of the week varies from country to country.

From the Java doc of WeekFields

```java
/**
 * Gets the first day-of-week.
 * <p>
 * The first day-of-week varies by culture.
 * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
 * This method returns the first day using the standard {@code DayOfWeek} enum.
 *
 * @return the first day-of-week, not null
 */
public DayOfWeek getFirstDayOfWeek() {
    return firstDayOfWeek;
}
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u    Day number of week (1 = Monday, ..., 7 = Sunday)    Number    1
```

Currently, the default locale we use is the US, so the result moved a day, a week, or even a year backward.

E.g., for the date `2019-12-29` (a Sunday): in a Sunday-start system (e.g. en-US) it belongs to 2020 of the week-based-year; in a Monday-start system (en-GB) it goes to 2019. The week-of-week-based-year (`w`) is affected too:

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?

With this change, users can not use the week-based letters 'Y', 'w', 'u' and 'W'; 'e' can be used instead of 'u'. This at least turns a silent data change into an explicit error.

### How was this patch tested?

add unit tests

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Closes apache#28728 from yaooqinn/SPARK-31879-NEW2.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 9d5b5d0)
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR switches the default Locale from `US` to `GB` to change the behavior of the first day of the week from Sunday-started to Monday-started, the same as v2.4.

Why are the changes needed?
cases
reasons
These week-based fields need a Locale to express their semantics, as the first day of the week varies from country to country.
From the Java doc of WeekFields
But for the SimpleDateFormat, the day-of-week is not localized
Currently, the default locale we use is the US, so the result moved a day backward.
For other countries, please refer to First Day of the Week in Different Countries
With this change, the first-day-of-week calculation is restored for these functions when using the default locale.
Does this PR introduce any user-facing change?
Yes, but the behavior change restores the old behavior of v2.4.
How was this patch tested?
add unit tests