Conversation

@yaooqinn (Member) commented Jun 1, 2020

What changes were proposed in this pull request?

This PR switches the default Locale from `US` to `GB` to change the first day of the week from Sunday-start back to Monday-start, the same as v2.4.

Why are the changes needed?

Cases:

```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```
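For reference, the same first-day shift is reproducible outside Spark with plain java.time (a minimal sketch, not from the PR; in the new API the legacy `u` is expressed with the localized letter `e`):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object WeekParseDemo extends App {
  // 'Y' = week-based-year, 'w' = week-of-week-based-year, 'e' = localized day-of-week
  val us = DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.US)
  val gb = DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.UK) // Locale.UK is Java's en-GB constant
  println(LocalDate.parse("2020-01-01", us)) // 2019-12-29 (week 1 starts on Sunday)
  println(LocalDate.parse("2020-01-01", gb)) // 2019-12-30 (week 1 starts on Monday)
}
```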

Reasons:

These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.

From the Java doc of WeekFields

```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is US, so the result moves a day backward.

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671).

With this change, the first-day-of-week calculation for these functions is restored when using the default locale.
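The locale dependence itself is easy to verify with `WeekFields` (a minimal sketch, assuming plain java.time outside Spark):

```scala
import java.time.temporal.WeekFields
import java.util.Locale

object FirstDayDemo extends App {
  // The first day-of-week is derived from the locale's week definition
  println(WeekFields.of(Locale.US).getFirstDayOfWeek) // SUNDAY
  println(WeekFields.of(Locale.UK).getFirstDayOfWeek) // MONDAY, matching ISO-8601
}
```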

Does this PR introduce any user-facing change?

Yes, but the behavior change restores the old behavior of v2.4.

How was this patch tested?

add unit tests

```scala
/**
* This is changed from Locale.US to GB, because:
* The first day-of-week varies by culture.
* For example, the US uses Sunday, while the United Kingdom and the ISO-8601 standard use Monday.
```
Contributor:

is this the only difference between US and en-GB?

Member Author:

It seems that we don't need to care about the other differences, e.g. the currency symbol.
FYI:
http://www.localeplanet.com/java/en-GB/index.html and http://www.localeplanet.com/java/en-US/index.html

The timeZone is not the same, but that's a separate field; we don't get it from the Locale, right?

Contributor:

Is there any localized timezone-related pattern letter?

Member Author:

We have O, OOOO and ZZZZ that are localized, but they are decided by the zone ID (spark.sql.session.timeZone), not by the Locale here.
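For example (a sketch, not from the PR), the localized offset text follows the zone ID rather than the locale:

```scala
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter
import java.util.Locale

object OffsetDemo extends App {
  val ts = ZonedDateTime.of(2020, 6, 1, 0, 0, 0, 0, ZoneId.of("Asia/Shanghai"))
  // Both locales print the same text, because the offset comes from the zone ID
  println(ts.format(DateTimeFormatter.ofPattern("OOOO", Locale.US))) // GMT+08:00
  println(ts.format(DateTimeFormatter.ofPattern("OOOO", Locale.UK))) // GMT+08:00
}
```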

@SparkQA commented Jun 1, 2020

Test build #123369 has finished for PR 28692 at commit ab0517d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```scala
val defaultLocale: Locale = Locale.US
/**
* This is changed from Locale.US to GB, because:
```
Contributor:

Let's make the doc shorter:

Before Spark 3.0, the first day-of-week is always Monday. Since Spark 3.0, it depends on the locale.
We pick GB as the default locale instead of US, to be compatible with Spark 2.x, as US locale uses
Sunday as the first day-of-week. See SPARK-31879.

@srowen (Member) commented Jun 1, 2020

Wow, really? Monday is considered the first day of week in the US locale?
I'm a little uneasy about defaulting to GB, as it might have other implications.
What does the root locale do here?

@SparkQA commented Jun 1, 2020

Test build #123378 has finished for PR 28692 at commit 3339274.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Jun 2, 2020

retest this please

@SparkQA commented Jun 2, 2020

Test build #123410 has finished for PR 28692 at commit 3339274.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

Good idea using the ROOT locale! @yaooqinn can you try it?

@yaooqinn (Member Author) commented Jun 2, 2020

Locale.ROOT was my first choice, but it didn't work. However, even if it worked, I don't think it's a good idea for a distributed system.

@yaooqinn (Member Author) commented Jun 2, 2020

```diff
+  val defaultLocale: Locale = Locale.ROOT

   def defaultPattern(): String = s"${DateFormatter.defaultPattern} HH:mm:ss"

diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
index a9a3bccadc..94fcc3b4ad 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out
@@ -1032,4 +1032,4 @@ select to_timestamp('2020-01-01', 'YYYY-ww-uu')
 -- !query schema
 struct<to_timestamp(2020-01-01, YYYY-ww-uu):timestamp>
 -- !query output
-2019-12-30 00:00:00
+2019-12-29 00:00:00
```

@cloud-fan (Contributor) commented:

It's a bit unfortunate that the ROOT locale also uses Sunday as the first day. I checked the pattern string doc (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html); there are only 2 localized pattern letters: e (localized day-of-week) and O (localized zone-offset). US and GB have no difference for these two.
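(For reference, a quick way to check this with plain java.time; a sketch, and note that locale data can differ across JDK versions and providers:)

```scala
import java.time.temporal.WeekFields
import java.util.Locale

object RootLocaleCheck extends App {
  // On the JDK 8 setup discussed here, ROOT also reports a Sunday-start week
  println(WeekFields.of(Locale.ROOT).getFirstDayOfWeek)
}
```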

@cloud-fan (Contributor) commented:

retest this please

@SparkQA commented Jun 2, 2020

Test build #123413 has finished for PR 28692 at commit 3339274.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Jun 2, 2020

retest this please

@yaooqinn (Member Author) commented Jun 2, 2020

retest this please

@SparkQA commented Jun 2, 2020

Test build #123421 has finished for PR 28692 at commit 3339274.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```sql
select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd');

-- SPARK-31879: the first day of week
select to_timestamp('2020-01-01', 'YYYY-ww-uu');
```
Contributor:

Can we use formatting in the test, so this doesn't conflict with #28706?

@srowen (Member) commented Jun 2, 2020

Just for my info, why was Monday the first day of week in 2.4 then? We didn't use TZ to determine it?
Also, is that the right answer from a SQL standard or Hive perspective?

@yaooqinn (Member Author) commented Jun 2, 2020

> why was Monday the first day of week in 2.4 then?

In Spark 2.4 and earlier, datetime parsing and formatting were performed by the old Java 7 SimpleDateFormat API, and the Monday-as-first-day-of-week behavior was decided by SimpleDateFormat. Since Spark 3.0, we have switched to the new Java 8 DateTimeFormatter to use the Proleptic Gregorian calendar, which is required by the ISO and SQL standards.

> we didn't use TZ to determine it?

The first day of week is a date field, not a time field; TZ is short for time zone.

> Also is that the right answer from a SQL standard

The first day of week is not directly decided by the ANSI SQL framework but by its normative reference, the ISO standard, where the week starts on Monday. The day-of-week in DateTimeFormatter has become localized, and by default we use Locale.US, where a week starts on Sunday. We have to restore this for backward compatibility. There may be more solution options at SPARK-31879.

> or Hive perspective?

I think the backward compatibility between 3.0 and 2.4 shall go first.

@srowen (Member) commented Jun 2, 2020

Oops, right, I meant Locale, not TZ.
OK, makes sense.

@SparkQA commented Jun 2, 2020

Test build #123433 has finished for PR 28692 at commit f94c72b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

@yaooqinn please also create a PR to explain what changed in the meaning of `u` in the migration guide. I'm merging this PR to master/3.0 first, thanks!

@cloud-fan closed this in c59f51b on Jun 3, 2020
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jun 3, 2020
This PR switches the default Locale from `US` to `GB` to change the first day of the week from Sunday-start back to Monday-start, the same as v2.4.

Closes apache#28692 from yaooqinn/SPARK-31879.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c59f51b)
Signed-off-by: Wenchen Fan <[email protected]>
@bart-samwel commented:

@yaooqinn @cloud-fan Does this locale setting change anything else in the date parsing?

@cloud-fan (Contributor) commented:

As I mentioned in: #28692 (comment)

There are only 2 localized pattern letters: e (localized day-of-week) and O (localized zone-offset). @yaooqinn tested it locally; there is no difference between US and GB for these 2 pattern letters, except that GB treats Monday as the first day of week.

@cloud-fan (Contributor) commented Jun 3, 2020

> please also create a PR to explain what gets changed in the meaning of u in the migration guide.

@bart-samwel maybe you were confused by this. The meaning was already changed when we switched to the Java 8 formatter API. In the legacy formatter, u means "Day number of week (1 = Monday, ..., 7 = Sunday)", which is a number; in the new formatter, we replace u with e, but e is the localized day-of-week, which is number/text.

There is no pattern letter in the Java 8 formatter API with the same meaning as the legacy u; e is the closest. Parsing is OK, as we ban all week-based pattern letters in parsing, see #28706.
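The contrast between the legacy u and the new e is easy to see side by side (a minimal sketch with plain Java APIs, outside Spark):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object DayOfWeekPatternDemo extends App {
  // Legacy Java 7 API: 'u' is a fixed day number (1 = Monday ... 7 = Sunday), locale-independent
  val legacy = new SimpleDateFormat("u")
  val sunday = new SimpleDateFormat("yyyy-MM-dd").parse("2019-12-29")
  println(legacy.format(sunday)) // 7

  // Java 8 API: 'e' is the localized day-of-week; the number depends on the locale's week start
  val d = LocalDate.of(2019, 12, 29) // a Sunday
  println(d.format(DateTimeFormatter.ofPattern("e", Locale.US))) // 1 (US weeks start on Sunday)
  println(d.format(DateTimeFormatter.ofPattern("e", Locale.UK))) // 7 (GB weeks start on Monday)
}
```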

@yaooqinn (Member Author) commented Jun 3, 2020

Hi @bart-samwel, we are improving test coverage for the datetime patterns. Can you help review here https://github.com/apache/spark/pull/28718/files#diff-e342ba37e6a036a13fa8373de2ea9470R1047 ?

@cloud-fan (Contributor) commented Jun 3, 2020

This breaks the Jenkins Java 11 job, because Java 11 changes the behavior of the GB locale when formatting AM/PM by lower-casing it (not sure about other locales, but the US locale doesn't change); see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/539/testReport/org.apache.spark.sql/SQLQueryTestSuite/datetime_legacy_sql/. I can't believe it, but that's the reality...

Note: users can still hit this Java 11 behavior change if they set locale to GB manually. It's true for Spark 2.4 as well, but Spark 2.4 doesn't support Java 11.

I'm reverting it from master/3.0, we need to find another way to fix the reported behavior change (or accept it).
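A minimal reproduction of the JDK-version dependence (a sketch; the output depends on the JDK's locale data, since JDK 9+ defaults to CLDR):

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import java.util.Locale

object AmPmDemo extends App {
  val t = LocalTime.of(9, 0)
  // On JDK 8 both print "AM"; on JDK 11 the en-GB CLDR data prints lower-case "am"
  println(t.format(DateTimeFormatter.ofPattern("a", Locale.US)))
  println(t.format(DateTimeFormatter.ofPattern("a", Locale.UK)))
}
```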

@yaooqinn (Member Author) commented Jun 3, 2020

Thanks for taking care of this. It looks like the culture in GB has changed a lot between JDK 8 and 11, lol...

Since we have banned 'u' (localized day-of-week) for parsing, maybe 'E' is better than 'e' for substitution.

@dongjoon-hyun (Member) commented:

Thank you for recovering Java 11!

@bart-samwel commented:

> Thanks for taking care of this. It looks like the culture in GB has changed a lot between JDK 8 and 11, lol...
>
> Since we have banned 'u' (localized day-of-week) for parsing, maybe 'E' is better than 'e' for substitution.

I'm thinking that since we can't properly support the old behavior of u, we're better off forbidding it, because otherwise we're doing a silent behavior change. The error can suggest using e or E instead, with a warning about the potential differences.

@yaooqinn (Member Author) commented Jun 4, 2020

> > Thanks for taking care of this. It looks like the culture in GB has changed a lot between JDK 8 and 11, lol...
> > Since we have banned 'u' (localized day-of-week) for parsing, maybe 'E' is better than 'e' for substitution.
>
> I'm thinking that since we can't properly support the old behavior of u, we're better off forbidding it, because otherwise we're doing a silent behavior change. The error can suggest using e or E instead, with a warning about the potential differences.

FYI, #28719 (comment)

cloud-fan pushed a commit that referenced this pull request Jun 5, 2020
…ormatting too

### What changes were proposed in this pull request?

After all these attempts, #28692, #28719 and #28727, they all have limitations, as mentioned in their discussions.

Maybe the only way is to forbid them all.

### Why are the changes needed?

These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is US, so the result moves a day, a week, or even a year backward.

e.g.

For the date `2019-12-29` (Sunday): in a Sunday-start system (e.g. en-US) it belongs to week-based-year 2020, while in a Monday-start system (en-GB) it belongs to 2019. The week-of-week-based-year (w) is affected too:

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?
With this change, users can no longer use 'Y', 'w', 'u' or 'W', but can use 'e' instead of 'u'. This at least turns the issue into an explicit error rather than a silent data change.

### How was this patch tested?

add unit tests

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jun 5, 2020
…ormatting too

Closes apache#28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 9d5b5d0)
Signed-off-by: Wenchen Fan <[email protected]>
