Conversation

@yaooqinn (Member) commented Jun 4, 2020

What changes were proposed in this pull request?

After all these attempts (#28692, #28719, and #28727), which all have limitations as discussed in their threads, maybe the only way left is to forbid the week-based fields altogether.

Why are the changes needed?

These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.

From the Javadoc of WeekFields:

```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```
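
To make the locale dependence concrete, here is a minimal Scala sketch against the plain java.time API (nothing Spark-specific; the values match the SQL examples below):

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields
import java.util.Locale

val date = LocalDate.of(2019, 12, 29) // a Sunday

// The first day-of-week differs per locale...
println(WeekFields.of(Locale.US).getFirstDayOfWeek) // SUNDAY
println(WeekFields.of(Locale.UK).getFirstDayOfWeek) // MONDAY

// ...and so does the week-based-year of the very same date.
println(date.get(WeekFields.of(Locale.US).weekBasedYear)) // 2020
println(date.get(WeekFields.of(Locale.UK).weekBasedYear)) // 2019
```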

But for SimpleDateFormat, the day-of-week is not localized:

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```
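
And a matching Scala sketch of the legacy behavior, where SimpleDateFormat's 'u' ignores the locale entirely:

```scala
import java.text.SimpleDateFormat
import java.util.Locale

val sunday = new SimpleDateFormat("yyyy-MM-dd").parse("2019-12-29")

// 'u' is hard-wired to 1 = Monday, ..., 7 = Sunday, whatever the locale:
println(new SimpleDateFormat("u", Locale.US).format(sunday)) // 7
println(new SimpleDateFormat("u", Locale.UK).format(sunday)) // 7
```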

Currently, the default locale we use is en-US, so the result can shift backward by a day, a week, or even a year.

For example, take the date 2019-12-29 (a Sunday): in a Sunday-start system (e.g. en-US) it belongs to week-based-year 2020, while in a Monday-start system (e.g. en-GB) it stays in 2019. The week-of-week-based-year field (w) is affected too:

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```
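
The same shift can be reproduced with java.time alone; a minimal Scala sketch (note that plain java.time uses 'e' for the localized day-of-week, which Spark's datetime pattern table treats 'u' as equivalent to):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val d = LocalDate.of(2019, 12, 29)

// Y = week-based-year, w = week-of-week-based-year, e = localized day-of-week
println(DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.US).format(d)) // 2020-01-01
println(DateTimeFormatter.ofPattern("YYYY-ww-ee", Locale.UK).format(d)) // 2019-52-07
```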

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671).

Does this PR introduce any user-facing change?

With this change, users can no longer use 'Y', 'w', 'u', or 'W' in datetime patterns; 'e' can be used in place of 'u'. At the very least, this turns a silent data change into an explicit error.
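
A hypothetical spark-shell sketch of the new behavior (the exception mirrors the unit tests in this PR; the exact output and error message are illustrative):

```scala
// Week-based letters now fail fast at formatter creation...
spark.sql(
  "SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), " +
    "map('timestampFormat', 'YYYY-ww-uu'))").show()
// => throws SparkUpgradeException instead of silently emitting
//    locale-dependent output

// ...while the localized day-of-week can still be formatted with 'e':
spark.sql(
  "SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), " +
    "map('timestampFormat', 'ee'))").show()
// => 01 under the default en-US locale (Sunday is day 1)
```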

How was this patch tested?

Added unit tests.

@yaooqinn yaooqinn marked this pull request as draft June 4, 2020 14:27
@yaooqinn yaooqinn marked this pull request as ready for review June 4, 2020 14:45
```diff
     formatter.format(LocalDate.of(2000, 1, 1)) == "1 1"
   }
-  final val unsupportedLetters = Set('A', 'c', 'e', 'n', 'N', 'p')
+  final val unsupportedLetters = Set('A', 'c', 'n', 'N', 'p', 'Y', 'W', 'w', 'u')
```
Contributor:

how about we add a new list weekBasedLetters? These letters deserve a dedicated error message that suggests users use the SQL function EXTRACT instead.
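
A rough sketch of what that split could look like (names and message wording are illustrative, not the actual Spark code; quoted literals inside patterns are ignored for brevity):

```scala
object PatternValidator {
  // Week-based letters get their own list and a dedicated error message
  // pointing users at the EXTRACT SQL function.
  final val weekBasedLetters = Set('Y', 'W', 'w', 'u')
  final val unsupportedLetters = Set('A', 'c', 'n', 'N', 'p')

  def validate(pattern: String): Unit = pattern.foreach {
    case c if weekBasedLetters.contains(c) =>
      throw new IllegalArgumentException(
        s"Week-based pattern letter '$c' is unsupported; " +
          "please use the SQL function EXTRACT instead.")
    case c if unsupportedLetters.contains(c) =>
      throw new IllegalArgumentException(s"Illegal pattern character: '$c'")
    case _ => // all other characters pass through
  }
}
```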

```scala
    intercept[SparkUpgradeException](formatter.parse("02-29"))
  }

  override def checkFormatterCreation(pattern: String, isParsing: Boolean): Unit = {
```
Contributor:

can we move it to the beginning of this class, to avoid code conflicts when we add new tests?

```scala
      .foreach { pattern =>
        intercept[SparkUpgradeException](TimestampFormatter(pattern, UTC).format(0))
      }

  override def checkFormatterCreation(pattern: String, isParsing: Boolean): Unit = {
```
Contributor:

ditto

@SparkQA commented Jun 4, 2020

Test build #123534 has finished for PR 28728 at commit c8ac3f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn yaooqinn changed the title [SPARK-31879][SQL] Make week-based pattern invalid for formatting too [SPARK-31879][SQL][test-java11] Make week-based pattern invalid for formatting too Jun 4, 2020
@yaooqinn (Member, Author) commented Jun 4, 2020

retest this please

|Symbol|Meaning|Presentation|Examples|
|------|-------|------------|--------|
|**W**|week-of-month|number(1)|4|
|**E**|day-of-week|text|Tue; Tuesday|
|**u**|localized day-of-week|number/text|2; 02; Tue; Tuesday|
|**e**|localized day-of-week|number/text|2; 02; Tue; Tuesday|
Contributor:

shall we forbid e as well? It's not supported in 2.4 either.

Member (Author):

Forgot to update the doc, thanks for the reminder.

@cloud-fan (Contributor)

I'm fine with it. What do you think? @bart-samwel @maropu @MaxGekk @HyukjinKwon @dongjoon-hyun

@SparkQA commented Jun 4, 2020

Test build #123536 has finished for PR 28728 at commit ca79e03.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 4, 2020

Test build #123535 has finished for PR 28728 at commit 7b95d6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 4, 2020

Test build #123537 has finished for PR 28728 at commit 7671d96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 4, 2020

Test build #123538 has finished for PR 28728 at commit d7fc6d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bart-samwel

@cloud-fan Yes, let's forbid the legacy week-based letters that were non-localized. We can keep the new format letters; those are effectively new features.

@cloud-fan (Contributor)

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 9d5b5d0 Jun 5, 2020
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jun 5, 2020
[SPARK-31879][SQL][test-java11] Make week-based pattern invalid for formatting too

Closes apache#28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 9d5b5d0)
Signed-off-by: Wenchen Fan <[email protected]>