
Conversation

@yaooqinn
Member

What changes were proposed in this pull request?

In 3.0, the new datetime parser cannot resolve a date from the week-based-year field, so every input silently collapses to the 1970 epoch default.

```sql
spark-sql> explain select to_timestamp('1969-01-01', 'YYYY-MM-dd');
== Physical Plan ==
*(1) Project [-28800000000 AS to_timestamp(1969-01-01, YYYY-MM-dd)#37]
+- *(1) Scan OneRowRelation[]

spark-sql> explain select to_timestamp('2000-01-01', 'YYYY-MM-dd');
== Physical Plan ==
*(1) Project [-28800000000 AS to_timestamp(2000-01-01, YYYY-MM-dd)#73]
+- *(1) Scan OneRowRelation[]
```
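What happens under the hood can be sketched outside Spark with plain java.time (a Scala sketch, not Spark's actual parser code): the formatter parses the week-based-year ('Y') plus month/day, but java.time cannot resolve those fields into a LocalDate, so Spark's parser falls through to its epoch defaults.

```scala
import java.time.format.DateTimeFormatter
import java.time.temporal.TemporalQueries

// 'Y' parses a week-based-year; with no week-of-year and day-of-week,
// and no plain year to anchor the month/day fields, java.time cannot
// resolve the parsed fields into a LocalDate.
val parsed = DateTimeFormatter.ofPattern("YYYY-MM-dd").parse("1969-01-01")
val date = parsed.query(TemporalQueries.localDate)
println(date) // null -- no date resolved, hence the fall-through to 1970
```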

In legacy mode, a.k.a. the 2.4 behavior, the result is always the last Sunday of the previous year (which is weird too!):

```sql
spark-sql> select to_timestamp('1969-01-01', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-02', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-03', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-04', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-05', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-06', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-07', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-31', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-02-28', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-12-28', 'YYYY-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-12-31', 'YYYY-MM-dd');
1968-12-29 00:00:00
```
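This legacy result can be reproduced outside Spark with SimpleDateFormat (a sketch assuming Locale.US and UTC, not necessarily Spark 2.4's exact session settings): during calendar resolution the week year from 'Y' overrides the month and day fields, and the date resolves to the first day (Sunday) of week 1 of week-year 1969.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

val utc = TimeZone.getTimeZone("UTC")
// 'Y' (week year) wins over 'MM-dd' when the calendar resolves fields,
// so the month and day from the input are silently ignored.
val parser = new SimpleDateFormat("YYYY-MM-dd", Locale.US)
parser.setTimeZone(utc)
val printer = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
printer.setTimeZone(utc)
val resolved = printer.format(parser.parse("1969-01-31"))
println(resolved) // 1968-12-29, regardless of the MM-dd in the input
```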

In this PR, I propose to restore the 2.4 behavior until we reach a final conclusion, so that this does not block releasing 3.0, since the 3.0 behavior is even weirder.

FYI, Postgres may set a good example here: https://www.postgresql.org/docs/9.0/functions-formatting.html

> **Caution**: While to_date will reject a mixture of Gregorian and ISO week-numbering date fields, to_char will not, since output format specifications like YYYY-MM-DD (IYYY-IDDD) can be useful. But avoid writing something like IYYY-MM-DD; that would yield surprising results near the start of the year.

Why are the changes needed?

Bug fix, restoring the 2.4 behavior.

Does this PR introduce any user-facing change?

No, the behavior will be restored.

How was this patch tested?

Added new tests.


@SparkQA

SparkQA commented May 29, 2020

Test build #123295 has finished for PR 28674 at commit f2230b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Can we define the 2.4 behavior first? The behavior looks very weird to me. If it doesn't make sense, we can narrow down the scope or even forbid it.

@yaooqinn
Member Author

yaooqinn commented Jun 1, 2020

> Can we define the 2.4 behavior first? The behavior looks very weird to me.

I found the 2.4 behavior very difficult to follow, and it is hard to fully define.

> If it doesn't make sense, we can narrow down the scope or even forbid it.

I agree that we can narrow down the scope of our new formatter.

First of all, I suggest that we make the 'Y' field fail when combined with non-week-based date fields (y/M/L/d, etc.), since those fields are currently silently ignored.

@cloud-fan
Contributor

This makes sense to me: week-based date fields should not co-exist with non-week-based date fields. We should also clearly define the default values used when the year, month, or day is not specified.
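A naive sketch of such a rejection at the pattern level (hypothetical helper and letter sets, not Spark's actual implementation; a real check would also need to skip 'quoted literals' in the pattern):

```scala
// Hypothetical pattern-level check: reject patterns that mix week-based /
// day-of-week letters with plain date letters. NOTE: this naive scan
// does not skip text inside 'quoted literals'.
val weekBasedLetters = Set('Y', 'W', 'w', 'u', 'F', 'E')
val dateLetters = Set('y', 'M', 'L', 'd', 'D')

def mixesWeekAndDateFields(pattern: String): Boolean =
  pattern.exists(weekBasedLetters) && pattern.exists(dateLetters)

println(mixesWeekAndDateFields("YYYY-MM-dd")) // true  -- would be rejected
println(mixesWeekAndDateFields("yyyy-MM-dd")) // false -- fine
```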

@SparkQA

SparkQA commented Jun 1, 2020

Test build #123368 has finished for PR 28674 at commit 9e91729.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2020

Test build #123366 has finished for PR 28674 at commit bdab8bd.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

```scala
val weekBased = mayWeekBased(accessor, weekFields)
if (weekBased && mayNonWeekBased(accessor)) {
  throw new DateTimeException(
    s"Can not mix week-based and non-week-based date fields together for parsing dates")
}
```
Contributor

should this happen when we create the formatter?

@SparkQA

SparkQA commented Jun 1, 2020

Test build #123376 has finished for PR 28674 at commit 0dcd5df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in afe95bd Jun 3, 2020
cloud-fan pushed a commit that referenced this pull request Jun 3, 2020
This PR disables week-based date fields for parsing.

closes #28674

1. It's an unfixable behavior change: we cannot fill the gap between SimpleDateFormat and DateTimeFormatter while keeping backward compatibility across different JDKs. A lot of effort has been made to prove this at #28674.

2. The existing behavior itself in 2.4 is confusing, e.g.

```sql
spark-sql> select to_timestamp('1', 'w');
1969-12-28 00:00:00
spark-sql> select to_timestamp('1', 'u');
1970-01-05 00:00:00
```
  The 'u' result here goes neither to the Monday of the first week (if interpreted as week-based) nor to the first day of the year (if non-week-based), but to the Monday of the second week in week-based form.

And, e.g.
```sql
spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy');
2020-01-01 00:00:00
spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY');
2019-12-29 00:00:00
spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w');
NULL
spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w');
2019-12-29 00:00:00
```

  I think we don't need to introduce all the weird behavior from Java.
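These two results follow from SimpleDateFormat's field-resolution order: whichever of the plain year ('yyyy') and the week year ('YYYY') is established later wins (a sketch, again assuming Locale.US and UTC rather than Spark 2.4's exact session settings):

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Parse `input` with `pattern` and render the result as a plain date.
def parseAs(pattern: String, input: String): String = {
  val utc = TimeZone.getTimeZone("UTC")
  val parser = new SimpleDateFormat(pattern, Locale.US)
  parser.setTimeZone(utc)
  val printer = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
  printer.setTimeZone(utc)
  printer.format(parser.parse(input))
}

println(parseAs("YYYY yyyy", "2020 2020")) // 2020-01-01 -- plain year wins
println(parseAs("yyyy YYYY", "2020 2020")) // 2019-12-29 -- week year wins
```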

3. The current test coverage for week-based date fields is almost 0%, which indicates that we never really expected them to be used.

4. Avoiding JDK bugs

https://issues.apache.org/jira/browse/SPARK-31880

Yes, the 'Y/W/w/u/F/E' patterns can no longer be used in datetime parsing functions.

More tests added.

Closes #28706 from yaooqinn/SPARK-31892.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit afe95bd)
Signed-off-by: Wenchen Fan <[email protected]>
@HyukjinKwon
Member

Merged to master and branch-3.0.

@cloud-fan
Contributor

@HyukjinKwon this PR is closed by afe95bd

It's not merged...

@HyukjinKwon
Member

Okay, understood now... let's be diligent about updating related JIRAs btw, in particular the blockers. So SPARK-31868 is a won't-fix?

@cloud-fan
Copy link
Contributor

yea, let me close it.
