[SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range #32959
Conversation
```
-- !query output
java.time.DateTimeException
Cannot cast 7 to DateType.
728567 00:00:00.000000000
```
cc @cloud-fan, this is an existing mismatch between ANSI and non-ANSI mode. In non-ANSI mode, this query will throw an exception.
Let's still require at least 4 digits for the year.
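For illustration, a hedged sketch of what this requirement means for casts (assuming a spark-shell session and non-ANSI mode; the commented results are expectations under the proposed rule, not outputs taken from the PR):

```scala
// Year segment must still have at least 4 digits; other segments may have 1 or 2.
spark.sql("SELECT CAST('2021-1-2' AS DATE)").show()    // expected: 2021-01-02
spark.sql("SELECT CAST('21-1-2' AS DATE)").show()      // expected: NULL (year has fewer than 4 digits)
spark.sql("SELECT CAST('-10000-1-2' AS DATE)").show()  // expected: -10000-01-02 (5-digit negative year now in range)
```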
| "date.sql", | ||
| "datetime.sql", | ||
| "datetime-legacy.sql", | ||
| "ansi/datetime.sql", |
Same reason as "date.sql": the thriftserver couldn't handle negative years.
| test("SPARK-30960, SPARK-31641: parse date/timestamp string with legacy format") { | ||
| val julianDay = -141704 // 1582-01-01 in Julian calendar | ||
| val ds = Seq( | ||
| s"{'t': '2020-1-12 3:23:34.12', 'd': '2020-1-12 T', 'd2': '12345', 'd3': '$julianDay'}" |
'12345' and '-141704' were treated as epoch days before this PR because they are out of the 0000-9999 range.
This was kept for backward compatibility with JSON data generated by Spark 1.5.
But this compatibility is very confusing. For example, before this PR, '9999' would be converted to '9999-01-01' while '10000' would be converted to '1997-05-19'.
So I suggest just removing this compatibility.
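For reference, a small plain-JVM sketch (no Spark required) of what the old epoch-day fallback amounted to; `java.time.LocalDate.ofEpochDay` reproduces the surprising mapping described above:

```scala
import java.time.LocalDate

// The legacy fallback read bare numbers outside 0000-9999 as days since 1970-01-01.
println(LocalDate.ofEpochDay(10000L))    // 1997-05-19, which is why '10000' parsed to that date
println(LocalDate.ofEpochDay(12345L))    // 2003-10-20, what the '12345' in the test above maps to
println(LocalDate.ofEpochDay(-141704L))  // 1582-01-11 (proleptic Gregorian), i.e. 1582-01-01 in the Julian calendar
```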
cc @cloud-fan, comments are addressed and tests pass.
@linhongliu-db please fix the code conflicts.
Thanks, merging to master/3.2 (since it's timestamp related).
### What changes were proposed in this pull request?
DATE/TIMESTAMP literals support years 0000 to 9999. However, internally we support a much larger range.
We can add or subtract large intervals to or from a date/timestamp, and the system will happily process and display large negative and positive dates.
Since we obviously cannot put this genie back into the bottle, the only thing we can do is allow matching DATE/TIMESTAMP literals.
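For illustration (not part of the original description), a hedged spark-shell sketch of how ordinary date arithmetic already produces values outside 0000-9999; the exact rendering of the result may vary by Spark version:

```scala
// Hedged sketch, assuming a spark-shell session (`spark` in scope).
// date_add takes a day count; 150000 days past 9999-12-31 lands far beyond year 9999,
// yet Spark processes and displays the result without complaint.
spark.sql("SELECT date_add(DATE'9999-12-31', 150000) AS far_future").show(truncate = false)
```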
### Why are the changes needed?
Makes Spark more usable; also a bug fix.
### Does this PR introduce _any_ user-facing change?
Yes, after this PR the SQL below will have different results:
```sql
select cast('-10000-1-2' as date) as date_col
-- before PR: NULL
-- after PR: -10000-1-2
```
```sql
select cast('2021-4294967297-11' as date) as date_col
-- before PR: 2021-01-11
-- after PR: NULL
```
### How was this patch tested?
newly added test cases
Closes #32959 from linhongliu-db/SPARK-35780-full-range-datetime.
Lead-authored-by: Linhong Liu <[email protected]>
Co-authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b866457)
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
In PR #32959, we found some weird datetime strings that can be parsed. ([details](#32959 (comment))) This PR blocks the invalid datetime strings.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
Yes, the strings below will have different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp);    -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
Some new test cases.

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.
Authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ed0e351)
Signed-off-by: Wenchen Fan <[email protected]>
```scala
if (s == null) {
def isValidDigits(segment: Int, digits: Int): Boolean = {
  // An integer is able to represent a date within [+-]5 million years.
  var maxDigitsYear = 7
```
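For readability, here is a self-contained sketch of how a digit-count check along these lines could look; it is an approximation based on the quoted excerpt and the review discussion, not necessarily the exact merged code:

```scala
// Sketch: segment 0 is the year (4 to 7 digits); other segments allow 1 or 2 digits.
def isValidDigits(segment: Int, digits: Int): Boolean = {
  // An integer is able to represent a date within [+-]5 million years.
  val maxDigitsYear = 7
  (segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
    (segment != 0 && digits > 0 && digits <= 2)
}
```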
Can I implement a configuration item that configures the range of digits allowed for the year?
I found that writing to tables in different formats makes the results behave differently:
```sql
create table t(c1 date) stored as textfile;
insert overwrite table t select cast('22022-05-01' as date);
select * from t; -- output: NULL

create table t(c1 date) stored as orcfile;
insert overwrite table t select cast('22022-05-01' as date);
select * from t; -- output: +22022-05-01
```
This is because ORC/Parquet store dates as integers, while textfile and sequencefile store them as text.
But if you use Hive JDBC, the query will fail, because java.sql.Date only supports 4-digit years:
```
Caused by: java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:447
```
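A plain-JVM repro sketch of the client-side limitation mentioned above (no Hive needed); `java.sql.Date.valueOf` is what the Hive JDBC result set calls in the stack trace:

```scala
import java.sql.Date
import scala.util.Try

println(Try(Date.valueOf("2022-05-01")))   // Success(2022-05-01): 4-digit year is accepted
println(Try(Date.valueOf("22022-05-01")))  // Failure(java.lang.IllegalArgumentException): 5-digit year is rejected
```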
It's expected that not all data sources and BI clients support datetime values larger than 10000-01-01; the question is when the failure should happen.
It looks to me that the Hive table should fail to write 22022-05-01 with the textfile source, and Hive JDBC should fail at the client side, saying 22022-05-01 is not supported.
BTW, I don't think it's possible to add a Spark config to forbid large datetime values. The literal is just one place; there are many other datetime operations that may produce large datetime values, and they have been there since before this PR.
Thanks for the explanation, that makes sense.
There may be dates that users treated as abnormal in previous Spark versions which Spark 3.2 now handles normally, even though they are perfectly valid dates.
I mention this because I didn't see this behavior change in the migration guide before noticing this PR.
Yeah, the impact on BI clients was missed, though strictly speaking BI clients are not part of Spark.
…g dates in "yyyyMMdd" format with CORRECTED time parser policy
### What changes were proposed in this pull request?
This PR fixes a correctness issue when reading a CSV or a JSON file with dates in "yyyyMMdd" format:
```
name,mydate
1,2020011
2,20201203
```
or
```
{"date": "2020011"}
{"date": "20201203"}
```
Prior to #32959, reading this CSV file would return:
```
+----+--------------+
|name|mydate |
+----+--------------+
|1 |null |
|2 |2020-12-03 |
+----+--------------+
```
However, after the patch, the invalid date is parsed because of the much more lenient parsing in `DateTimeUtils.stringToDate`: the method treats `2020011` as a full year:
```
+----+--------------+
|name|mydate |
+----+--------------+
|1 |+2020011-01-01|
|2 |2020-12-03 |
+----+--------------+
```
A similar result would be observed in JSON.
This PR addresses the correctness issue by introducing a new configuration option `enableDateTimeParsingFallback`, which allows enabling or disabling the backward compatible parsing.
Currently, by default, we fall back to the backward compatible behavior only if the parser policy is legacy and no custom pattern was set (this is defined in `UnivocityParser` and `JacksonParser` for CSV and JSON respectively).
### Why are the changes needed?
Fixes a correctness issue in Spark 3.4.
### Does this PR introduce _any_ user-facing change?
In order to avoid correctness issues when reading CSV or JSON files with a custom pattern, a new configuration option `enableDateTimeParsingFallback` has been added to control whether the code falls back to the backward compatible behavior of parsing dates and timestamps in the CSV and JSON data sources.
- If the config is enabled and the date cannot be parsed, we will fall back to `DateTimeUtils.stringToDate`.
- If the config is enabled and the timestamp cannot be parsed, `DateTimeUtils.stringToTimestamp` will be used.
- Otherwise, depending on the parser policy and whether a custom pattern was set, the value will be parsed as null (see the usage sketch below).
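A hedged usage sketch (file path and schema are illustrative; it assumes a spark-shell session and that the option is passed to the reader, as the description suggests):

```scala
// With an explicit pattern and the fallback disabled, the malformed "2020011"
// from the sample file above is read as null instead of year +2020011.
val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "false")
  .schema("name INT, mydate DATE")
  .csv("/path/to/dates.csv")

df.show()
```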
### How was this patch tested?
I added unit tests for CSV and JSON to verify the fix and the config option.
Closes #37147 from sadikovi/fix-csv-date-inference.
Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>