Conversation

linhongliu-db (Contributor) commented Jun 18, 2021

What changes were proposed in this pull request?

DATE/TIMESTAMP literals support years 0000 to 9999. However, internally we support a much larger range. We can add or subtract large intervals from a date/timestamp, and the system will happily process and display large negative and positive dates.

Since we obviously cannot put this genie back into the bottle, the only thing we can do is allow matching DATE/TIMESTAMP literals.
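To make the existing behavior concrete, here is a minimal spark-shell sketch; the query is illustrative and the exact display format of out-of-range dates is an assumption:

```scala
// Interval arithmetic already yields dates outside 0000-9999, even though
// (before this PR) a literal for such a date could not be written directly.
// Illustrative sketch; exact output formatting is an assumption.
spark.sql("SELECT DATE'2021-06-18' - INTERVAL '3021' YEAR AS d").show(false)
// expected to display a negative-year date such as -1000-06-18
```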

Why are the changes needed?

Make Spark more usable; this is also a bug fix.

Does this PR introduce any user-facing change?

Yes, after this PR, the SQL below will produce different results:

```sql
select cast('-10000-1-2' as date) as date_col
-- before PR: NULL
-- after PR: -10000-1-2
```

```sql
select cast('2021-4294967297-11' as date) as date_col
-- before PR: 2021-01-11
-- after PR: NULL
```

How was this patch tested?

newly added test cases

linhongliu-db marked this pull request as draft June 18, 2021 03:59
github-actions bot added the SQL label Jun 18, 2021
SparkQA commented Jun 18, 2021

Test build #139960 has finished for PR 32959 at commit da0102b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 18, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44487/

linhongliu-db force-pushed the SPARK-35780-full-range-datetime branch from 6966058 to 633781f July 1, 2021 09:20
linhongliu-db changed the title from [WIP][SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range to [SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range Jul 1, 2021
```
-- !query output
java.time.DateTimeException
Cannot cast 7 to DateType.
728567 00:00:00.000000000
```
linhongliu-db (Contributor Author) commented:

cc @cloud-fan, this is an existing mismatch between ANSI and non-ANSI mode. In non-ANSI mode, this query will throw an exception.
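For context, a hedged sketch of how one might observe the two modes side by side; the config key `spark.sql.ansi.enabled` is real, but the exact query and error text are assumptions drawn from the output snippet above:

```scala
// Compare ANSI and non-ANSI behavior for the same cast (sketch only).
Seq(true, false).foreach { ansi =>
  spark.conf.set("spark.sql.ansi.enabled", ansi.toString)
  // One mode reports "Cannot cast 7 to DateType." at a different point
  // than the other, which is the mismatch being discussed.
  try spark.sql("SELECT CAST(7 AS DATE)").show()
  catch { case e: Exception => println(s"ansi=$ansi: ${e.getMessage}") }
}
```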

A contributor replied:

Let's still require at least 4 digits for the year.
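A minimal sketch of what the suggested check could look like; the helper name and string handling are hypothetical, not the actual patch:

```scala
// Require at least 4 digits for the year segment of a date string.
// Hypothetical helper, not the implementation in this PR.
def yearHasAtLeastFourDigits(dateStr: String): Boolean = {
  val year = dateStr.stripPrefix("+").stripPrefix("-").takeWhile(_.isDigit)
  year.length >= 4 // "0001-01-01" passes; "1-1-1" would be rejected
}
```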

linhongliu-db marked this pull request as ready for review July 1, 2021 10:10
SparkQA commented Jul 1, 2021

Test build #140505 has finished for PR 32959 at commit 6966058.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 1, 2021

Test build #140510 has finished for PR 32959 at commit 633781f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45023/

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45023/

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45040/

SparkQA commented Jul 1, 2021

Test build #140526 has finished for PR 32959 at commit b250bc7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45040/

"date.sql",
"datetime.sql",
"datetime-legacy.sql",
"ansi/datetime.sql",
linhongliu-db (Contributor Author) commented:

Same reason as for "date.sql": the Thrift server can't handle negative years.

test("SPARK-30960, SPARK-31641: parse date/timestamp string with legacy format") {
val julianDay = -141704 // 1582-01-01 in Julian calendar
val ds = Seq(
s"{'t': '2020-1-12 3:23:34.12', 'd': '2020-1-12 T', 'd2': '12345', 'd3': '$julianDay'}"
linhongliu-db (Contributor Author) commented:

Before this PR, '12345' and '-141704' are treated as epoch days because they fall outside the 0000-9999 range. This is for backward compatibility with JSON data generated by Spark 1.5. But this compatibility is very confusing: for example, before this PR '9999' is converted to '9999-01-01' while '10000' is converted to '1997-05-19'. So I suggest just removing this compatibility.
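The boundary can be illustrated with plain JDK time APIs; the epoch-day interpretation below matches the examples in the comment:

```scala
import java.time.LocalDate

// Before this PR, a string outside 0000-9999 was treated as an epoch-day count:
LocalDate.ofEpochDay(10000L) // 1997-05-19, the surprising result quoted above
// while '9999' (inside the range) parsed as a year:
LocalDate.of(9999, 1, 1)     // 9999-01-01
```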

SparkQA commented Jul 2, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45078/

SparkQA commented Jul 2, 2021

Test build #140566 has finished for PR 32959 at commit 5b4fe62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45125/

SparkQA commented Jul 3, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45125/

SparkQA commented Jul 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45434/

linhongliu-db (Contributor Author) commented Jul 12, 2021

cc @cloud-fan, comments are addressed and tests pass.

SparkQA commented Jul 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45434/

SparkQA commented Jul 12, 2021

Test build #140922 has finished for PR 32959 at commit 8d69c88.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45472/

SparkQA commented Jul 13, 2021

Test build #140958 has finished for PR 32959 at commit cd330a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class GetTimestamp(
  • case class ParseToTimestamp(
  • case class MakeTimestampNTZ(
  • case class MakeTimestampLTZ(
  • static class IntegerUpdater implements ParquetVectorUpdater
  • trait HDFSBackedStateStoreMap
  • class NoPrefixHDFSBackedStateStoreMap extends HDFSBackedStateStoreMap
  • class PrefixScannableHDFSBackedStateStoreMap(
  • class HDFSBackedReadStateStore(val version: Long, map: HDFSBackedStateStoreMap)
  • class HDFSBackedStateStore(val version: Long, mapToUpdate: HDFSBackedStateStoreMap)
  • sealed trait RocksDBStateEncoder
  • class PrefixKeyScanStateEncoder(
  • class NoPrefixKeyStateEncoder(keySchema: StructType, valueSchema: StructType)

SparkQA commented Jul 13, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45472/

cloud-fan (Contributor) commented:

@linhongliu-db please fix the code conflicts.

SparkQA commented Jul 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45490/

SparkQA commented Jul 13, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45490/

SparkQA commented Jul 13, 2021

Test build #140976 has finished for PR 32959 at commit 4723f8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ParseToTimestampLTZ(
  • case class DomainJoin(

SparkQA commented Jul 14, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45522/

cloud-fan (Contributor) commented:

thanks, merging to master/3.2 (since it's timestamp related)

cloud-fan closed this in b866457 Jul 14, 2021
cloud-fan pushed a commit that referenced this pull request Jul 14, 2021
### What changes were proposed in this pull request?
DATE/TIMESTAMP literals support years 0000 to 9999. However, internally we support a much larger range. We can add or subtract large intervals from a date/timestamp, and the system will happily process and display large negative and positive dates.

Since we obviously cannot put this genie back into the bottle, the only thing we can do is allow matching DATE/TIMESTAMP literals.

### Why are the changes needed?
Make Spark more usable; this is also a bug fix.

### Does this PR introduce _any_ user-facing change?
Yes, after this PR, the SQL below will produce different results:
```sql
select cast('-10000-1-2' as date) as date_col
-- before PR: NULL
-- after PR: -10000-1-2
```

```sql
select cast('2021-4294967297-11' as date) as date_col
-- before PR: 2021-01-11
-- after PR: NULL
```

### How was this patch tested?
newly added test cases

Closes #32959 from linhongliu-db/SPARK-35780-full-range-datetime.

Lead-authored-by: Linhong Liu <[email protected]>
Co-authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b866457)
Signed-off-by: Wenchen Fan <[email protected]>
SparkQA commented Jul 14, 2021

Test build #141008 has finished for PR 32959 at commit 538463a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class LocalTimestamp(timeZoneId: Option[String] = None) extends LeafExpression
  • sealed trait StreamingSessionWindowStateManager extends Serializable
  • class StreamingSessionWindowStateManagerImplV1(
  • class StreamingSessionWindowHelper(sessionExpression: Attribute, inputSchema: Seq[Attribute])

cloud-fan pushed a commit that referenced this pull request Jul 29, 2021
### What changes were proposed in this pull request?
In PR #32959, we found that some weird datetime strings can be parsed. ([details](#32959 (comment)))
This PR blocks these invalid datetime strings.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
Yes, the strings below will produce different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
some new test cases

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.

Authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 29, 2021
### What changes were proposed in this pull request?
In PR #32959, we found that some weird datetime strings can be parsed. ([details](#32959 (comment)))
This PR blocks these invalid datetime strings.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
Yes, the strings below will produce different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
some new test cases

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.

Authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ed0e351)
Signed-off-by: Wenchen Fan <[email protected]>
```scala
if (s == null) {
def isValidDigits(segment: Int, digits: Int): Boolean = {
  // An integer is able to represent a date within [+-]5 million years.
  var maxDigitsYear = 7
```
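From the comment in the snippet and the review discussion above, a hedged reconstruction of what such a digit check might look like; this is not necessarily the exact Spark source, and the per-segment bounds are assumptions:

```scala
// Segment 0 is the year; other segments are month/day/etc.
// Bounds are assumptions inferred from the surrounding comments.
def isValidDigits(segment: Int, digits: Int): Boolean = {
  // An Int of days can represent a date within [+-]5 million years,
  // hence at most 7 digits for the year.
  val maxDigitsYear = 7
  if (segment == 0) digits >= 4 && digits <= maxDigitsYear // year: at least 4 digits
  else digits > 0 && digits <= 2                           // others: 1 or 2 digits
}
```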
A contributor commented:

Can I implement a configuration item that configures the range of digits allowed for the year?

I found that writing the same value to tables stored in different formats produces different results.

```sql
create table t(c1 date) stored as textfile;
insert overwrite table t select cast('22022-05-01' as date);
select * from t; -- output NULL

create table t(c1 date) stored as orcfile;
insert overwrite table t select cast('22022-05-01' as date);
select * from t; -- output +22022-05-01
```

This is because ORC/Parquet store dates as integers, while textfile and sequencefile store them as text.


But if you use Hive JDBC, the query will fail, because java.sql.Date only supports 4-digit years.

```
Caused by: java.lang.IllegalArgumentException
  at java.sql.Date.valueOf(Date.java:143)
  at org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:447)
```
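The client-side failure can be reproduced with the plain JDK API, since `java.sql.Date.valueOf` requires a strict `yyyy-mm-dd` layout with a 4-digit year:

```scala
java.sql.Date.valueOf("2022-05-01")  // parses fine
java.sql.Date.valueOf("22022-05-01") // throws IllegalArgumentException, as in the trace above
```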

A contributor replied:

It's expected that not all data sources and BI clients support datetime values larger than 10000-01-01; the question is when the failure should happen.

It looks to me like the Hive table should fail to write 22022-05-01 with the textfile source, and Hive JDBC should fail on the client side, saying that 22022-05-01 is not supported.

A contributor added:

BTW, I don't think it's possible to add a Spark config to forbid large datetime values. The literal is just one place; there are many other datetime operations that may produce large datetime values, and those existed before this PR.
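A hedged example of that point: ordinary datetime functions can already produce out-of-range values without any literal being involved, so a literal-only config would not close the door (output formatting is an assumption):

```scala
spark.sql("SELECT date_add(DATE'9999-12-31', 1) AS d").show(false)
// expected to display a year-10000 date, produced with no out-of-range literal
```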

A contributor replied:

Thanks for your explanation, makes sense.

Some dates that users of previous Spark versions saw treated as invalid can now be handled normally in Spark 3.2, even though they were valid dates all along. I bring this up because I didn't see this behavior change in the migration guide before noticing this PR.

A contributor replied:

Yeah, the impact on BI clients was missed, though strictly speaking BI clients are not part of Spark.

cloud-fan pushed a commit that referenced this pull request Jul 27, 2022
…g dates in "yyyyMMdd" format with CORRECTED time parser policy

### What changes were proposed in this pull request?

This PR fixes a correctness issue when reading a CSV or a JSON file with dates in "yyyyMMdd" format:
```
name,mydate
1,2020011
2,20201203
```
or
```
{"date": "2020011"}
{"date": "20201203"}
```

Prior to #32959, reading this CSV file would return:
```
+----+--------------+
|name|mydate        |
+----+--------------+
|1   |null          |
|2   |2020-12-03    |
+----+--------------+
```

However, after the patch, the invalid date is parsed because of the much more lenient parsing in `DateTimeUtils.stringToDate`; the method treats `2020011` as a full year:
```
+----+--------------+
|name|mydate        |
+----+--------------+
|1   |+2020011-01-01|
|2   |2020-12-03    |
+----+--------------+
```
A similar result would be observed in JSON.

This PR addresses the correctness issue by introducing a new configuration option `enableDateTimeParsingFallback`, which allows enabling/disabling the backward compatible parsing.

Currently, by default, we fall back to the backward compatible behavior only if the parser policy is legacy and no custom pattern was set (this is defined in `UnivocityParser` and `JacksonParser` for CSV and JSON respectively).

### Why are the changes needed?
Fixes a correctness issue in Spark 3.4.

### Does this PR introduce _any_ user-facing change?

In order to avoid correctness issues when reading CSV or JSON files with a custom pattern, a new configuration option `enableDateTimeParsingFallback` has been added to control whether or not the code would fall back to the backward compatible behavior of parsing dates and timestamps in CSV and JSON data sources.
- If the config is enabled and the date cannot be parsed, we will fall back to `DateTimeUtils.stringToDate`.
- If the config is enabled and the timestamp cannot be parsed, `DateTimeUtils.stringToTimestamp` will be used.
- Otherwise, depending on the parser policy and a custom pattern, the value will be parsed as null.

### How was this patch tested?

I added unit tests for CSV and JSON to verify the fix and the config option.

Closes #37147 from sadikovi/fix-csv-date-inference.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
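A hedged usage sketch of the option described in this commit; the option name comes from the commit message, while the file path and schema are hypothetical:

```scala
// Disable the backward compatible fallback so that dates which do not match
// the custom pattern come back as NULL instead of mis-parsed values.
val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "false")
  .schema("name INT, mydate DATE")
  .csv("/tmp/dates.csv") // hypothetical path holding the sample rows above
df.show() // '2020011' should yield NULL rather than +2020011-01-01
```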