-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-49082][SQL] Support widening Date to TimestampNTZ in Avro reader #50315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Row(LocalDateTime.of(-5877641, 6, 23, 0, 0)), | ||
| Row(LocalDateTime.of(5881580, 7, 11, 0, 0))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johanl-db These test cases fail with an ArithmeticException. However, this implementation is based on the Parquet implementation. I noticed that the Parquet reader will also fail when attempting to upcast any date earlier than -290308-12-22 BCE and later than +294247-01-10 CE with the same ArithmeticException due to long overflow.
I think this is because the Date Spark type supports dates from June 23 -5877641 CE to July 11 +5881580 CE, but TimestampNTZ supports -290308-12-21 BCE 19:59:06 to +294247-01-10 CE 04:00:54, which is a smaller range. In the context of type widening, shouldn't this widening be unsupported by all readers since Date stores a larger range of values than TimestampNTZ (even though it has less precision)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not great..
I think for now we can document that limitation. Iceberg allows widening date -> timestamp without timezone and also calls this out: https://iceberg.apache.org/spec/#schema-evolution
Hopefully very few users are using date values outside -290308-12-21 BCE - +294247-01-10 CE
Moving forward, we'll probably want to address this by moving the error from read time to write time, i.e. check that all values in the table fit the timestamp range before applying the type change (which we should be able to tell from stats only in most cases).
Note: this is a Delta issue really, so not issue for this spark PR in itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks. I have updated the test cases in this PR and added a comment documenting this behavior.
|
Merged to master. |
### What changes were proposed in this pull request? This change adds support for widening type promotions from `Date` to `TimestampNTZ` in `AvroDeserializer. This PR is a follow-up to apache#47582 which adds support for other widening type promotions. ### Why are the changes needed? When reading Avro files with a mix of Date and TimestampNTZ for a given column, the reader should be able to read all files and promote Date to TimestampNTZ instead of throwing an error when reading files with Date. Although [SPARK-49082](https://issues.apache.org/jira/browse/SPARK-49082) was resolved by apache#47582, that PR did not include Date -> TimestampNTZ widening. The change in this PR is very similar to apache#44368 which adds support for Date -> TimestampNTZ widening for the Parquet reader. ### Does this PR introduce _any_ user-facing change? Yes, users will no longer see an error when attempting to read a file containing Date when the read schema contains TimestampNTZ. The time will be set to 00:00, as has been done in apache#44368. ### How was this patch tested? New test in `AvroSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#50315 from aldenlau-db/SPARK-49082. Authored-by: Alden Lau <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
This change adds support for widening type promotions from
DatetoTimestampNTZin `AvroDeserializer. This PR is a follow-up to #47582 which adds support for other widening type promotions.Why are the changes needed?
When reading Avro files with a mix of Date and TimestampNTZ for a given column, the reader should be able to read all files and promote Date to TimestampNTZ instead of throwing an error when reading files with Date.
Although SPARK-49082 was resolved by #47582, that PR did not include Date -> TimestampNTZ widening. The change in this PR is very similar to #44368 which adds support for Date -> TimestampNTZ widening for the Parquet reader.
Does this PR introduce any user-facing change?
Yes, users will no longer see an error when attempting to read a file containing Date when the read schema contains TimestampNTZ. The time will be set to 00:00, as has been done in #44368.
How was this patch tested?
New test in
AvroSuite.Was this patch authored or co-authored using generative AI tooling?
No