Skip to content

Conversation

@aldenlau-db
Copy link
Contributor

@aldenlau-db aldenlau-db commented Mar 19, 2025

What changes were proposed in this pull request?

This change adds support for widening type promotions from Date to TimestampNTZ in `AvroDeserializer. This PR is a follow-up to #47582 which adds support for other widening type promotions.

Why are the changes needed?

When reading Avro files with a mix of Date and TimestampNTZ for a given column, the reader should be able to read all files and promote Date to TimestampNTZ instead of throwing an error when reading files with Date.

Although SPARK-49082 was resolved by #47582, that PR did not include Date -> TimestampNTZ widening. The change in this PR is very similar to #44368 which adds support for Date -> TimestampNTZ widening for the Parquet reader.

Does this PR introduce any user-facing change?

Yes, users will no longer see an error when attempting to read a file containing Date when the read schema contains TimestampNTZ. The time will be set to 00:00, as has been done in #44368.

How was this patch tested?

New test in AvroSuite.

Was this patch authored or co-authored using generative AI tooling?

No

@aldenlau-db aldenlau-db changed the title Initial commit [SPARK-49082][SQL] Support widening Date to TimestampNTZ in Avro reader Mar 19, 2025
@aldenlau-db aldenlau-db reopened this Mar 19, 2025
@aldenlau-db aldenlau-db changed the title [SPARK-49082][SQL] Support widening Date to TimestampNTZ in Avro reader [WIP][SPARK-49082][SQL] Support widening Date to TimestampNTZ in Avro reader Mar 19, 2025
@aldenlau-db aldenlau-db changed the title [WIP][SPARK-49082][SQL] Support widening Date to TimestampNTZ in Avro reader [SPARK-49082][SQL] Support widening Date to TimestampNTZ in Avro reader Mar 19, 2025
@aldenlau-db
Copy link
Contributor Author

cc @johanl-db @sandip-db

Comment on lines 983 to 984
Row(LocalDateTime.of(-5877641, 6, 23, 0, 0)),
Row(LocalDateTime.of(5881580, 7, 11, 0, 0)))
Copy link
Contributor Author

@aldenlau-db aldenlau-db Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johanl-db These test cases fail with an ArithmeticException. However, this implementation is based on the Parquet implementation. I noticed that the Parquet reader will also fail when attempting to upcast any date earlier than -290308-12-22 BCE and later than +294247-01-10 CE with the same ArithmeticException due to long overflow.

I think this is because the Date Spark type supports dates from June 23 -5877641 CE to July 11 +5881580 CE, but TimestampNTZ supports -290308-12-21 BCE 19:59:06 to +294247-01-10 CE 04:00:54, which is a smaller range. In the context of type widening, shouldn't this widening be unsupported by all readers since Date stores a larger range of values than TimestampNTZ (even though it has less precision)?

Copy link
Contributor

@johanl-db johanl-db Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not great..

I think for now we can document that limitation. Iceberg allows widening date -> timestamp without timezone and also calls this out: https://iceberg.apache.org/spec/#schema-evolution
Hopefully very few users are using date values outside -290308-12-21 BCE - +294247-01-10 CE

Moving forward, we'll probably want to address this by moving the error from read time to write time, i.e. check that all values in the table fit the timestamp range before applying the type change (which we should be able to tell from stats only in most cases).

Note: this is a Delta issue really, so not issue for this spark PR in itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks. I have updated the test cases in this PR and added a comment documenting this behavior.

@aldenlau-db aldenlau-db requested a review from johanl-db March 20, 2025 19:13
@HyukjinKwon
Copy link
Member

Merged to master.

SauronShepherd pushed a commit to SauronShepherd/spark that referenced this pull request Mar 25, 2025
### What changes were proposed in this pull request?

This change adds support for widening type promotions from `Date` to `TimestampNTZ` in `AvroDeserializer. This PR is a follow-up to apache#47582 which adds support for other widening type promotions.

### Why are the changes needed?

When reading Avro files with a mix of Date and TimestampNTZ for a given column, the reader should be able to read all files and promote Date to TimestampNTZ instead of throwing an error when reading files with Date.

Although [SPARK-49082](https://issues.apache.org/jira/browse/SPARK-49082) was resolved by apache#47582, that PR did not include Date -> TimestampNTZ widening. The change in this PR is very similar to apache#44368 which adds support for Date -> TimestampNTZ widening for the Parquet reader.

### Does this PR introduce _any_ user-facing change?

Yes, users will no longer see an error when attempting to read a file containing Date when the read schema contains TimestampNTZ. The time will be set to 00:00, as has been done in apache#44368.

### How was this patch tested?

New test in `AvroSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#50315 from aldenlau-db/SPARK-49082.

Authored-by: Alden Lau <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants