[SPARK-46466][SQL] Vectorized parquet reader should never do rebase for timestamp ntz #44428
Conversation
@cloud-fan Thanks for the ping. I will review this PR tomorrow.
test("write and read TimestampNTZ with legacy rebase mode") {
If you don't mind, please add a `SPARK-46466:` test prefix here for easy tracing of this correctness bug.
I updated the JIRA as a correctness blocker issue for Apache Spark 3.5.1 and 3.4.3. Thank you for the fix.
...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
// For unsigned int64, it stores as plain signed int64 in Parquet when dictionary
// fallbacks. We read them as decimal values.
return new UnsignedLongUpdater();
} else if (isTimestamp(sparkType) &&
Shall we remove `isTimestamp`? It is not used any more.
done
MaxGekk left a comment:
The TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need to do calendar rebase.
But the rebasing behaviour of TIMESTAMP_NTZ has already been released, by at least Spark 3.5 (and maybe 3.4). So, Spark users might have written TIMESTAMP_NTZ with the rebase SQL config set to LEGACY. How will they read the data back after switching to Spark 4.0.0? It seems we need a legacy config to restore the previous behaviour, don't we?
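For context, "rebase" here means converting timestamps between the hybrid Julian-Gregorian calendar that legacy Spark (before 3.0) effectively used via `java.util.GregorianCalendar` and the proleptic Gregorian calendar of `java.time` used since Spark 3.0. The following is a minimal, self-contained Java sketch (illustrative only, not Spark code) showing why the two calendars disagree for dates before the 1582-10-15 cutover:

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

public class RebaseDemo {
    // Epoch day of a calendar date in the hybrid Julian-Gregorian calendar,
    // i.e. what legacy Spark effectively used via java.util.GregorianCalendar.
    static long hybridEpochDay(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, month - 1, day);  // before 1582-10-15 this follows Julian rules
        return TimeUnit.MILLISECONDS.toDays(cal.getTimeInMillis());
    }

    public static void main(String[] args) {
        // Proleptic Gregorian (java.time, Spark >= 3.0) vs hybrid calendar:
        long proleptic = LocalDate.of(1500, 1, 1).toEpochDay();
        long hybrid = hybridEpochDay(1500, 1, 1);
        System.out.println("1500-01-01: proleptic=" + proleptic
            + " hybrid=" + hybrid + " diff=" + (hybrid - proleptic));
        // Modern dates agree, so only pre-cutover values are affected by rebase.
        System.out.println("2000-01-01 equal: "
            + (hybridEpochDay(2000, 1, 1) == LocalDate.of(2000, 1, 1).toEpochDay()));
    }
}
```

The two epoch-day values differ for 1500-01-01 but coincide for any modern date, which is why a spurious rebase only corrupts timestamps before the Gregorian cutover.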
- void validateTimestampType(DataType sparkType) {
+ void validateTimestampNTZType(DataType sparkType) {
- The input parameter `sparkType` is no longer in use; can it be removed?
- Although the old code is like this, `assert` is a no-op when the JVM does not run with `-ea`.
- Is it possible to reduce the access scope of the `validateTimestampNTZType` method to `private` in this PR? It does not need to be accessed by other classes.
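Taken together, the three suggestions could look like the following sketch. The types here are hypothetical stand-ins, not the real Parquet/Spark classes, and `rejects` is a helper added only to make the private check observable:

```java
public class UpdaterFactorySketch {
    // Stand-ins for the real Parquet logical-type metadata (hypothetical names).
    enum TimeUnit { MILLIS, MICROS, NANOS }

    static class TimestampAnnotation {
        final boolean adjustedToUtc;
        final TimeUnit unit;
        TimestampAnnotation(boolean adjustedToUtc, TimeUnit unit) {
            this.adjustedToUtc = adjustedToUtc;
            this.unit = unit;
        }
    }

    // All three review suggestions applied: private access, no unused
    // sparkType parameter, and an explicit exception instead of `assert`
    // (which silently passes when the JVM runs without -ea).
    private static void validateTimestampNTZType(TimestampAnnotation ann) {
        if (ann.adjustedToUtc) {
            throw new IllegalStateException(
                "TIMESTAMP_NTZ column must have isAdjustedToUTC == false");
        }
    }

    // Small testable wrapper around the private check.
    static boolean rejects(TimestampAnnotation ann) {
        try {
            validateTimestampNTZType(ann);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        TimestampAnnotation ntz = new TimestampAnnotation(false, TimeUnit.MICROS);
        TimestampAnnotation utcAdjusted = new TimestampAnnotation(true, TimeUnit.MICROS);
        System.out.println("ntz rejected: " + rejects(ntz)
            + ", utc-adjusted rejected: " + rejects(utcAdjusted));
    }
}
```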
beliefer left a comment:
LGTM except one comment.
} else if (sparkType == DataTypes.TimestampNTZType &&
    isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MICROS)) {
  validateTimestampNTZType(sparkType);
  // TIMESTAMP_NTZ is a new data type and has no legacy files that need to do rebase.
Just a question: TIMESTAMP_NTZ was released in 3.5.0, so why are there no legacy files?
legacy here means parquet files written before the calendar switch.
Got it now. You mean we never rebase the time zone for TIMESTAMP_NTZ.
No, we never rebase NTZ values during writing. It's a pure bug that we never rebase NTZ during writing but may rebase during reading.
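This write/read asymmetry can be demonstrated with a self-contained sketch. `rebaseJulianToGregorianMicros` below is a hypothetical stand-in for the concept (reinterpret an instant's hybrid-calendar field values as proleptic-Gregorian ones), not Spark's actual `RebaseDateTime` implementation:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class NtzRoundTrip {
    // Hypothetical stand-in for a Julian -> proleptic-Gregorian micros rebase:
    // read the instant's field values under the hybrid calendar and reinterpret
    // them as proleptic-Gregorian field values.
    static long rebaseJulianToGregorianMicros(long micros) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(Math.floorDiv(micros, 1000L));
        LocalDateTime asGregorian = LocalDateTime.of(
            cal.get(GregorianCalendar.YEAR), cal.get(GregorianCalendar.MONTH) + 1,
            cal.get(GregorianCalendar.DAY_OF_MONTH), cal.get(GregorianCalendar.HOUR_OF_DAY),
            cal.get(GregorianCalendar.MINUTE), cal.get(GregorianCalendar.SECOND));
        return asGregorian.toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
    }

    public static void main(String[] args) {
        // An NTZ value before 1582, stored as-is (the write path never rebases NTZ).
        long written = LocalDateTime.of(1500, 1, 1, 0, 0)
            .toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
        long buggyRead = rebaseJulianToGregorianMicros(written);  // pre-fix read path
        long fixedRead = written;                                 // post-fix read path
        // The buggy read path silently shifts the pre-1582 timestamp by days.
        System.out.println("read-side shift in days: "
            + (buggyRead - written) / 86_400_000_000L);
        System.out.println("round-trips after fix: " + (fixedRead == written));
    }
}
```

Only the read path applied the conversion, so the value was shifted on the way out even though it was stored verbatim; skipping the rebase entirely restores the round trip.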
+1, LGTM. Merging to master.
@cloud-fan BTW, this should be backported to 3.4 and 3.5, correct? Update: in any case it conflicts with branch-3.5 and branch-3.4. Please open separate PRs with the backports if needed.
…or timestamp ntz

This fixes a correctness bug. The TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need to do calendar rebase. However, the vectorized parquet reader treats it the same as LTZ and may do a rebase if the parquet file was written with the legacy rebase mode. This PR fixes it to never do rebase for NTZ. Now we can correctly write and read back NTZ values even if the date is before 1582. Tested with a new test.

Closes apache#44428 from cloud-fan/ntz.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Late LGTM! Thanks for the fix.
What changes were proposed in this pull request?
This fixes a correctness bug. The TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need to do calendar rebase. However, the vectorized parquet reader treats it the same as LTZ and may do a rebase if the parquet file was written with the legacy rebase mode. This PR fixes it to never do rebase for NTZ.
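The essence of the change can be sketched with stand-in types (hypothetical names, not the actual classes in Spark's `ParquetVectorUpdaterFactory`): the updater chosen for a TIMESTAMP_NTZ column no longer consults the file's rebase metadata, while TIMESTAMP_LTZ still does:

```java
public class UpdaterSelectionSketch {
    // Hypothetical stand-ins for the reader's types.
    enum RebaseMode { NONE, LEGACY }
    interface Updater {}
    static class LongUpdater implements Updater {}
    static class LongWithRebaseUpdater implements Updater {}

    // The shape of the fix: NTZ never consults the file's rebase metadata,
    // while LTZ still rebases when the file was written in LEGACY mode.
    static Updater updaterFor(boolean isNtz, RebaseMode fileRebaseMode) {
        if (isNtz) {
            return new LongUpdater();            // never rebase TIMESTAMP_NTZ
        } else if (fileRebaseMode == RebaseMode.LEGACY) {
            return new LongWithRebaseUpdater();  // TIMESTAMP_LTZ may still need rebase
        } else {
            return new LongUpdater();
        }
    }

    public static void main(String[] args) {
        System.out.println("NTZ+LEGACY -> "
            + updaterFor(true, RebaseMode.LEGACY).getClass().getSimpleName());
        System.out.println("LTZ+LEGACY -> "
            + updaterFor(false, RebaseMode.LEGACY).getClass().getSimpleName());
    }
}
```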
Why are the changes needed?
bug fix
Does this PR introduce any user-facing change?
Yes, now we can correctly write and read back NTZ values even if the date is before 1582.
How was this patch tested?
new test
Was this patch authored or co-authored using generative AI tooling?
No