[SPARK-46466][SQL] Vectorized parquet reader should never do rebase for timestamp ntz #44428
Conversation
@cloud-fan Thanks for the ping. I will review this PR tomorrow.
test("write and read TimestampNTZ with legacy rebase mode") {
If you don't mind, please add a `SPARK-46466:` test prefix here for easy tracing of this correctness bug.
I updated the JIRA as a correctness blocker issue for Apache Spark 3.5.1 and 3.4.3. Thank you for the fix.
...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
// For unsigned int64, it stores as plain signed int64 in Parquet when dictionary
// fallbacks. We read them as decimal values.
return new UnsignedLongUpdater();
} else if (isTimestamp(sparkType) &&
Shall we remove `isTimestamp`? It is not used any more.
done
MaxGekk left a comment:
The TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need to do calendar rebase.
But the rebasing behaviour of TIMESTAMP_NTZ has already been released, by at least Spark 3.5 (and maybe 3.4). So, Spark users might have written TIMESTAMP_NTZ with the rebase SQL config set to LEGACY. How will they read the data back after switching to Spark 4.0.0? It seems we need a legacy config to restore the previous behaviour, don't we?
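For context, "rebase" here means converting timestamps between the hybrid Julian-Gregorian calendar that legacy Spark (before 3.0) effectively used via `java.util.GregorianCalendar` and the proleptic Gregorian calendar of `java.time` used since Spark 3.0. The following is a minimal, self-contained Java sketch (illustrative only, not Spark code) showing why the two calendars disagree for dates before the 1582-10-15 cutover:

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

public class RebaseDemo {
    // Epoch day of a calendar date in the hybrid Julian-Gregorian calendar,
    // i.e. what legacy Spark effectively used via java.util.GregorianCalendar.
    static long hybridEpochDay(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, month - 1, day);  // before 1582-10-15 this follows Julian rules
        return TimeUnit.MILLISECONDS.toDays(cal.getTimeInMillis());
    }

    public static void main(String[] args) {
        // Proleptic Gregorian (java.time, Spark >= 3.0) vs hybrid calendar:
        long proleptic = LocalDate.of(1500, 1, 1).toEpochDay();
        long hybrid = hybridEpochDay(1500, 1, 1);
        System.out.println("1500-01-01: proleptic=" + proleptic
            + " hybrid=" + hybrid + " diff=" + (hybrid - proleptic));
        // Modern dates agree, so only pre-cutover values are affected by rebase.
        System.out.println("2000-01-01 equal: "
            + (hybridEpochDay(2000, 1, 1) == LocalDate.of(2000, 1, 1).toEpochDay()));
    }
}
```

The two epoch-day values differ for 1500-01-01 but coincide for any modern date, which is why a spurious rebase only corrupts timestamps before the Gregorian cutover.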
- void validateTimestampType(DataType sparkType) {
+ void validateTimestampNTZType(DataType sparkType) {
- The input parameter `sparkType` is no longer in use; can it be removed?
- Although the old code is like this, `assert` is a no-op when the JVM does not run with `-ea`.
- Is it possible to reduce the access scope of the `validateTimestampNTZType` method to `private` in this PR? It does not need to be accessed by other classes.
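Taken together, the three suggestions could look like the following sketch. The types here are hypothetical stand-ins, not the real Parquet/Spark classes, and `rejects` is a helper added only to make the private check observable:

```java
public class UpdaterFactorySketch {
    // Stand-ins for the real Parquet logical-type metadata (hypothetical names).
    enum TimeUnit { MILLIS, MICROS, NANOS }

    static class TimestampAnnotation {
        final boolean adjustedToUtc;
        final TimeUnit unit;
        TimestampAnnotation(boolean adjustedToUtc, TimeUnit unit) {
            this.adjustedToUtc = adjustedToUtc;
            this.unit = unit;
        }
    }

    // All three review suggestions applied: private access, no unused
    // sparkType parameter, and an explicit exception instead of `assert`
    // (which silently passes when the JVM runs without -ea).
    private static void validateTimestampNTZType(TimestampAnnotation ann) {
        if (ann.adjustedToUtc) {
            throw new IllegalStateException(
                "TIMESTAMP_NTZ column must have isAdjustedToUTC == false");
        }
    }

    // Small testable wrapper around the private check.
    static boolean rejects(TimestampAnnotation ann) {
        try {
            validateTimestampNTZType(ann);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        TimestampAnnotation ntz = new TimestampAnnotation(false, TimeUnit.MICROS);
        TimestampAnnotation utcAdjusted = new TimestampAnnotation(true, TimeUnit.MICROS);
        System.out.println("ntz rejected: " + rejects(ntz)
            + ", utc-adjusted rejected: " + rejects(utcAdjusted));
    }
}
```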
beliefer left a comment:
LGTM except one comment.
} else if (sparkType == DataTypes.TimestampNTZType &&
    isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MICROS)) {
  validateTimestampNTZType(sparkType);
  // TIMESTAMP_NTZ is a new data type and has no legacy files that need to do rebase.
Just a question: TIMESTAMP_NTZ was released in 3.5.0, so why are there no legacy files?
legacy here means parquet files written before the calendar switch.
Got it now. You mean we never rebase the time zone for TIMESTAMP_NTZ.
No, we never rebase NTZ values during writing. It's a pure bug that we never rebase NTZ during writing but may rebase during reading.
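This write/read asymmetry can be demonstrated with a self-contained sketch. `rebaseJulianToGregorianMicros` below is a hypothetical stand-in for the concept (reinterpret an instant's hybrid-calendar field values as proleptic-Gregorian ones), not Spark's actual `RebaseDateTime` implementation:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class NtzRoundTrip {
    // Hypothetical stand-in for a Julian -> proleptic-Gregorian micros rebase:
    // read the instant's field values under the hybrid calendar and reinterpret
    // them as proleptic-Gregorian field values.
    static long rebaseJulianToGregorianMicros(long micros) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(Math.floorDiv(micros, 1000L));
        LocalDateTime asGregorian = LocalDateTime.of(
            cal.get(GregorianCalendar.YEAR), cal.get(GregorianCalendar.MONTH) + 1,
            cal.get(GregorianCalendar.DAY_OF_MONTH), cal.get(GregorianCalendar.HOUR_OF_DAY),
            cal.get(GregorianCalendar.MINUTE), cal.get(GregorianCalendar.SECOND));
        return asGregorian.toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
    }

    public static void main(String[] args) {
        // An NTZ value before 1582, stored as-is (the write path never rebases NTZ).
        long written = LocalDateTime.of(1500, 1, 1, 0, 0)
            .toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
        long buggyRead = rebaseJulianToGregorianMicros(written);  // pre-fix read path
        long fixedRead = written;                                 // post-fix read path
        // The buggy read path silently shifts the pre-1582 timestamp by days.
        System.out.println("read-side shift in days: "
            + (buggyRead - written) / 86_400_000_000L);
        System.out.println("round-trips after fix: " + (fixedRead == written));
    }
}
```

Only the read path applied the conversion, so the value was shifted on the way out even though it was stored verbatim; skipping the rebase entirely restores the round trip.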
+1, LGTM. Merging to master.
@cloud-fan BTW, this should be backported to 3.4 and 3.5, correct? Update: in any case it conflicts with branch-3.5 and branch-3.4. Please open separate PRs with the backports if needed.
…or timestamp ntz

This fixes a correctness bug. The TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need to do calendar rebase. However, the vectorized parquet reader treats it the same as LTZ and may do a rebase if the parquet file was written with the legacy rebase mode. This PR fixes it to never do rebase for NTZ. Now we can correctly write and read back NTZ values even if the date is before 1582. Tested with a new test.

Closes apache#44428 from cloud-fan/ntz.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Late LGTM! Thanks for the fix.
What changes were proposed in this pull request?
This fixes a correctness bug. The TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need to do calendar rebase. However, the vectorized parquet reader treats it the same as LTZ and may do a rebase if the parquet file was written with the legacy rebase mode. This PR fixes it to never do rebase for NTZ.
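The essence of the change can be sketched with stand-in types (hypothetical names, not the actual classes in Spark's `ParquetVectorUpdaterFactory`): the updater chosen for a TIMESTAMP_NTZ column no longer consults the file's rebase metadata, while TIMESTAMP_LTZ still does:

```java
public class UpdaterSelectionSketch {
    // Hypothetical stand-ins for the reader's types.
    enum RebaseMode { NONE, LEGACY }
    interface Updater {}
    static class LongUpdater implements Updater {}
    static class LongWithRebaseUpdater implements Updater {}

    // The shape of the fix: NTZ never consults the file's rebase metadata,
    // while LTZ still rebases when the file was written in LEGACY mode.
    static Updater updaterFor(boolean isNtz, RebaseMode fileRebaseMode) {
        if (isNtz) {
            return new LongUpdater();            // never rebase TIMESTAMP_NTZ
        } else if (fileRebaseMode == RebaseMode.LEGACY) {
            return new LongWithRebaseUpdater();  // TIMESTAMP_LTZ may still need rebase
        } else {
            return new LongUpdater();
        }
    }

    public static void main(String[] args) {
        System.out.println("NTZ+LEGACY -> "
            + updaterFor(true, RebaseMode.LEGACY).getClass().getSimpleName());
        System.out.println("LTZ+LEGACY -> "
            + updaterFor(false, RebaseMode.LEGACY).getClass().getSimpleName());
    }
}
```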
Why are the changes needed?
bug fix
Does this PR introduce any user-facing change?
Yes, now we can correctly write and read back NTZ values even if the date is before 1582.
How was this patch tested?
new test
Was this patch authored or co-authored using generative AI tooling?
No