Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

This fixes a correctness bug. TIMESTAMP_NTZ is a new data type in Spark, so there are no legacy files that need calendar rebase. However, the vectorized Parquet reader treats it the same as TIMESTAMP_LTZ and may rebase values if the Parquet file was written with the legacy rebase mode. This PR fixes the reader so it never rebases NTZ values.

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

Yes. Now we can correctly write and read back TIMESTAMP_NTZ values even when the date is before 1582.

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

No
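The rebase in question exists because Java's legacy `java.util.GregorianCalendar` is a hybrid Julian/Gregorian calendar, while `java.time` (and Spark 3+'s Parquet timestamp semantics) use the proleptic Gregorian calendar, so the two disagree for dates before the 1582 cutover. A minimal, Spark-free sketch of that disagreement (the class and method names are illustrative, not from the PR):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class RebaseDemo {
    // Days by which the hybrid Julian/Gregorian calendar and the proleptic
    // Gregorian calendar disagree for a given wall-clock date (UTC).
    static long gapDays(int year, int month, int day) {
        // Proleptic Gregorian instant for the date (java.time).
        long prolepticMillis = LocalDateTime.of(year, month, day, 0, 0)
            .toInstant(ZoneOffset.UTC).toEpochMilli();
        // Hybrid-calendar instant for the same wall-clock date (java.util).
        GregorianCalendar hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        hybrid.clear();
        hybrid.set(year, month - 1, day); // Calendar months are 0-based
        long hybridMillis = hybrid.getTimeInMillis();
        return (hybridMillis - prolepticMillis) / 86_400_000L;
    }

    public static void main(String[] args) {
        // Before the 1582 cutover the calendars disagree by several days, so
        // applying the rebase adjustment to an NTZ value that was written
        // without it shifts the timestamp.
        System.out.println(gapDays(1500, 1, 1)); // 9
        System.out.println(gapDays(2023, 1, 1)); // 0: calendars agree
    }
}
```

The fix makes the vectorized reader skip that adjustment entirely for NTZ, matching the write path, which never applied it.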

@github-actions github-actions bot added the SQL label Dec 20, 2023
@cloud-fan
Contributor Author

@MaxGekk
Member

MaxGekk commented Dec 20, 2023

@cloud-fan Thanks for the ping. I will review this PR tomorrow.

}
}

test("write and read TimestampNTZ with legacy rebase mode") {
Member

If you don't mind, please add the test prefix SPARK-46466: here to make this correctness bug easy to trace.

@dongjoon-hyun
Member

I updated the JIRA as a correctness blocker issue for Apache Spark 3.5.1 and 3.4.3. Thank you for the fix.

// For unsigned int64, it stores as plain signed int64 in Parquet when dictionary
// fallbacks. We read them as decimal values.
return new UnsignedLongUpdater();
} else if (isTimestamp(sparkType) &&
Contributor

Shall we remove isTimestamp since it is not used any more?

Contributor Author

done

Member

@MaxGekk left a comment

The TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need to do calendar rebase.

But the behaviour of rebasing TIMESTAMP_NTZ has already been released in at least Spark 3.5 (and maybe 3.4). So Spark users might write TIMESTAMP_NTZ with the rebase SQL config set to LEGACY. How will they read the data back after switching to Spark 4.0.0? It seems we need a legacy config to restore the previous behaviour, don't we?
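For context, these are the session configs that control Parquet datetime rebase behaviour in Spark 3.x; the LEGACY scenario the comment describes would look roughly like this (a sketch — check your Spark version's migration guide for the exact defaults):

```sql
-- Possible values: EXCEPTION (fail on ambiguous values), CORRECTED, LEGACY
SET spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY;
SET spark.sql.parquet.datetimeRebaseModeInRead=LEGACY;
```

The thread below resolves the concern: the write path never rebased NTZ regardless of this config, so no such legacy NTZ data exists and no restore config is needed.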

}

- void validateTimestampType(DataType sparkType) {
+ void validateTimestampNTZType(DataType sparkType) {
Contributor

  1. The input parameter sparkType is no longer used; can it be removed?
  2. Although the old code did this too, assert is a no-op unless the JVM runs with -ea, so it does not actually validate anything in production.
  3. Could we reduce the access scope of the validateTimestampNTZType method to private in this PR? It does not need to be accessed by other classes.

Contributor

@beliefer left a comment

LGTM except one comment.

} else if (sparkType == DataTypes.TimestampNTZType &&
isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MICROS)) {
validateTimestampNTZType(sparkType);
// TIMESTAMP_NTZ is a new data type and has no legacy files that need to do rebase.
Contributor

Just a question: TIMESTAMP_NTZ was released in 3.5.0, so why are there no legacy files?

Contributor Author

Legacy here means Parquet files written before the Julian-to-Gregorian calendar switch.

Contributor

Got it now. You mean we never do calendar rebase for TIMESTAMP_NTZ.

@cloud-fan
Contributor Author

But the behaviour of rebasing of TIMESTAMP_NTZ has been released already by at least Spark 3.5 (and maybe 3.4).

No, we never rebase NTZ values during writing. It's purely a bug that we never rebase NTZ during writing but may rebase during reading.

@MaxGekk
Member

MaxGekk commented Dec 21, 2023

+1, LGTM. Merging to master.
Thank you, @cloud-fan and @dongjoon-hyun @LuciferYang @beliefer for review.

@MaxGekk
Member

MaxGekk commented Dec 21, 2023

@cloud-fan BTW, this should be backported to 3.4 and 3.5, correct?

Update: In any case it conflicts with branch-3.5 and branch-3.4. Please open separate PRs with backports if needed.

@MaxGekk MaxGekk closed this in 4d21e55 Dec 21, 2023
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Dec 21, 2023
…or timestamp ntz


Closes apache#44428 from cloud-fan/ntz.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
cloud-fan added a commit that referenced this pull request Dec 22, 2023
…ase for timestamp ntz

backport #44428


Closes #44446 from cloud-fan/ntz2.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Dec 22, 2023
…ase for timestamp ntz

backport #44428


Closes #44446 from cloud-fan/ntz2.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0948e24)
Signed-off-by: Wenchen Fan <[email protected]>
@gengliangwang
Member

Late LGTM! Thanks for the fix.

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…ase for timestamp ntz

backport apache#44428


Closes apache#44446 from cloud-fan/ntz2.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0948e24)
Signed-off-by: Wenchen Fan <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…ase for timestamp ntz (apache#364)

backport apache#44428


Closes apache#44446 from cloud-fan/ntz2.

Authored-by: Wenchen Fan <[email protected]>

Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>