
Conversation

@beliefer
Contributor

What changes were proposed in this pull request?

#33588 (comment) shows that Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion in #34712 (comment), this PR fixes reading/writing timestamp ntz from/to ORC by using UTC timestamps.

The root cause is that ORC writes and reads timestamps with the local timezone by default, and that timezone can differ between writer and reader.
If the ORC writer writes a timestamp with one local timezone (e.g. America/Los_Angeles) and the ORC reader reads it with another local timezone (e.g. Europe/Amsterdam), the timestamp value will be different.

If we let the ORC writer write the timestamp with the UTC timezone, and the ORC reader reads it with the UTC timezone too, the timestamp value will be correct.
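
As a rough sketch of the idea (assuming Spark's internal DateTimeUtils helpers; the exact call sites live in the patch), the NTZ value is shifted to the UTC wall clock before it is handed to ORC, and shifted back symmetrically on read:

import java.time.LocalDateTime
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// A TIMESTAMP_NTZ value: a wall-clock date-time encoded as micros since the epoch, as if in UTC.
val micros = DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(2021, 11, 29, 10, 30, 0))

// Writer side: shift the wall clock from the JVM default zone to UTC before passing it to ORC.
val utcMicros = DateTimeUtils.toUTCTime(micros, TimeZone.getDefault.getID)

// Reader side: the symmetric shift back (here with the same JVM default zone) recovers the wall clock.
val restored = DateTimeUtils.fromUTCTime(utcMicros, TimeZone.getDefault.getID)
assert(restored == micros)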

The related ORC source:
https://github.com/apache/orc/blob/3f1e57cf1cebe58027c1bd48c09eef4e9717a9e3/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L525

https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1184

Why are the changes needed?

Fix the bug where reading/writing timestamp ntz from/to ORC across different time zones produces wrong values.

Does this PR introduce any user-facing change?

No. ORC timestamp ntz is a new feature that has not been released yet.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Nov 29, 2021

SparkQA commented Nov 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50187/


SparkQA commented Nov 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50187/


SparkQA commented Nov 29, 2021

Test build #145717 has finished for PR 34741 at commit 1c6da02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50191/


SparkQA commented Nov 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50191/


SparkQA commented Nov 29, 2021

Test build #145721 has finished for PR 34741 at commit f500cf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

ping @cloud-fan

def toOrcNTZ(micros: Long): OrcTimestamp = {
val seconds = Math.floorDiv(micros, MICROS_PER_SECOND)
val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
val utcMicros = DateTimeUtils.toUTCTime(micros, TimeZone.getDefault.getID)

bersprockets (Contributor) commented Nov 30, 2021

There is at least one issue with dates before 1883-11-17. Railway time zones didn't exist then, and java.time classes (which fromUTCTime/toUTCTime use) care about that.

Also, I surmise there will be an additional issue with pre-Gregorian values, since ORC assumes Julian when "shifting" values on read.

Even my POC has issues with pre-1883-11-17 values when the writer is in the Pacific/Pago_Pago time zone and the reader is in some other time zone, because it also uses the fromUTCTime/toUTCTime utility functions.

When dealing with hybrid Julian times, ORC doesn't have issues shifting values between time zones that didn't exist yet, since the old time-related classes didn't worry about that.

Contributor

While you're looking at this, I will also try to see if there is a way to safely shift the values before 1883-11-17 between time zones (probably will need julian<->gregorian rebases).

If there isn't a reasonable way, we might need to consider some other ORC datatype for storing these values (maybe Long?). Don't know if that's doable or reasonable...

Contributor

I'm trying to understand this issue better. From the ORC source code, it seems like:

  1. the ORC writer shifts the timestamp value w.r.t. the JVM local timezone, and records the timezone in the file footer;
  2. the ORC reader shifts the timestamp value w.r.t. both the JVM local timezone and the recorded writer timezone.

It seems like we only need to change the ORC reader to shift the timestamp value by the writer timezone?

bersprockets (Contributor) commented Nov 30, 2021

I'm trying to understand this issue better. From the ORC source code, it seems like:

Almost: for the first point, the ORC writer stores the timestamp value passed by the caller as-is, and records the timezone in the file footer (that's why we need to shift the value before passing it to ORC).

The devil, however, is in the details. I know enough to know there are issues with the above, but not enough to know every issue or solution.

For example, an offset for a particular time zone is not fixed. It changes depending on the point on the timeline.

When shifting, ORC determines the offsets for the two time zones based on the stored timestamp value (millis) (see here). Because ORC uses the old time APIs to do this, we should ensure we pass a correct Hybrid Julian epoch time to ORC (via a Timestamp object), or ORC could get the wrong offsets.
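
For reference, a rough paraphrase of that shifting logic in Scala (not ORC's exact code), just to make the dependence on the stored millis value explicit:

import java.util.TimeZone

// Both offsets are looked up from the stored millis via the legacy java.util.TimeZone API;
// a second lookup re-checks the reader offset at the adjusted instant, since it can differ near DST changes.
def shiftBetweenTimezones(writerTz: TimeZone, readerTz: TimeZone, millis: Long): Long = {
  val writerOffset = writerTz.getOffset(millis)
  val readerOffset = readerTz.getOffset(millis)
  val adjusted = millis + writerOffset - readerOffset
  millis + writerOffset - readerTz.getOffset(adjusted)
}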

Because of that, I don't think you can pass an arbitrary long value in the Timestamp object and hope to properly reconstruct it on read. You might not know how ORC shifted it.

In addition, the fromUTCTime/toUTCTime/convertTZ utility methods work correctly, but are not appropriate for our needs here. They use the new Java time APIs, which don't work as we'd like for timestamps before the introduction of Railway time (circa 1883-11-17).

Emulating what fromUTCTime does, you can see the issue:

scala> val ldtUtc = ZonedDateTime.of(1883, 11, 16, 0, 0, 0, 0, ZoneId.of("UTC")).toLocalDateTime
ldtUtc: java.time.LocalDateTime = 1883-11-16T00:00

scala> val zdtShifted = ldtUtc.atZone(ZoneId.of("America/Los_Angeles"))
zdtShifted: java.time.ZonedDateTime = 1883-11-16T00:00-07:52:58[America/Los_Angeles]

scala>

Note how the offset that the new APIs applied to the shifted value is -07:52:58, which is not evenly divisible by 1 hour or 30 minutes.

As a result, you end up with a timestamp like this:

scala> val ts = new Timestamp(zdtShifted.toInstant.toEpochMilli)
ts: java.sql.Timestamp = 1883-11-15 23:52:58.0

scala> 

In fact, you can see a compounded version of this with this PR:

scala> TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

scala> sql("select timestamp_ntz'1883-11-16 00:00:00.0' as ts_ntz").write.mode("overwrite").orc("test")

scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> spark.read.orc("test").show(false)
+-------------------+
|ts_ntz             |
+-------------------+
|1883-11-16 00:07:02|
+-------------------+

scala> 

Edit: I assume the above is because of pre-Railway shifting; I didn't verify that.

I updated my POC to attempt to accommodate these issues by:

  • Stealing the timestamp shifting technique from ORC code (thus avoiding fromUTCTime/toUTCTime)
  • Passing Hybrid Julian values to ORC on write, and assuming ORC retrieves a Hybrid Julian value on read.

That seems to have eliminated the pre-Railway and pre-Gregorian issues.
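
For the second bullet, a minimal sketch of the kind of rebase involved (assuming Spark's internal RebaseDateTime utility):

import java.time.LocalDateTime
import org.apache.spark.sql.catalyst.util.{DateTimeUtils, RebaseDateTime}

// A pre-Gregorian wall clock, encoded as Proleptic Gregorian micros (Spark's internal representation).
val gregorianMicros = DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(1001, 1, 1, 0, 0, 0))

// Rebase to the hybrid Julian calendar that java.sql.Timestamp / java.util.TimeZone
// (and therefore ORC's shifting) assume, before building the Timestamp handed to ORC.
val julianMicros = RebaseDateTime.rebaseGregorianToJulianMicros(gregorianMicros)

// The read path would undo the shift and then apply RebaseDateTime.rebaseJulianToGregorianMicros.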

But there is more...

The ORC API accepts and returns Timestamp objects for writing and reading ORC timestamp fields. This alone introduces some oddities that will be noticeable to end users.

For example, not all timestamp_ntz values can be represented with Timestamp in all time zones. The date/time '2021-03-14 02:15:00.0' doesn't exist in the America/Los_Angeles time zone.

scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> val ts = java.sql.Timestamp.valueOf("2021-03-14 02:15:00.0")
ts: java.sql.Timestamp = 2021-03-14 03:15:00.0

scala>

So we will have oddities like this (using my POC code):

scala> import java.util.TimeZone
import java.util.TimeZone

scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> sql("select timestamp_ntz'2021-03-14 02:15:00.0' as ts_ntz").write.mode("overwrite").orc("test")

scala> spark.read.orc("test").show(false)
+-------------------+
|ts_ntz             |
+-------------------+
|2021-03-14 01:15:00|
+-------------------+


scala> 

With this PR, you actually see it not with "spring forward", but with "fall back" (not sure why):

scala> import java.util.TimeZone
import java.util.TimeZone

scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> sql("select timestamp_ntz'1996-10-27T09:10:25.088353' as ts_ntz").write.mode("overwrite").orc("test")

scala> spark.read.orc("test").show(false)
+--------------------------+
|ts_ntz                    |
+--------------------------+
|1996-10-27 08:10:25.088353|
+--------------------------+


scala>

So to summarize:

  • ORC needs the correct epoch values (pre-shifted, rebased to Hybrid Julian) to determine the correct offsets when shifting on read.
  • You can't use fromUTCTime/toUTCTime (or convertTZ) for shifting pre-Railway datetime values.
  • For the ORC timestamp type, the ORC API receives and returns Timestamp objects, but Timestamp objects alone introduce oddities.
  • It would certainly be nice if we could save and retrieve a long value (à la useUTCTimestamp) without affecting timestamp_ltz.

Edit: caveat: I am no expert in the time APIs.

Contributor Author

@bersprockets Thank you for the good investigation. I referenced https://github.com/apache/orc/blob/334bf1f2c605f38c7e75ec81d1dab93c31fc8459/java/core/src/java/org/apache/orc/impl/SerializationUtils.java#L1444 and tried to reverse the conversion.

This PR handles the first and second examples successfully, but the last one, '1996-10-27T09:10:25.088353', still outputs '1996-10-27 08:10:25.088353'.

@beliefer
Contributor Author

ping @cloud-fan


SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50205/


SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50205/


SparkQA commented Nov 30, 2021

Test build #145735 has finished for PR 34741 at commit ee11bf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UDFBasicProfiler(BasicProfiler):
  • case class PrettyPythonUDF(


SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50243/


SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50245/


SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50243/


SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50245/


SparkQA commented Nov 30, 2021

Test build #145772 has finished for PR 34741 at commit 8e17559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50265/


SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50265/


SparkQA commented Dec 1, 2021

Test build #145792 has finished for PR 34741 at commit c289b97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50284/


SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50284/

@cloud-fan
Contributor

cloud-fan commented Dec 1, 2021

@bersprockets After reading more ORC code, I feel the timestamp implementation is quite messy in ORC. Not only the reader side, but also the writer side shifts the timestamp value according to the JVM local timezone: https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L112-L113

It seems like the ORC lib (in its default behavior) is designed for people who want to deal with java.sql.Timestamp directly, not for an engine like Spark that only treats ORC as a storage layer. Spark should have set useUTCTimestamp to true from the beginning, but now it's too late, as we need to support existing ORC files written by old Spark versions.

To fix the mistake in the storage layer, we probably need years to do a smooth migration. My proposal is:
Phase 1:

  1. Write TIMESTAMP_NTZ as ORC int64, with a column property to indicate it's TIMESTAMP_NTZ (writing TIMESTAMP_LTZ should add the column property as well); see the sketch below.
  2. Support reading ORC TIMESTAMP_INSTANT as Spark TIMESTAMP_LTZ.
  3. When reading ORC TIMESTAMP, check the column property to get the actual type (LTZ or NTZ).

Phase 2 (after Spark 3.3 becomes the oldest officially supported version):

  1. Write LTZ as ORC TIMESTAMP_INSTANT
  2. Write NTZ as ORC TIMESTAMP, with useUTCTimestamp set to true.
  3. Set useUTCTimestamp to true in the reader if the ORC file was written by the latest Spark version.

With this proposal, we can achieve:

  1. ORC files written by Spark 3.3 can be correctly read back by Spark 3.3
  2. Old Spark versions and other systems will read TIMESTAMP_NTZ as long (not a big issue)
  3. Old ORC files can still be correctly read by Spark 3.3
  4. When phase 2 is completed, all the supported Spark versions and other systems can read the ORC files written by Spark correctly.
  5. When phase 2 is completed, all the supported Spark versions can still read old ORC files correctly (we can look at the Spark version in the file footer to decide whether to set useUTCTimestamp to true for the reader)
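
A minimal sketch of phase 1, item 1 (assuming ORC's TypeDescription attribute API; the attribute key and value are illustrative, not necessarily what the follow-up PR uses):

import org.apache.orc.TypeDescription

// Store TIMESTAMP_NTZ as an ORC bigint column (micros since the epoch) and tag the column
// so readers can recover the Catalyst type instead of reading plain longs.
val schema = TypeDescription.createStruct()
val ntzColumn = TypeDescription.createLong()
ntzColumn.setAttribute("spark.sql.catalyst.type", "timestamp_ntz")
schema.addField("ts_ntz", ntzColumn)

// On read: a LONG column carrying this attribute is interpreted as TIMESTAMP_NTZ micros.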

also cc @gengliangwang @MaxGekk


SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50290/


SparkQA commented Dec 1, 2021

Test build #145810 has finished for PR 34741 at commit 6aa5053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50290/


SparkQA commented Dec 1, 2021

Test build #145815 has finished for PR 34741 at commit 03a7640.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

@cloud-fan I have two concerns about your proposal:

  • Storing NTZ as ORC int64 will bring an extra burden of supporting this feature in future releases.
  • Phase 2 introduces behavior changes, and I don't think we can make them easily.

I would prefer storing it as ORC NTZ and shifting the time zones on read/write; e.g. currently LTZ actually works like NTZ, per the comment #33588 (comment).
Or, we can try setting useUTCTimestamp and see how to keep the current LTZ behavior.

@cloud-fan
Contributor

@gengliangwang please read the discussions in the previous PRs. We tried your proposals and none of them worked:

  1. ORC stores the timezone per "column chunk", not per file, and we can't read this timezone info in the row-based ORC reader to shift the timestamp values.
  2. useUTCTimestamp is a global conf. If we set it, we break TIMESTAMP_LTZ.

I don't think there is a better option. Phase 2 is not a breaking change if no one is using a Spark version < 3.3. It may take years, but it's still possible. Before that, we are still in good shape; the only problem is other systems reading ORC files written by Spark.

@gengliangwang
Member

@cloud-fan I see. Let's go with your proposal.

cloud-fan pushed a commit that referenced this pull request Mar 24, 2022
### What changes were proposed in this pull request?
#33588 (comment) shows that Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion in #34741 (comment), this PR fixes reading/writing timestamp ntz from/to ORC by using int64.

### Why are the changes needed?
Fix the bug where reading/writing timestamp ntz from/to ORC across different time zones produces wrong values.

### Does this PR introduce _any_ user-facing change?
Yes. Orc timestamp ntz is a new feature.

### How was this patch tested?
New tests.

Closes #34984 from beliefer/SPARK-37463-int64.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Mar 24, 2022
### What changes were proposed in this pull request?
#33588 (comment) shows that Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion in #34741 (comment), this PR fixes reading/writing timestamp ntz from/to ORC by using int64.

### Why are the changes needed?
Fix the bug where reading/writing timestamp ntz from/to ORC across different time zones produces wrong values.

### Does this PR introduce _any_ user-facing change?
Yes. Orc timestamp ntz is a new feature.

### How was this patch tested?
New tests.

Closes #34984 from beliefer/SPARK-37463-int64.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e410d98)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

This is replaced by #34984
