
Conversation

@beliefer
Contributor

@beliefer beliefer commented Jul 30, 2021

What changes were proposed in this pull request?

As per https://orc.apache.org/docs/types.html, ORC includes two different forms of timestamps from the SQL world:

Timestamp is a date and time without a time zone, which does not change based on the time zone of the reader.
Timestamp with local time zone is a fixed instant in time, which does change based on the time zone of the reader.

So the timestamp types in Spark correspond to ORC as follows:

  • TIMESTAMP_LTZ => Timestamp with local time zone
  • TIMESTAMP_NTZ => Timestamp

Unfortunately, in Spark 3.1 and prior, Spark mistakenly treated ORC's Timestamp as TIMESTAMP_LTZ.
That behavior is incorrect, but to keep backward compatibility we do not change it now.
Since 3.2, with support for the timestamp without time zone type:

  • For the ORC writer, we have to map both Spark’s TimestampLTZ and TimestampNTZ to ORC’s Timestamp (the form without a time zone) for backward compatibility. In addition, the writer records the Catalyst column type in ORC’s type information under the key “spark.sql.catalyst.type”.
  • For the ORC reader: if the type information contains the key “spark.sql.catalyst.type”, infer the data type from its value.
    Otherwise, given an ORC TimestampLTZ column, we infer Spark’s TimestampType; given an ORC TimestampNTZ column, the inferred result is also TimestampType (i.e. TimestampLTZ), so the behavior stays consistent with older versions of Spark. A sketch of this round trip follows the list.
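
A minimal sketch of that type-information round trip, assuming ORC 1.6+ type attributes (TypeDescription.setAttribute / getAttributeValue); the names below follow the description above, not necessarily the exact code in this PR:

import org.apache.orc.TypeDescription

val CATALYST_TYPE_KEY = "spark.sql.catalyst.type"

// Writer: record the Catalyst type next to the physical ORC timestamp type.
val orcField = TypeDescription.createTimestamp()
orcField.setAttribute(CATALYST_TYPE_KEY, "timestamp_ntz")

// Reader: prefer the recorded Catalyst type; otherwise fall back to the
// legacy inference (any ORC timestamp => Spark TimestampType).
val catalystType = Option(orcField.getAttributeValue(CATALYST_TYPE_KEY))
  .getOrElse("timestamp")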

Why are the changes needed?

Maps Spark's TimestampNTZ type to ORC's Timestamp type.

Does this PR introduce any user-facing change?

Yes.
Spark can now read/write TimestampNTZ as ORC's Timestamp.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Jul 30, 2021
@SparkQA

SparkQA commented Jul 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46406/

@SparkQA

SparkQA commented Jul 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46406/

@SparkQA

SparkQA commented Jul 30, 2021

Test build #141897 has finished for PR 33588 at commit 3344bb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46448/

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46453/

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46453/

@SparkQA

SparkQA commented Aug 2, 2021

Test build #141938 has finished for PR 33588 at commit ef516ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TimestampsWithNonPrimitiveType(
  • test("test for timestamp types: save and load case class RDD withNones as orc")

@beliefer beliefer changed the title [WIP][SPARK-36346][SQL] Support TimestampNTZ type in Orc file source [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source Aug 2, 2021
@SparkQA

SparkQA commented Aug 2, 2021

Test build #141943 has finished for PR 33588 at commit a1aa093.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46483/

@SparkQA

SparkQA commented Aug 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46483/

@SparkQA

SparkQA commented Aug 3, 2021

Test build #141972 has finished for PR 33588 at commit a1aa093.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Aug 3, 2021

ping @gengliangwang cc @cloud-fan @MaxGekk

@gengliangwang
Member

@beliefer It seems that you are converting the reverse way:

Timestamp => TIMESTAMP_LTZ
Timestamp with local time zone => TIMESTAMP_NTZ

Timestamp with local time zone should be mapped to TIMESTAMP_LTZ, right?

@beliefer
Contributor Author

beliefer commented Aug 3, 2021

ORC includes two different forms of timestamps from the SQL world:

Timestamp is a date and time without a time zone, which does not change based on the time zone of the reader.
Timestamp with local time zone is a fixed instant in time, which does change based on the time zone of the reader.

But the current implementation of Spark incorrectly treats Timestamp => TIMESTAMP_LTZ.
Should we change the behavior, so that it converts as follows:

Timestamp => TIMESTAMP_NTZ
Timestamp with local time zone => TIMESTAMP_LTZ

ping @cloud-fan @gengliangwang @MaxGekk

@SparkQA

SparkQA commented Aug 4, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46535/

@SparkQA

SparkQA commented Aug 4, 2021

Test build #142024 has finished for PR 33588 at commit 0bd85eb.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Aug 5, 2021

retest this please

@SparkQA

SparkQA commented Aug 5, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46569/

@SparkQA

SparkQA commented Aug 5, 2021

Test build #142059 has finished for PR 33588 at commit 0bd85eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47003/

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47003/

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49788/

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145311 has finished for PR 33588 at commit 8e66926.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49791/

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145317 has finished for PR 33588 at commit 51d7651.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

The test failure in GA seems unrelated. @beliefer can you rerun it?

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145320 has finished for PR 33588 at commit 3b70990.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case TimestampNTZType => (ordinal, value) =>
  val ts = value.asInstanceOf[OrcTimestamp]
  updater.setLong(ordinal, DateTimeUtils.millisToMicros(ts.getTime) +
    (ts.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS)
Contributor

@cloud-fan cloud-fan Nov 17, 2021

this appears twice, maybe move it to OrcUtils.fromOrcNTZ?

val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
ts.setNanos(nanos.toInt)
val result = new OrcTimestamp(ts.getTime)
result.setNanos(ts.getNanos)
Contributor

ditto, OrcUtils.toOrcNTZ?
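
A sketch of the two extracted helpers, reconstructed from the fragments quoted above; the constant names are assumed to come from Spark's DateTimeConstants:

import java.sql.Timestamp
import org.apache.orc.mapred.OrcTimestamp
import org.apache.spark.sql.catalyst.util.DateTimeConstants._
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Convert an ORC timestamp (millis plus a nanos field) to NTZ microseconds.
def fromOrcNTZ(ts: Timestamp): Long =
  DateTimeUtils.millisToMicros(ts.getTime) +
    (ts.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS

// Convert NTZ microseconds back to an ORC timestamp, keeping sub-millisecond
// precision in the nanos field.
def toOrcNTZ(micros: Long): OrcTimestamp = {
  val seconds = Math.floorDiv(micros, MICROS_PER_SECOND)
  val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
  val result = new OrcTimestamp(seconds * MILLIS_PER_SECOND)
  result.setNanos(nanos.toInt)
  result
}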

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49808/

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49808/

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145337 has finished for PR 33588 at commit 2c213c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

Thanks, merging to master

@beliefer
Contributor Author

@cloud-fan @gengliangwang @MaxGekk Thanks.

@bersprockets
Contributor

@beliefer

Sorry, doing a post-commit review...

I don't think this is working quite as you expected.

The display value is affected by the time zone of the reader, which should not be the case.

For example, run this code in local mode:

import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
sql("set spark.sql.session.timeZone=America/Los_Angeles")

val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp '2021-06-01 00:00:00' ts")

df.write.mode("overwrite").orc("ts_ntz_orc")
df.write.mode("overwrite").parquet("ts_ntz_parquet")
df.write.mode("overwrite").format("avro").save("ts_ntz_avro")

val query = """
  select 'orc', *
  from `orc`.`ts_ntz_orc`
  union all
  select 'parquet', *
  from `parquet`.`ts_ntz_parquet`
  union all
  select 'avro', *
  from `avro`.`ts_ntz_avro`
"""

val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
for (tz <- tzs) {
  TimeZone.setDefault(TimeZone.getTimeZone(tz))
  sql(s"set spark.sql.session.timeZone=$tz")

  println(s"Time zone is ${TimeZone.getDefault.getID}")
  sql(query).show(false)
}

You will see:

Time zone is America/Los_Angeles
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
+-------+-------------------+-------------------+

Time zone is UTC
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
+-------+-------------------+-------------------+

Time zone is Europe/Amsterdam
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
+-------+-------------------+-------------------+

Note how the display value of ts_ntz varies as the reader's time zone changes, but only for ORC.

This is due to this code in ORC.

For TIMESTAMP_NTZ, Spark treats all values as though UTC is the only time zone, regardless of the actual time zone. However, ORC cares about the reader's and writer's actual time zones (because we don't set the useUTCTimestamp option). ORC remembers the writer's time zone, and when the reader has a different time zone than the writer, ORC "adjusts" the value accordingly (which is why the ts column above behaves more like a TIMESTAMP_NTZ type than the ts_ntz column).

To confirm that this is the issue, I added some code to shift the value to local time on write, and to re-shift it to UTC on read. With that code, the ts_ntz column's display value does not vary across time zones. This is not necessarily a proposed fix, just confirmation of the issue.
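
Not the committed fix, just a hypothetical illustration of that shift (shiftMicros is a name introduced here for illustration):

import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}
import java.time.temporal.ChronoUnit
import java.util.concurrent.TimeUnit

// Re-anchor a micros-since-epoch value: read it as a wall clock in `from`,
// then express that same wall clock as an instant in `to`.
def shiftMicros(micros: Long, from: ZoneId, to: ZoneId): Long = {
  val instant = Instant.EPOCH.plus(micros, ChronoUnit.MICROS)
  val wallClock = LocalDateTime.ofInstant(instant, from)
  val shifted = wallClock.atZone(to).toInstant
  TimeUnit.SECONDS.toMicros(shifted.getEpochSecond) + shifted.getNano / 1000
}

// On write: shiftMicros(ntzMicros, ZoneOffset.UTC, ZoneId.systemDefault())
// On read:  shiftMicros(storedMicros, ZoneId.systemDefault(), ZoneOffset.UTC)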

@cloud-fan
Contributor

@bersprockets good catch! I hope ORC can allow us to write int64 directly as timestamp, like Parquet does. Otherwise, we probably have to do some shifting as you suggested.

@beliefer can you look into master...bersprockets:orc_ntz_issue_play and make a fix? Thanks!

@cloud-fan
Contributor

BTW, it seems the same problem happens with the old timestamp ltz as well. We should have a separate PR to fix it and backport it. Also cc @gengliangwang @MaxGekk

@beliefer
Contributor Author

@cloud-fan Thank you for the ping.

spark.read.orc(file.getAbsolutePath)
}

def withAllOrcReaders(code: => Unit): Unit = {
Member

Hi, all.
This naming is misleading because it only tests the native ORC reader.
Apache Spark provides a Hive ORC reader, too.

Instead of withAllOrcReaders, let's use withAllNativeOrcReaders.
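
A minimal sketch of the renamed helper, assuming the test trait's withSQLConf utility and Spark's SQLConf.ORC_VECTORIZED_READER_ENABLED flag:

import org.apache.spark.sql.internal.SQLConf

// Runs `code` against both native ORC read paths (vectorized and row-based).
// The Hive ORC reader is intentionally not covered, hence the new name.
def withAllNativeOrcReaders(code: => Unit): Unit = {
  withSQLConf(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "true")(code)
  withSQLConf(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "false")(code)
}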

@dongjoon-hyun
Member

BTW, thank you, @bersprockets !

@gengliangwang
Member

@bersprockets good catch, thank you!

cloud-fan pushed a commit that referenced this pull request Mar 24, 2022
### What changes were proposed in this pull request?
#33588 (comment) shows that Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion in #34741 (comment), we fix reading/writing timestamp ntz to ORC by using int64.

### Why are the changes needed?
Fix the bug of reading/writing timestamp ntz from/to ORC across different time zones.

### Does this PR introduce _any_ user-facing change?
Yes. ORC timestamp ntz is a new feature.

### How was this patch tested?
New tests.

Closes #34984 from beliefer/SPARK-37463-int64.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Mar 24, 2022
### What changes were proposed in this pull request?
#33588 (comment) shows that Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion in #34741 (comment), we fix reading/writing timestamp ntz to ORC by using int64.

### Why are the changes needed?
Fix the bug of reading/writing timestamp ntz from/to ORC across different time zones.

### Does this PR introduce _any_ user-facing change?
Yes. ORC timestamp ntz is a new feature.

### How was this patch tested?
New tests.

Closes #34984 from beliefer/SPARK-37463-int64.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e410d98)
Signed-off-by: Wenchen Fan <[email protected]>
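
For reference, a hedged sketch of the int64 shape this follow-up describes (an assumed illustration, not the committed code):

import org.apache.orc.TypeDescription

// Store TIMESTAMP_NTZ as raw microseconds in an ORC LONG column, tagged with
// the Catalyst type, so no reader/writer time-zone adjustment can apply.
val ntzField = TypeDescription.createLong()
ntzField.setAttribute("spark.sql.catalyst.type", "timestamp_ntz")
// Write path: put ntzMicros into the LongColumnVector for this column.
// Read path:  if the attribute says "timestamp_ntz", interpret the long as
//             micros since the epoch with no conversion (TimestampNTZType).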