
Conversation

@beliefer
Contributor

@beliefer beliefer commented Jul 30, 2021

What changes were proposed in this pull request?

As per https://orc.apache.org/docs/types.html, ORC includes two different forms of timestamps from the SQL world:

Timestamp is a date and time without a time zone, which does not change based on the time zone of the reader.
Timestamp with local time zone is a fixed instant in time, which does change based on the time zone of the reader.

So the timestamp types in Spark correspond to ORC as follows:

  • TIMESTAMP_LTZ => Timestamp with local time zone
  • TIMESTAMP_NTZ => Timestamp

Unfortunately, in Spark 3.1 and prior, Spark mistakenly treated ORC's Timestamp as TIMESTAMP_LTZ.
That behavior is incorrect, but to keep backward compatibility we do not change it now.
Since 3.2, with support for the timestamp without time zone type:

  • For the ORC writer, we have to map both Spark’s TimestampLTZ and TimestampNTZ to ORC’s Timestamp (the form without a time zone) for backward compatibility. In addition, the writer records the Catalyst column type in ORC’s type information under the key “spark.sql.catalyst.type”.
  • For the ORC reader: if the type information contains the key “spark.sql.catalyst.type”, infer the data type from its value.
    Otherwise, given an ORC TimestampLTZ column, we infer Spark’s TimestampType; given an ORC TimestampNTZ column, the inferred result is also TimestampType (i.e. TimestampLTZ), so the behavior stays consistent with older versions of Spark. A sketch of this round trip follows the list.
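
A minimal sketch of that type-information round trip, assuming ORC 1.6+ type attributes (TypeDescription.setAttribute / getAttributeValue); the names below follow the description above, not necessarily the exact code in this PR:

import org.apache.orc.TypeDescription

val CATALYST_TYPE_KEY = "spark.sql.catalyst.type"

// Writer: record the Catalyst type next to the physical ORC timestamp type.
val orcField = TypeDescription.createTimestamp()
orcField.setAttribute(CATALYST_TYPE_KEY, "timestamp_ntz")

// Reader: prefer the recorded Catalyst type; otherwise fall back to the
// legacy inference (any ORC timestamp => Spark TimestampType).
val catalystType = Option(orcField.getAttributeValue(CATALYST_TYPE_KEY))
  .getOrElse("timestamp")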

Why are the changes needed?

Maps Spark's TimestampNTZ type to ORC's Timestamp type.

Does this PR introduce any user-facing change?

Yes.
Spark can now read/write TimestampNTZ as ORC's Timestamp.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Jul 30, 2021
@SparkQA

SparkQA commented Jul 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46406/

@SparkQA

SparkQA commented Jul 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46406/

@SparkQA

SparkQA commented Jul 30, 2021

Test build #141897 has finished for PR 33588 at commit 3344bb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46448/

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46453/

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46453/

@SparkQA

SparkQA commented Aug 2, 2021

Test build #141938 has finished for PR 33588 at commit ef516ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TimestampsWithNonPrimitiveType(
  • test("test for timestamp types: save and load case class RDD withNones as orc")

@beliefer beliefer changed the title [WIP][SPARK-36346][SQL] Support TimestampNTZ type in Orc file source [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source Aug 2, 2021
@SparkQA

SparkQA commented Aug 2, 2021

Test build #141943 has finished for PR 33588 at commit a1aa093.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46483/

@SparkQA

SparkQA commented Aug 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46483/

@SparkQA

SparkQA commented Aug 3, 2021

Test build #141972 has finished for PR 33588 at commit a1aa093.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Aug 3, 2021

ping @gengliangwang cc @cloud-fan @MaxGekk

@gengliangwang
Member

@beliefer It seems that you are converting the reverse way:

Timestamp => TIMESTAMP_LTZ
Timestamp with local time zone => TIMESTAMP_NTZ

Timestamp with local time zone should be mapped to TIMESTAMP_LTZ, right?

@beliefer
Contributor Author

beliefer commented Aug 3, 2021

ORC includes two different forms of timestamps from the SQL world:

Timestamp is a date and time without a time zone, which does not change based on the time zone of the reader.
Timestamp with local time zone is a fixed instant in time, which does change based on the time zone of the reader.

But the current implementation of Spark incorrectly treats Timestamp => TIMESTAMP_LTZ.
Should we change the behavior, so that it converts as follows:

Timestamp => TIMESTAMP_NTZ
Timestamp with local time zone => TIMESTAMP_LTZ

ping @cloud-fan @gengliangwang @MaxGekk

@SparkQA

SparkQA commented Aug 4, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46535/

@SparkQA

SparkQA commented Aug 4, 2021

Test build #142024 has finished for PR 33588 at commit 0bd85eb.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Aug 5, 2021

retest this please

@SparkQA

SparkQA commented Aug 5, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46569/

@SparkQA

SparkQA commented Aug 5, 2021

Test build #142059 has finished for PR 33588 at commit 0bd85eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47003/

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47003/

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49788/

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145311 has finished for PR 33588 at commit 8e66926.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49791/

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145317 has finished for PR 33588 at commit 51d7651.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

The test failure in GA seems unrelated. @beliefer can you rerun it?

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145320 has finished for PR 33588 at commit 3b70990.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case TimestampNTZType => (ordinal, value) =>
  val ts = value.asInstanceOf[OrcTimestamp]
  updater.setLong(ordinal, DateTimeUtils.millisToMicros(ts.getTime) +
    (ts.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS)
Contributor

@cloud-fan cloud-fan Nov 17, 2021

this appears twice, maybe move it to OrcUtils.fromOrcNTZ?

val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
ts.setNanos(nanos.toInt)
val result = new OrcTimestamp(ts.getTime)
result.setNanos(ts.getNanos)
Contributor

ditto, OrcUtils.toOrcNTZ?
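
A sketch of the two extracted helpers, reconstructed from the fragments quoted above; the constant names are assumed to come from Spark's DateTimeConstants:

import java.sql.Timestamp
import org.apache.orc.mapred.OrcTimestamp
import org.apache.spark.sql.catalyst.util.DateTimeConstants._
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Convert an ORC timestamp (millis plus a nanos field) to NTZ microseconds.
def fromOrcNTZ(ts: Timestamp): Long =
  DateTimeUtils.millisToMicros(ts.getTime) +
    (ts.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS

// Convert NTZ microseconds back to an ORC timestamp, keeping sub-millisecond
// precision in the nanos field.
def toOrcNTZ(micros: Long): OrcTimestamp = {
  val seconds = Math.floorDiv(micros, MICROS_PER_SECOND)
  val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
  val result = new OrcTimestamp(seconds * MILLIS_PER_SECOND)
  result.setNanos(nanos.toInt)
  result
}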

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49808/

@SparkQA

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49808/

@SparkQA

SparkQA commented Nov 17, 2021

Test build #145337 has finished for PR 33588 at commit 2c213c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

Thanks, merging to master

@beliefer
Contributor Author

@cloud-fan @gengliangwang @MaxGekk Thanks.

@bersprockets
Contributor

@beliefer

Sorry, doing a post-commit review...

I don't think this is working quite as you expected.

The display value is affected by the time zone of the reader, which should not be the case.

For example, run this code in local mode:

import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
sql("set spark.sql.session.timeZone=America/Los_Angeles")

val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp '2021-06-01 00:00:00' ts")

df.write.mode("overwrite").orc("ts_ntz_orc")
df.write.mode("overwrite").parquet("ts_ntz_parquet")
df.write.mode("overwrite").format("avro").save("ts_ntz_avro")

val query = """
  select 'orc', *
  from `orc`.`ts_ntz_orc`
  union all
  select 'parquet', *
  from `parquet`.`ts_ntz_parquet`
  union all
  select 'avro', *
  from `avro`.`ts_ntz_avro`
"""

val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
for (tz <- tzs) {
  TimeZone.setDefault(TimeZone.getTimeZone(tz))
  sql(s"set spark.sql.session.timeZone=$tz")

  println(s"Time zone is ${TimeZone.getDefault.getID}")
  sql(query).show(false)
}

You will see:

Time zone is America/Los_Angeles
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
+-------+-------------------+-------------------+

Time zone is UTC
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
+-------+-------------------+-------------------+

Time zone is Europe/Amsterdam
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
+-------+-------------------+-------------------+

Note how the display value of ts_ntz varies as the reader's time zone changes, but only for ORC.

This is due to this code in ORC.

For TIMESTAMP_NTZ, Spark treats all values as though UTC is the only time zone, regardless of the actual time zone. However, ORC cares about the reader's and writer's actual time zones (because we don't set the useUTCTimestamp option). ORC remembers the writer's time zone, and when the reader has a different time zone than the writer, ORC "adjusts" the value accordingly (which is why the ts column above behaves more like a TIMESTAMP_NTZ type than the ts_ntz column).

To confirm that this is the issue, I added some code to shift the value to local time on write, and to re-shift it to UTC on read. With that code, the ts_ntz column's display value does not vary across time zones. This is not necessarily a proposed fix, just confirmation of the issue.
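
Not the committed fix, just a hypothetical illustration of that shift (shiftMicros is a name introduced here for illustration):

import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}
import java.time.temporal.ChronoUnit
import java.util.concurrent.TimeUnit

// Re-anchor a micros-since-epoch value: read it as a wall clock in `from`,
// then express that same wall clock as an instant in `to`.
def shiftMicros(micros: Long, from: ZoneId, to: ZoneId): Long = {
  val instant = Instant.EPOCH.plus(micros, ChronoUnit.MICROS)
  val wallClock = LocalDateTime.ofInstant(instant, from)
  val shifted = wallClock.atZone(to).toInstant
  TimeUnit.SECONDS.toMicros(shifted.getEpochSecond) + shifted.getNano / 1000
}

// On write: shiftMicros(ntzMicros, ZoneOffset.UTC, ZoneId.systemDefault())
// On read:  shiftMicros(storedMicros, ZoneId.systemDefault(), ZoneOffset.UTC)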

@cloud-fan
Contributor

@bersprockets good catch! I hope ORC can allow us to write int64 directly as timestamp, like Parquet does. Otherwise, we probably have to do some shifting as you suggested.

@beliefer can you look into master...bersprockets:orc_ntz_issue_play and make a fix? Thanks!

@cloud-fan
Contributor

BTW, it seems the same problem happens with the old timestamp ltz as well. We should have a separate PR to fix it and backport it. Also cc @gengliangwang @MaxGekk

@beliefer
Contributor Author

@cloud-fan Thank you for the ping.

spark.read.orc(file.getAbsolutePath)
}

def withAllOrcReaders(code: => Unit): Unit = {
Member

Hi, all.
This naming is misleading because it only tests the native ORC reader.
Apache Spark provides a Hive ORC reader, too.

Instead of withAllOrcReaders, let's use withAllNativeOrcReaders.
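
A minimal sketch of the renamed helper, assuming the test trait's withSQLConf utility and Spark's SQLConf.ORC_VECTORIZED_READER_ENABLED flag:

import org.apache.spark.sql.internal.SQLConf

// Runs `code` against both native ORC read paths (vectorized and row-based).
// The Hive ORC reader is intentionally not covered, hence the new name.
def withAllNativeOrcReaders(code: => Unit): Unit = {
  withSQLConf(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "true")(code)
  withSQLConf(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "false")(code)
}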

@dongjoon-hyun
Member

BTW, thank you, @bersprockets !

@gengliangwang
Member

@bersprockets good catch, thank you!

cloud-fan pushed a commit that referenced this pull request Mar 24, 2022
### What changes were proposed in this pull request?
#33588 (comment) shows that Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion in #34741 (comment), we fix reading/writing timestamp ntz to ORC by using int64.

### Why are the changes needed?
Fix the bug of reading/writing timestamp ntz from/to ORC across different time zones.

### Does this PR introduce _any_ user-facing change?
Yes. ORC timestamp ntz is a new feature.

### How was this patch tested?
New tests.

Closes #34984 from beliefer/SPARK-37463-int64.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Mar 24, 2022
### What changes were proposed in this pull request?
#33588 (comment) shows that Spark cannot read/write timestamp ntz and ltz correctly. Based on the discussion in #34741 (comment), we fix reading/writing timestamp ntz to ORC by using int64.

### Why are the changes needed?
Fix the bug of reading/writing timestamp ntz from/to ORC across different time zones.

### Does this PR introduce _any_ user-facing change?
Yes. ORC timestamp ntz is a new feature.

### How was this patch tested?
New tests.

Closes #34984 from beliefer/SPARK-37463-int64.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e410d98)
Signed-off-by: Wenchen Fan <[email protected]>
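
For reference, a hedged sketch of the int64 shape this follow-up describes (an assumed illustration, not the committed code):

import org.apache.orc.TypeDescription

// Store TIMESTAMP_NTZ as raw microseconds in an ORC LONG column, tagged with
// the Catalyst type, so no reader/writer time-zone adjustment can apply.
val ntzField = TypeDescription.createLong()
ntzField.setAttribute("spark.sql.catalyst.type", "timestamp_ntz")
// Write path: put ntzMicros into the LongColumnVector for this column.
// Read path:  if the attribute says "timestamp_ntz", interpret the long as
//             micros since the epoch with no conversion (TimestampNTZType).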