[SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS #15332
Conversation
Test build #66268 has finished for PR 15332 at commit
Parquet supports both milliseconds and microseconds: we should be able to read both of them, but write as TIMESTAMP_MICROS by default?
Where does this come from? It's different from what's in the Parquet doc (since epoch).
@davies Thank you for your comments, Davies. Per Lian's comment in the JIRA (https://issues.apache.org/jira/browse/SPARK-8824), parquet 1.8, which is what we are currently using, does not support TIMESTAMP_MICROS yet. He suggested we implement TIMESTAMP_MILLIS for now.
@davies On your second comment, let me check and get back to you.
@davies Many thanks! You are right, we need to write it as milliseconds since the epoch. I will send an update. Thanks again!
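For reference, here is a minimal sketch of the two conversions under discussion. Spark SQL's internal TimestampType value is microseconds since the epoch, while Parquet TIMESTAMP_MILLIS stores milliseconds since the epoch; the helper names below are hypothetical, not Spark's actual API.

```scala
// Illustrative sketch only; helper names are hypothetical.
object TimestampMillisSketch {
  // Write path: drop the microsecond fraction. floorDiv keeps pre-epoch
  // (negative) values in the correct millisecond bucket.
  def microsToMillis(us: Long): Long = Math.floorDiv(us, 1000L)

  // Read path: pad a zero microsecond part onto the stored millis.
  def millisToMicros(ms: Long): Long = ms * 1000L
}
```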
Test build #66307 has finished for PR 15332 at commit
retest this please
@davies Unfortunately parquet-mr 1.8.1, which is used by the current master, hadn't included
isParquetINT64AsTimestampMillis here?
@viirya Thank you. Will change.
Return type should be SQLTimestamp? Input type should be Long?
@viirya Will make the change. Thanks!
Test build #66309 has finished for PR 15332 at commit
Test build #66339 has finished for PR 15332 at commit
For the vectorized reader, I think we should also add TimestampType support for INT64 in decodeDictionaryIds?
@dilipbiswal Per our offline discussion, I think you should add TimestampType support for INT64 in decodeDictionaryIds. In order to test it, a test case mixing dictionary-encoded and non-dictionary-encoded values is needed.
I've tested the following test case:

```scala
test("SPARK-10634 timestamp written and read as INT64 - TIMESTAMP_MILLIS") {
  val data = (1 to 1000).map { i =>
    if (i < 500) {
      Row(new java.sql.Timestamp(10))
    } else {
      Row(new java.sql.Timestamp(i))
    }
  }
  val schema = StructType(List(StructField("time", TimestampType, false)).toArray)
  withSQLConf(ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "64",
      ParquetOutputFormat.PAGE_SIZE -> "128") {
    withSQLConf(SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS.key -> "true") {
      withTempPath { file =>
        val df = spark.createDataFrame(sparkContext.parallelize(data), schema)
        df.coalesce(1).write.parquet(file.getCanonicalPath)
        ("true" :: Nil).foreach { vectorized =>
          withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) {
            val df2 = spark.read.parquet(file.getCanonicalPath)
            checkAnswer(df2, df.collect().toSeq)
          }
        }
      }
    }
  }
}
```
It will cause an exception:
```
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 4, localhost): java.lang.UnsupportedOperationException: Unimplemented type: TimestampType
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:256)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:177)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
```
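For context, a rough sketch of the missing branch, written in Scala for brevity even though the actual VectorizedColumnReader is Java; all names below are illustrative, not the real method signature. The idea is that when an INT64 column carries TIMESTAMP_MILLIS data, each dictionary-decoded value must be scaled from milliseconds up to Spark's microsecond representation:

```scala
// Illustrative Scala sketch of logic that belongs in the Java method
// VectorizedColumnReader.decodeDictionaryIds; all names are hypothetical.
def decodeTimestampMillisDictionary(
    rowId: Int,
    num: Int,
    dictionaryIds: Array[Int],              // per-row dictionary ids for this batch
    dictLookup: Int => Long,                // dictionary id -> INT64 millis value
    putLong: (Int, Long) => Unit): Unit = { // writes into the column vector
  var i = rowId
  while (i < rowId + num) {
    val millis = dictLookup(dictionaryIds(i))
    putLong(i, millis * 1000L)              // pad to microseconds for TimestampType
    i += 1
  }
}
```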
@viirya Thanks Simon. Very good catch! I have made the changes.
We have a block of comments and code for TimestampType below. Can we move this branch into that block? And we should add a few comments about this change.
@viirya Done.
Can you add a test that writes a timestamp field with SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS set to true, but reads it back with SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS set to false?
@viirya Done
Should we merge these configs into a single one, spark.sql.parquet.timestampAs (a better name?), which could be int96, millisecond, or microsecond (to be supported in the future)?
@davies Thanks Davies. I have a couple of questions.
- If we externalized a config in prior releases, can we just change it, or do we need to stay backward compatible?
- Reading the description and usage of the existing config 'spark.sql.parquet.int96AsTimestamp', it seems it applies to the read path, whereas the new one introduced in this PR applies to the write path.
- Should we change the semantics of the proposed common property to control only the write encoding, and make reading based solely on the schema metadata, i.e. type + original type? If you agree, then maybe we could go with spark.sql.parquet.timestamp.encoding? I am OK with spark.sql.parquet.timestampAs as well.

Did you want this change as part of this PR? Thanks a lot for your input, as always.
Test build #66401 has finished for PR 15332 at commit
Test build #66403 has finished for PR 15332 at commit
Why this check?
We need to convert the time back to micros, and in the case of lazy decoding we don't get that chance?
Test build #66435 has finished for PR 15332 at commit
LGTM. See if @davies @liancheng have other comments about this.
a907563 to c376b4e (force-pushed)
Test build #67278 has finished for PR 15332 at commit
Would it be helpful to submit a new PR with the conflicts resolved? If not, what are the next steps for this issue?
@saulshanabrook Hello, thanks for your comment. Currently, I am waiting for feedback from @liancheng and @davies. Perhaps this is not a priority now. I will try to resolve the conflicts and push in any case.
c376b4e to 796b6b1 (force-pushed)
Test build #75306 has finished for PR 15332 at commit
ueshin left a comment:
LGTM, except for minor comments.
```scala
writeLegacyParquetFormat = true)

testSchema(
  "Timestmp written and read as INT64 with TIMESTAMP_MILLIS",
```
nit: Timestmp -> Timestamp
@ueshin Thanks. Done.
```scala
if (us < 0 && (us % MILLIS_PER_SECOND < 0)) {
  millis = millis - 1
}
```
Can't we use Math.floor() here, the same as in millisToDays?
@ueshin Thanks a lot. I have made the change.
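A tiny illustration of why floor semantics matter for pre-epoch (negative) values, where plain integer division truncates toward zero:

```scala
val us = -1L                            // one microsecond before the epoch
val truncated = us / 1000L              // 0: lands in the wrong millisecond
val floored = Math.floorDiv(us, 1000L)  // -1: the correct millisecond bucket
```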
Test build #75347 has finished for PR 15332 at commit
squito left a comment:
I just found a couple of tiny nits. I will defer to others on this change.
I guess we'll want to add something similar to SPARK-12297 for int64 as well eventually, but I don't think they need to go in together, especially as both are in flight right now.
```scala
}

/*
 * Converts the timestamp to milliseconds since epoc. In spark timestamp values have microseconds
```
typo: epoch
```diff
  DecimalType.is64BitDecimalType(column.dataType())) {
    defColumn.readLongs(
-       num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+         num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
```
Why this change in indentation? If anything, it looks like it should be indented less than the original.
@squito Thank you for reviewing. I have fixed the indentation.
Test build #75361 has finished for PR 15332 at commit
Thanks! Merging to master.
Thanks a lot @ueshin @viirya @gatorsmile |
Hi @dilipbiswal, do you mind sharing how you generated the testing parquet file? Thanks!
@cloud-fan Hi Wenchen, it's been a while, so I am trying my best to recollect. I think once I had the write code implemented in Spark, I used it to produce files. Depending on the data, Parquet uses different encodings (plain or dictionary). I examined the encodings and the data using parquet-tools. I produced two files and used the merge option in the tools to merge them into one file. This is to the best of my recollection :-)
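For anyone trying to reproduce this, here is a rough sketch of that recipe; the paths, values, and exact parquet-tools invocations are illustrative, and `spark` is assumed to be an existing SparkSession:

```scala
import spark.implicits._

// Low-cardinality data tends to get dictionary encoding; high-cardinality
// data tends to fall back to plain encoding.
val repeated = Seq.fill(1000)(java.sql.Timestamp.valueOf("2016-01-01 00:00:00"))
val distinct = (1 to 1000).map(i => new java.sql.Timestamp(i.toLong))

repeated.toDF("time").coalesce(1).write.parquet("/tmp/dict_encoded")
distinct.toDF("time").coalesce(1).write.parquet("/tmp/plain_encoded")

// Then inspect the encodings and merge the two files with parquet-tools, e.g.:
//   parquet-tools meta <file>
//   parquet-tools merge <fileA> <fileB> <merged>
```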
Great, thanks!
What changes were proposed in this pull request?
Description from JIRA
The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS, but unfortunately parquet-mr doesn't support it yet.
For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a zero microsecond part onto the values we read.
For the write path, we currently write timestamps as INT96, similar to Impala and Hive. One alternative is to add a separate SQL option that lets users write Spark SQL timestamp values as TIMESTAMP_MILLIS; of course, the microsecond part will then be truncated.
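For illustration, a minimal sketch of how the new option would be used; the conf key is assumed from the SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS constant referenced in the review, and `spark` is an existing SparkSession:

```scala
import spark.implicits._

// Assumed conf key, mirroring SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS:
// write timestamps as INT64/TIMESTAMP_MILLIS instead of INT96.
spark.conf.set("spark.sql.parquet.int64AsTimestampMillis", "true")

Seq(java.sql.Timestamp.valueOf("2016-01-01 00:00:00.000001"))
  .toDF("time")
  .write.parquet("/tmp/ts_millis")  // the microsecond part is truncated

// Reading back does not depend on the flag: the Parquet footer records the
// TIMESTAMP_MILLIS logical type, so the schema drives the read path.
spark.read.parquet("/tmp/ts_millis").show()
```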
How was this patch tested?
Added new tests in ParquetQuerySuite and ParquetIOSuite