Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

Description from JIRA

The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS, but unfortunately parquet-mr hasn't supported it yet.
For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a zero microsecond part onto the values read.
For the write path, we currently write timestamps as INT96, similar to Impala and Hive. One alternative is to add a separate SQL option that lets users write Spark SQL timestamp values as TIMESTAMP_MILLIS. Of course, the microsecond part is truncated in this case.
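
As a rough illustration of the two conversions described above (a minimal sketch only; the constant and method names are illustrative, not the ones used in the patch):

// Spark SQL timestamps are microseconds since the epoch.
val MICROS_PER_MILLIS = 1000L

// Read path: a TIMESTAMP_MILLIS value is padded with a zero microsecond part.
def millisToMicros(millis: Long): Long = millis * MICROS_PER_MILLIS

// Write path: the sub-millisecond part is dropped. Math.floorDiv rounds toward
// negative infinity, so pre-1970 (negative) timestamps are still converted correctly.
def microsToMillis(micros: Long): Long = Math.floorDiv(micros, MICROS_PER_MILLIS)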

How was this patch tested?

Added new tests in ParquetQuerySuite and ParquetIOSuite

@SparkQA

SparkQA commented Oct 3, 2016

Test build #66268 has finished for PR 15332 at commit 4e040e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Oct 3, 2016

Parquet supports both milliseconds and microseconds:

TIMESTAMP_MILLIS

TIMESTAMP_MILLIS is used for a combined logical date and time type, with millisecond precision. It must annotate an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.

TIMESTAMP_MICROS

TIMESTAMP_MICROS is used for a combined logical date and time type with microsecond precision. It must annotate an int64 that stores the number of microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.

We should be able to read both of them, but write it as TIMESTAMP_MICROS by default?
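
For reference, this is roughly how such an annotated int64 field is declared through parquet-mr's schema builder (a sketch against the parquet-mr 1.8.x API; the field and message names are made up):

import org.apache.parquet.schema.{MessageType, OriginalType, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64

// An int64 column annotated with the TIMESTAMP_MILLIS logical type.
val schema: MessageType = Types.buildMessage()
  .addField(Types.required(INT64).as(OriginalType.TIMESTAMP_MILLIS).named("ts"))
  .named("spark_schema")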

Contributor


Where does this come from? It's different from what's in the Parquet doc (since epoch).

Contributor Author


@davies Thank you for your comments Davies. Per Lian's comment in the JIRA,
https://issues.apache.org/jira/browse/SPARK-8824
parquet 1.8, which is what we are using currently, does not support TIMESTAMP_MICROS yet. He suggested we implement TIMESTAMP_MILLIS for now.

Contributor Author

@dilipbiswal dilipbiswal Oct 3, 2016


@davies On your second comment, let me check and get back to you.

Contributor Author


@davies Many thanks! You are right. We need to write it as milliseconds since the epoch. I will send an update. Thanks again!

@SparkQA

SparkQA commented Oct 4, 2016

Test build #66307 has finished for PR 15332 at commit 360e0d9.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@liancheng
Contributor

@davies Unfortunately parquet-mr 1.8.1, which is used by the current master, hasn't included TIMESTAMP_MICROS yet. To be more specific, OriginalType in parquet-mr 1.8.1 doesn't include TIMESTAMP_MICROS as a member. So I think only supporting TIMESTAMP_MILLIS is reasonable here.

Member


isParquetINT64AsTimestampMillis here?

Contributor Author


@viirya Thank you. Will change.

Member


Return type should be SQLTimestamp? Input type should be Long?

Contributor Author


@viirya Will make the change. Thanks!

@SparkQA

SparkQA commented Oct 4, 2016

Test build #66309 has finished for PR 15332 at commit 360e0d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 5, 2016

Test build #66339 has finished for PR 15332 at commit 9a486ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal dilipbiswal changed the title [SPARK-10634][SQL] Support Parquet logical type TIMESTAMP_MILLIS [SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS Oct 5, 2016
Member


For the vectorized reader, I think we should also add TimestampType support for INT64 in decodeDictionaryIds?

Member


@dilipbiswal Per our offline discussion, I think you should add TimestampType support for INT64 in decodeDictionaryIds. In order to test it, a test case mixing dictionary-encoded and non-dictionary-encoded values is needed.

Member


I've tested the following test case:

test("SPARK-10634 timestamp written and read as INT64 - TIMESTAMP_MILLIS") {
  val data = (1 to 1000).map { i =>
    if (i < 500) {
      Row(new java.sql.Timestamp(10))
    } else {
      Row(new java.sql.Timestamp(i))
    }
  }
  val schema = StructType(List(StructField("time", TimestampType, false)).toArray)
  withSQLConf(ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "64",
      ParquetOutputFormat.PAGE_SIZE -> "128") {
    withSQLConf(SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS.key -> "true") {
      withTempPath { file =>
        val df = spark.createDataFrame(sparkContext.parallelize(data), schema)
        df.coalesce(1).write.parquet(file.getCanonicalPath)
        ("true" :: Nil).foreach { vectorized =>
          withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) {
            val df2 = spark.read.parquet(file.getCanonicalPath)
            checkAnswer(df2, df.collect().toSeq)
          }
        }
      }
    }
  }
}

It will cause an exception:

[info]  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 4, localhost): java.lang.UnsupportedOperationException: Unimplemented type: TimestampType
[info]  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:256)
[info]  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:177)
[info]  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)

Contributor Author

@dilipbiswal dilipbiswal Oct 5, 2016


@viirya Thanks Simon. Very good catch! I have made the changes.

Member


We have a block of comments and code for TimestampType below. Can we move this branch into that block? And we should add a few comments about this change.

Contributor Author


@viirya Done.

Member


Can you add a test that writes a timestamp field with SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS set to true, but reads it with SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS set to false?

Contributor Author


@viirya Done

Contributor


Should we merge these configs into a single one, spark.sql.parquet.timestampAs (a better name?), which could be int96, millisecond, or microsecond (support in future)?

Contributor Author

@dilipbiswal dilipbiswal Oct 5, 2016


@davies Thanks Davies. I have a couple of questions.

  1. If we externalized a config in prior releases, can we just change it, or do we need to stay backward compatible?
  2. Reading the description and usage of the existing config 'spark.sql.parquet.int96AsTimestamp', it seems that it applies to reads, whereas the new one introduced in this PR applies to writes.
  3. Should we change the semantics of the proposed common property to control the write encoding only, and make reading based solely on the schema metadata, i.e. type + original type? If you agree, maybe we could go with spark.sql.parquet.timestamp.encoding? I am OK with spark.sql.parquet.timestampAs as well.

Did you want this change as part of this PR? Thanks a lot for your input, as always.
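
For illustration, a merged option along the lines of item 3 might look like this (a sketch only: the name, the accepted values, and the use of SQLConf's internal SQLConfigBuilder are all assumptions, not an actual config):

// Hypothetical single option replacing the separate boolean flags.
val PARQUET_TIMESTAMP_ENCODING = SQLConfigBuilder("spark.sql.parquet.timestamp.encoding")
  .doc("How Spark SQL timestamps are encoded when written to Parquet: as INT96, " +
    "as TIMESTAMP_MILLIS, or (once parquet-mr supports it) as TIMESTAMP_MICROS.")
  .stringConf
  .checkValues(Set("int96", "millis", "micros"), "Must be one of int96, millis, micros.")
  .createWithDefault("int96")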

@SparkQA

SparkQA commented Oct 5, 2016

Test build #66401 has finished for PR 15332 at commit aa4fab6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 6, 2016

Test build #66403 has finished for PR 15332 at commit ddc957f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member


Why this check?

Contributor Author


We need to convert the time back to micros, and in the case of lazy decoding we don't get that chance?

@SparkQA

SparkQA commented Oct 6, 2016

Test build #66435 has finished for PR 15332 at commit a907563.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Oct 7, 2016

LGTM. See if @davies @liancheng have other comments about this.

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67278 has finished for PR 15332 at commit c376b4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@saulshanabrook

Would it be helpful to submit a new PR with the conflicts resolved? If not, what are the next steps for this issue?

@dilipbiswal
Contributor Author

@saulshanabrook Hello, thanks for your comment. Currently, I am waiting for feedback from @liancheng and @davies. Perhaps this is not a priority now. I will try to resolve the conflicts and push in any case.

@gatorsmile
Member

Also cc @ueshin and @squito

@dilipbiswal dilipbiswal force-pushed the parquet-time-millis branch from c376b4e to 796b6b1 on March 28, 2017 at 08:04
@SparkQA

SparkQA commented Mar 28, 2017

Test build #75306 has finished for PR 15332 at commit 796b6b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@ueshin ueshin left a comment


LGTM, except for minor comments.

writeLegacyParquetFormat = true)

testSchema(
"Timestmp written and read as INT64 with TIMESTAMP_MILLIS",
Member


nit: Timestmp -> Timestamp

Contributor Author


@ueshin Thanks. Done.


if (us < 0 && (us % MILLIS_PER_SECOND < 0)) {
millis = millis - 1
}
Member


Can't we use Math.floor() here, the same as in millisToDays?
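
Something like this, as a sketch of the suggestion (MILLIS_PER_SECOND is the 1000L divisor from the snippet above):

// Floor division rounds toward negative infinity, so the manual
// "millis - 1" adjustment for pre-1970 timestamps is no longer needed.
def toMillis(us: Long): Long = Math.floor(us.toDouble / MILLIS_PER_SECOND).toLong

// Equivalently, without the round trip through Double:
// Math.floorDiv(us, MILLIS_PER_SECOND)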

Contributor Author


@ueshin Thanks a lot. I have made the change.

@SparkQA

SparkQA commented Mar 29, 2017

Test build #75347 has finished for PR 15332 at commit 93a77d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@squito squito left a comment


I just found a couple of tiny nits. I will defer to others on this change.

I guess we'll want to add something similar to SPARK-12297 for int64 as well eventually, but I don't think they need to go in together, especially as both are in-flight right now.

}

/*
* Converts the timestamp to milliseconds since epoc. In spark timestamp values have microseconds
Contributor


typo: epoch

DecimalType.is64BitDecimalType(column.dataType())) {
defColumn.readLongs(
num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
Contributor


Why this change in indentation? If anything, it looks like it should be indented less than the original.

Contributor Author


@squito Thank you for reviewing. I have fixed the indentation.

@SparkQA

SparkQA commented Mar 29, 2017

Test build #75361 has finished for PR 15332 at commit e2d0182.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@saulshanabrook saulshanabrook mentioned this pull request Apr 2, 2017
@ueshin
Member

ueshin commented Apr 4, 2017

Thanks! Merging to master.

@asfgit asfgit closed this in 3bfb639 Apr 4, 2017
@dilipbiswal
Contributor Author

Thanks a lot @ueshin @viirya @gatorsmile

@cloud-fan
Contributor

Hi @dilipbiswal, do you mind sharing how you generated the testing parquet file? Thanks!

@dilipbiswal
Contributor Author

dilipbiswal commented Nov 9, 2017

@cloud-fan Hi Wenchen, it's been a while, so I am trying my best to recollect. I think once I had the write code implemented in Spark, I used it to produce files. Depending on the data, Parquet uses different encodings (plain or dictionary). I examined the encodings and the data using parquet-tools. I produced two files and used the merge option in the tools to merge them into one file. This is to the best of my recollection :-)

@cloud-fan
Contributor

Great, thanks!

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018

Author: Dilip Biswal <[email protected]>

Closes apache#15332 from dilipbiswal/parquet-time-millis.