Commit 06cfe9d

Adds comments about TimestampType handling
1 parent a099d3e commit 06cfe9d

2 files changed: +19 -3 lines changed

sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala

Lines changed: 1 addition & 0 deletions
@@ -146,6 +146,7 @@ private[parquet] class CatalystRowConverter(
         new CatalystStringConverter(updater)
 
       case TimestampType =>
+        // TODO Implements `TIMESTAMP_MICROS` once parquet-mr has that.
         new PrimitiveConverter {
           override def addBinary(value: Binary): Unit = {
             assert(

sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala

Lines changed: 18 additions & 3 deletions
@@ -358,9 +358,24 @@ private[parquet] class CatalystSchemaConverter(
       case DateType =>
         Types.primitive(INT32, repetition).as(DATE).named(field.name)
 
-      // NOTE: !! This timestamp type is not specified in Parquet format spec !!
-      // However, Impala and older versions of Spark SQL use INT96 to store timestamps with
-      // nanosecond precision (not TIME_MILLIS or TIMESTAMP_MILLIS described in the spec).
+      // NOTE: Spark SQL TimestampType is NOT a well defined type in Parquet format spec.
+      //
+      // As stated in PARQUET-323, Parquet `INT96` was originally introduced to represent nanosecond
+      // timestamp in Impala for some historical reasons, it's not recommended to be used for any
+      // other types and will probably be deprecated in future Parquet format spec. That's the
+      // reason why Parquet format spec only defines `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS` which
+      // are both logical types annotating `INT64`.
+      //
+      // Originally, Spark SQL uses the same nanosecond timestamp type as Impala and Hive. Starting
+      // from Spark 1.5.0, we resort to a timestamp type with 100 ns precision so that we can store
+      // a timestamp into a `Long`. This design decision is subject to change though, for example,
+      // we may resort to microsecond precision in the future.
+      //
+      // For Parquet, we plan to write all `TimestampType` value as `TIMESTAMP_MICROS`, but it's
+      // currently not implemented yet because parquet-mr 1.7.0 (the version we're currently using)
+      // hasn't implemented `TIMESTAMP_MICROS` yet.
+      //
+      // TODO Implements `TIMESTAMP_MICROS` once parquet-mr has that.
       case TimestampType =>
         Types.primitive(INT96, repetition).named(field.name)
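
For readers unfamiliar with the two encodings the new comment contrasts, the following is a minimal sketch, not code from this commit. It assumes the Impala-style `INT96` layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian) and assumes that Julian day 2440588 with zero nanoseconds corresponds to 1970-01-01T00:00:00. The object name `Int96TimestampSketch` and its methods are hypothetical and only illustrate how such a value could be reduced to the single `Long` of microseconds that a `TIMESTAMP_MICROS`-annotated `INT64` would hold.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper for illustration only; not part of Spark or parquet-mr.
object Int96TimestampSketch {
  // Assumption: Julian day number that writers use for 1970-01-01.
  val JulianDayOfEpoch: Long = 2440588L
  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  /** Packs (julianDay, nanosOfDay) into the assumed 12-byte little-endian layout. */
  def encode(julianDay: Int, nanosOfDay: Long): Array[Byte] =
    ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
      .putLong(nanosOfDay) // bytes 0-7: nanoseconds within the day
      .putInt(julianDay)   // bytes 8-11: Julian day number
      .array()

  /** Decodes to microseconds since the Unix epoch, truncating sub-microsecond digits. */
  def toEpochMicros(bytes: Array[Byte]): Long = {
    require(bytes.length == 12, s"expected 12 bytes, got ${bytes.length}")
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanosOfDay = buf.getLong()
    val julianDay = buf.getInt()
    (julianDay - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000L
  }
}
```

The truncating division by 1000 shows why moving from nanosecond `INT96` to a microsecond `INT64` trades precision for a value that fits in one `Long`.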
