[SPARK-18936][SQL] Infrastructure for session local timezone support. #16308
Conversation
I'd like to discuss the boundary of session local timezone. I assumed it affects only `DataFrame` operations. What do you think?
Can you document the semantics of time zones and when they are used?
Test build #70242 has finished for PR 16308 at commit
Test build #70244 has finished for PR 16308 at commit
Test build #70254 has finished for PR 16308 at commit
```scala
override lazy val resolved: Boolean =
  childrenResolved && checkInputDataTypes().isSuccess && timeZoneResolved

override def withTimeZone(zoneId: String): TimeZoneAwareExpression = copy(zoneId = zoneId)
```
this is just a copy ctor isn't it? Maybe no need to add this? Not a big deal though.
`copy(zoneId = zoneId)`
Yes, this is a copy ctor, but the analyzer ResolveTimeZone can't call the copy ctor because it doesn't know the actual expression class.
```scala
/**
 * Common base class for time zone aware expressions.
 */
trait TimeZoneAwareExpression extends Expression {
```
Is the reason you are using null rather than Option to avoid a bunch of gets?
Yes, I wanted to avoid a bunch of gets.
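Putting this hunk together with the `zoneId`/`withTimeZone` fragments shown further down in this review, the trait under discussion is roughly the following sketch (assembled from the review fragments, not copied from the merged code):

```scala
/**
 * Common base class for time zone aware expressions.
 */
trait TimeZoneAwareExpression extends Expression {
  // null (rather than Option[String]) until the analyzer fills it in,
  // specifically to avoid a bunch of .get calls at every use site.
  def zoneId: String

  def timeZoneResolved: Boolean = zoneId != null

  // Abstract because the analyzer (ResolveTimeZone) doesn't know the
  // concrete expression class and so can't call the copy constructor.
  def withTimeZone(zoneId: String): TimeZoneAwareExpression
}
```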
```scala
  case _ => false
}

def needTimeZone(from: DataType, to: DataType): Boolean = (from, to) match {
```
I think it's important to document this...
I see, I'll add a document.
```diff
       10
   """)
-case class Cast(child: Expression, dataType: DataType) extends UnaryExpression with NullIntolerant {
+case class Cast(child: Expression, dataType: DataType, zoneId: String = null)
```
not 100% sure whether this is a good idea, but should we consider adding a Cast.unapply that does not match on zoneId?
Also we should add classdoc to explain what zoneId is. I'd probably call it timeZoneId.
Maybe an extra unapply is a bad idea, since then we can miss a pattern match.
I agree that an extra unapply is a bad idea. I'll leave it as it is for now.
I'd rename all the zoneId to timeZoneId to reduce confusion.
```diff
 case e if !e.resolved => u
 case g: Generator => MultiAlias(g, Nil)
-case c @ Cast(ne: NamedExpression, _) => Alias(c, ne.name)()
+case c @ Cast(ne: NamedExpression, _, _) => Alias(c, ne.name)()
```
if we add a Cast.unapply that returns only the first two arguments, we can reduce a lot of the cast match changes. Not sure if it is worth it though.
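For concreteness, the idea would have looked roughly like the following sketch (the extractor and its name are an illustration, not code from the PR); the danger raised above is that matches using it silently ignore the timezone field:

```scala
// Hypothetical extractor matching a Cast on only (child, dataType), so
// call sites that don't care about the timezone could skip the extra `_`.
object CastIgnoringTimeZone {
  def unapply(c: Cast): Option[(Expression, DataType)] =
    Some((c.child, c.dataType))
}

// usage: case c @ CastIgnoringTimeZone(ne: NamedExpression, _) => ...
```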
```diff
 InternalRow.fromSeq(partitionSchema.map { field =>
-  Cast(Literal(spec(field.name)), field.dataType).eval()
+  Cast(Literal(spec(field.name)), field.dataType,
+    DateTimeUtils.defaultTimeZone().getID).eval()
```
Could this change the behavior of how we interpret partition values when timezone settings change?
Currently the behavior doesn't change with the timezone setting, i.e. it uses the system timezone.
This is a part where I was not sure how we should handle the partition values: use the timezone settings or the system timezone.
Should we use the timezone settings?
Hmm, now I think we should use the timezone settings for partition values, because the values are also part of the data, so they should be affected by the settings.
```scala
case CurrentDate(tz) =>
  currentDates.getOrElseUpdate(tz, {
    val dateExpr = CurrentDate(tz)
    Literal.create(dateExpr.eval(EmptyRow), dateExpr.dataType)
```
this can technically return different absolute time values for dates, can't it?
Good catch, I'll modify this.
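One way to fix it inside the rule shown above, sketched under two assumptions: that `CurrentDate` carries an `Option[String]` timezone id, and that `DateTimeUtils` gains the timezone-taking `millisToDays` overload this PR describes. The key point is to read the clock once and derive every per-timezone literal from that single instant:

```scala
import java.util.TimeZone
import scala.collection.mutable
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types.DateType

// Read the wall clock exactly once per query...
val currentTimeMillis = System.currentTimeMillis()
val currentDates = mutable.Map.empty[String, Literal]

plan transformAllExpressions {
  // ...so that CurrentDate literals for different timezones can no
  // longer observe different absolute times.
  case CurrentDate(Some(tz)) =>
    currentDates.getOrElseUpdate(tz, Literal.create(
      DateTimeUtils.millisToDays(currentTimeMillis, TimeZone.getTimeZone(tz)),
      DateType))
}
```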
```scala
def timeZoneResolved: Boolean = zoneId != null

def withTimeZone(zoneId: String): TimeZoneAwareExpression
```
Should this be a lazy val? Otherwise it is pretty expensive to keep creating a new timezone object (or doing a lookup) per row in the interpreted path.
You are right. I'll use lazy val.
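Concretely, something like this inside `TimeZoneAwareExpression` (a sketch; the `@transient` is an assumption so the cached object isn't dragged into serialization):

```scala
// Resolve the TimeZone object once per expression instance instead of
// doing a TimeZone.getTimeZone lookup per row in the interpreted path.
@transient lazy val timeZone: TimeZone = TimeZone.getTimeZone(zoneId)
```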
ueshin
left a comment
@rxin Thank you for your review.
I'll address your comments soon.
Test build #71837 has finished for PR 16308 at commit
| // "2016-01-01 08:00:00" | ||
| checkAnswer( | ||
| df.select("t").filter($"t" <= "2016-01-01 00:00:00"), | ||
| Row(Timestamp.valueOf("2015-12-31 16:00:00"))) |
this shows that it will be very confusing if the session local timezone is different from the JVM default timezone in the driver...
Ah, I see, let me modify it.
```scala
// |   | df                  | timestamp   | date_format         |
// +---+---------------------+-------------+---------------------+
// | a | 16533               |1428476400000|"2015-04-08 00:00:00"|
// | b |"2015-04-08 13:10:15"|1428523815000|"2015-04-08 13:10:15"|
```
I'm a little confused: `d` is already a Date, so how can we get the time info back after converting the date to a string?
Do you mean you are wondering why sdf.format(d) has the time info 13:10:15?
If so, java.sql.Date DOES have the time info if it was initialized with the constructor Date(long date); and even if it was initialized with the constructor Date(int year, int month, int day) or with Date.valueOf(String s), it has the time info 00:00:00 of the day in the timezone TimeZone.getDefault().
```scala
scala> TimeZone.setDefault(TimeZone.getTimeZone("GMT"))

scala> val gmtDate = Date.valueOf("2017-01-24")
gmtDate: java.sql.Date = 2017-01-24

scala> val gmtTime = gmtDate.getTime
gmtTime: Long = 1485216000000

scala> TimeZone.setDefault(TimeZone.getTimeZone("PST"))

scala> val pstDate = Date.valueOf("2017-01-24")
pstDate: java.sql.Date = 2017-01-24

scala> val pstTime = pstDate.getTime
pstTime: Long = 1485244800000

scala> val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
sdf: java.text.SimpleDateFormat = java.text.SimpleDateFormat@4f76f1a0

scala> sdf.setTimeZone(TimeZone.getTimeZone("GMT"))

scala> sdf.format(gmtTime)
res12: String = 2017-01-24 00:00:00

scala> sdf.format(pstTime)
res13: String = 2017-01-24 08:00:00

scala> val d = new Date(sdf.parse("2015-04-08 13:10:15").getTime)
d: java.sql.Date = 2015-04-08

scala> sdf.format(d)
res14: String = 2015-04-08 13:10:15
```
```scala
    Row(ts1.getTime / 1000L), Row(ts2.getTime / 1000L)))
}

test("to_unix_timestamp with session local timezone") {
```
These newly added tests are so similar that they all try to prove one thing: when you convert string or date to timestamp, the result changes according to the session local timezone. When you convert timestamp to string or date, the result also changes with the session local timezone. All time-related expressions should respect this.
Shall we just write a general test for this? Then we don't need so many similar tests.
I agree that there are many similar tests, but I have no idea how to generalize them.
Would you please give me some code snippets? Then I'll be able to expand them.
I don't think we need to add tests in this file at all. We should improve DateTimeUtilsSuite to make sure the newly added methods work well with different timezones, e.g. getHours, daysToMillis, etc. Then we should make sure these timezone-aware expressions call the newly added methods in DateTimeUtils that take a timezone parameter (we can remove the old versions that don't take a timezone parameter after we finish handling partition values).
This suite is an end-to-end test, and it's very annoying if we want to test all changed expressions; we should write more low-level tests in DateTimeUtilsSuite.
The problem is that, except for this suite, all the changes you made to the tests just fix existing tests to fit the timezone stuff. You added all the new tests in this suite as end-to-end tests, which is not good. We should add new tests in DateTimeUtilsSuite as unit tests.
Ah, I see! I'll move tests to DateTimeUtilsSuite soon. Thanks a lot!
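For reference, a minimal sketch of the kind of low-level DateTimeUtilsSuite test being suggested, assuming the timezone-taking `getHours` overload described in this PR:

```scala
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils

test("getHours respects the given timezone") {
  // 2016-01-01 00:00:00 GMT, in microseconds since the epoch.
  val micros = 1451606400000000L
  assert(DateTimeUtils.getHours(micros, TimeZone.getTimeZone("GMT")) === 0)
  // The same instant is 2015-12-31 16:00:00 in PST.
  assert(DateTimeUtils.getHours(micros, TimeZone.getTimeZone("PST")) === 16)
}
```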
Test build #71854 has finished for PR 16308 at commit
Test build #71983 has finished for PR 16308 at commit
hvanhovell
left a comment
The patch is in good shape. I left a few small comments, let's try to get this in ASAP.
```diff
 private lazy val constFormat: UTF8String = right.eval().asInstanceOf[UTF8String]
 private lazy val formatter: SimpleDateFormat =
-  Try(new SimpleDateFormat(constFormat.toString, Locale.US)).getOrElse(null)
+  Try {
```
Nit: just use try-catch...
I see that this pattern is used quite often. Should we put it in a method?
I see, I'll replace the Trys with try-catch and add a method to DateTimeUtils that creates a SimpleDateFormat from a format string and a timezone.
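Something along these lines, sketched with a hypothetical name for the DateTimeUtils helper:

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Build a SimpleDateFormat for a given pattern and timezone in one place,
// instead of repeating the construct-then-setTimeZone dance at call sites.
def newDateFormat(formatString: String, timeZone: TimeZone): SimpleDateFormat = {
  val sdf = new SimpleDateFormat(formatString, Locale.US)
  sdf.setTimeZone(timeZone)
  // Strict parsing: reject malformed input instead of rolling it over.
  sdf.setLenient(false)
  sdf
}
```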
```diff
   Divide(newAggExpr, Literal.create(math.pow(10.0, scale), DoubleType)),
-  DecimalType(prec + 4, scale + 4))
+  DecimalType(prec + 4, scale + 4),
+  Option(conf.sessionLocalTimeZone))
```
NIT: why a new line?
I'll remove it.
```diff
   Divide(newAggExpr, Literal.create(math.pow(10.0, scale), DoubleType)),
-  DecimalType(prec + 4, scale + 4))
+  DecimalType(prec + 4, scale + 4),
+  Option(conf.sessionLocalTimeZone))
```
NIT: why a new line?
I'll remove it.
```scala
checkEvaluation(Cast(Literal("20150318"), TimestampType), null)
checkEvaluation(Cast(Literal("2015-031-8"), TimestampType), null)
checkEvaluation(Cast(Literal("2015-03-18T12:03:17-0:70"), TimestampType), null)
for (tz <- ALL_TIMEZONES) {
```
Can we try to parameterize this a little bit more? I know you didn't write it, but it is quite hard to get through.
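One possible shape for that parameterization, sketched with a hypothetical helper (and assuming the `Option[String]` timezone parameter on `Cast` seen elsewhere in this diff):

```scala
// Run the same malformed-input assertions once per timezone, so adding a
// new case is one line instead of one copy per timezone.
def checkInvalidTimestampCasts(tz: TimeZone): Unit = {
  Seq("20150318", "2015-031-8", "2015-03-18T12:03:17-0:70").foreach { s =>
    checkEvaluation(Cast(Literal(s), TimestampType, Option(tz.getID)), null)
  }
}

ALL_TIMEZONES.foreach(checkInvalidTimestampCasts)
```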
```scala
val hasMoreData = takeResult.length > numRows
val data = takeResult.take(numRows)

lazy val timeZone = TimeZone.getTimeZone(sparkSession.sessionState.conf.sessionLocalTimeZone)
```
What happens if a user changes the session timezone? What would be the preferred behavior? Currently show() generates the same result every time, but that might be unexpected.
```scala
c.set(Calendar.MILLISECOND, 123)
assert(stringToTimestamp(
  UTF8String.fromString("2015-03-18T12:03:17.123121+7:30")).get ===
for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
```
Can you also try to parameterize these tests?
Test build #72012 has started for PR 16308 at commit
Jenkins, retest this please.
LGTM - pending jenkins.
Test build #72019 has finished for PR 16308 at commit
Merging to master. Thanks for the hard work, and the patience with the review process. Can you open follow-up PRs for any remaining issues?
Sure! I'll send follow-up PRs as soon as possible.
## What changes were proposed in this pull request?
As of Spark 2.1, Spark SQL assumes the machine timezone for datetime manipulation, which is bad if users are not in the same timezones as the machines, or if different users have different timezones.
We should introduce a session local timezone setting that is used for execution.
An explicit non-goal is locale handling.
### Semantics
Setting the session local timezone means that the timezone-aware expressions listed below should use the timezone to evaluate values, and that it should also be used to convert (cast) between string and timestamp or between timestamp and date.
- `CurrentDate`
- `CurrentBatchTimestamp`
- `Hour`
- `Minute`
- `Second`
- `DateFormatClass`
- `ToUnixTimestamp`
- `UnixTimestamp`
- `FromUnixTime`
and the expressions below are implicitly timezone-aware through a cast from timestamp to date:
- `DayOfYear`
- `Year`
- `Quarter`
- `Month`
- `DayOfMonth`
- `WeekOfYear`
- `LastDay`
- `NextDay`
- `TruncDate`
For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values evaluated by some of the timezone-aware expressions are:
```scala
scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df.selectExpr("cast(ts as string)", "year(ts)", "month(ts)", "dayofmonth(ts)", "hour(ts)", "minute(ts)", "second(ts)").show(truncate = false)
+-------------------+----------------------+-----------------------+----------------------------+--------+----------+----------+
|ts |year(CAST(ts AS DATE))|month(CAST(ts AS DATE))|dayofmonth(CAST(ts AS DATE))|hour(ts)|minute(ts)|second(ts)|
+-------------------+----------------------+-----------------------+----------------------------+--------+----------+----------+
|2016-01-01 00:00:00|2016 |1 |1 |0 |0 |0 |
+-------------------+----------------------+-----------------------+----------------------------+--------+----------+----------+
```
whereas with the session local timezone set to `"PST"`, they are:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "PST")
scala> df.selectExpr("cast(ts as string)", "year(ts)", "month(ts)", "dayofmonth(ts)", "hour(ts)", "minute(ts)", "second(ts)").show(truncate = false)
+-------------------+----------------------+-----------------------+----------------------------+--------+----------+----------+
|ts |year(CAST(ts AS DATE))|month(CAST(ts AS DATE))|dayofmonth(CAST(ts AS DATE))|hour(ts)|minute(ts)|second(ts)|
+-------------------+----------------------+-----------------------+----------------------------+--------+----------+----------+
|2015-12-31 16:00:00|2015 |12 |31 |16 |0 |0 |
+-------------------+----------------------+-----------------------+----------------------------+--------+----------+----------+
```
Notice that even if you set the session local timezone, it affects only `DataFrame` operations, not `Dataset` operations, `RDD` operations, or `ScalaUDF`s. You need to handle timezones properly yourself.
### Design of the fix
I introduced an analyzer rule to pass the session local timezone to timezone-aware expressions and modified `DateTimeUtils` to take a timezone argument.
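A minimal sketch of what such a rule could look like, using the `withTimeZone`/`timeZoneResolved` members discussed in the review (the body is an illustration, not the merged code):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

case class ResolveTimeZone(conf: SQLConf) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    // Fill in the session local timezone for any timezone-aware
    // expression that hasn't been given one yet.
    case e: TimeZoneAwareExpression if !e.timeZoneResolved =>
      e.withTimeZone(conf.sessionLocalTimeZone)
  }
}
```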
## How was this patch tested?
Existing tests and added tests for timezone aware expressions.
Author: Takuya UESHIN <[email protected]>
Closes apache#16308 from ueshin/issues/SPARK-18350.
## What changes were proposed in this pull request?
This is a follow-up PR of #16308. This PR enables timezone support in CSV/JSON parsing.
We should introduce a `timeZone` option for the CSV/JSON datasources (the default value of the option is the session local timezone). The datasources should use the `timeZone` option to format/parse timestamp values when writing/reading. Notice that while reading, if the timestampFormat has timezone info, the timezone option will not be used because we should respect the timezone in the values.
For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option, which is `"GMT"` because the session local timezone is `"GMT"` here, are:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "GMT")

scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+

scala> df.write.json("/path/to/gmtjson")
```
```sh
$ cat /path/to/gmtjson/part-*
{"ts":"2016-01-01T00:00:00.000Z"}
```
whereas setting the option to `"PST"`, they are:
```scala
scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
```
```sh
$ cat /path/to/pstjson/part-*
{"ts":"2015-12-31T16:00:00.000-08:00"}
```
We can properly read these files even if the timezone option is wrong because the timestamp values have timezone info:
```scala
scala> val schema = new StructType().add("ts", TimestampType)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))

scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+

scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
And even if the `timestampFormat` doesn't contain timezone info, we can properly read the values by setting the correct timezone option:
```scala
scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
```
```sh
$ cat /path/to/jstjson/part-*
{"ts":"2016-01-01T09:00:00"}
```
```scala
// wrong result
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 09:00:00|
+-------------------+

// correct result
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
This PR also makes `JsonToStruct` and `StructToJson` `TimeZoneAwareExpression`s so that they can evaluate values with the timezone option.

## How was this patch tested?
Existing tests and added some tests.

Author: Takuya UESHIN <[email protected]>

Closes #16750 from ueshin/issues/SPARK-18937.
## What changes were proposed in this pull request?
This is a follow-up PR of apache#16308 and apache#16750. This PR enables timezone support in partition values.
We should use the `timeZone` option introduced at apache#16750 to parse/format partition values of the `TimestampType`.
For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT` which will be used for partition values, the values written with the default timezone option, which is `"GMT"` because the session local timezone is `"GMT"` here, are:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "GMT")

scala> val df = Seq((1, new java.sql.Timestamp(1451606400000L))).toDF("i", "ts")
df: org.apache.spark.sql.DataFrame = [i: int, ts: timestamp]

scala> df.show()
+---+-------------------+
|  i|                 ts|
+---+-------------------+
|  1|2016-01-01 00:00:00|
+---+-------------------+

scala> df.write.partitionBy("ts").save("/path/to/gmtpartition")
```
```sh
$ ls /path/to/gmtpartition/
_SUCCESS  ts=2016-01-01 00%3A00%3A00
```
whereas setting the option to `"PST"`, they are:
```scala
scala> df.write.option("timeZone", "PST").partitionBy("ts").save("/path/to/pstpartition")
```
```sh
$ ls /path/to/pstpartition/
_SUCCESS  ts=2015-12-31 16%3A00%3A00
```
We can properly read the partition values if the session local timezone and the timezone of the partition values are the same:
```scala
scala> spark.read.load("/path/to/gmtpartition").show()
+---+-------------------+
|  i|                 ts|
+---+-------------------+
|  1|2016-01-01 00:00:00|
+---+-------------------+
```
And even if the timezones are different, we can properly read the values by setting the correct timezone option:
```scala
// wrong result
scala> spark.read.load("/path/to/pstpartition").show()
+---+-------------------+
|  i|                 ts|
+---+-------------------+
|  1|2015-12-31 16:00:00|
+---+-------------------+

// correct result
scala> spark.read.option("timeZone", "PST").load("/path/to/pstpartition").show()
+---+-------------------+
|  i|                 ts|
+---+-------------------+
|  1|2016-01-01 00:00:00|
+---+-------------------+
```

## How was this patch tested?
Existing tests and added some tests.

Author: Takuya UESHIN <[email protected]>

Closes apache#17053 from ueshin/issues/SPARK-18939.
@ueshin Can you please explain to me whether this functionality is the same as blessing Spark with "Timestamp with Time Zone"? If not, how is it different?