[SPARK-29012][SQL] Support special timestamp values #25716
Conversation
Test build #110256 has finished for PR 25716 at commit

This reverts commit ad23507.

Test build #110286 has finished for PR 25716 at commit

Test build #110288 has finished for PR 25716 at commit
-- !query 16 schema
struct<64:string,d1:timestamp>
-- !query 16 output
1969-12-31 16:00:00
This is the epoch in UTC (1970-01-01 00:00:00) displayed in the local time zone.
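For reference, a minimal sketch (plain java.time, not Spark's own formatter) showing how the epoch instant renders in America/Los_Angeles:

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// America/Los_Angeles is UTC-8 at the epoch, so 1970-01-01 00:00:00Z
// prints as 1969-12-31 16:00:00 local time.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  .withZone(ZoneId.of("America/Los_Angeles"))
println(fmt.format(Instant.EPOCH)) // 1969-12-31 16:00:00
```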
To print 1970-01-01 00:00:00 here, better to set a config for pgSQL tests in
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala#L301 ?
I can set UTC globally but ... if we know the reason for this, should we do that?
And I am afraid it won't be enough. Need to set system time zone as well:
spark/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
Lines 552 to 553 in 7cc0f0e
// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
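A hedged sketch of what setting both knobs could look like in a test (the SQLConf key is real; the local session setup is an assumption, not the PR's actual change):

```scala
import java.util.TimeZone
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

val spark = SparkSession.builder().master("local").getOrCreate()

// Keep the JVM default zone and the session zone in sync; if they differ,
// timestamp formatting and parsing can disagree by the zone offset.
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
spark.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, "UTC")
```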
NVM. I think it's OK as it is. But better to leave some comments in timestamp.sql about why.
@dongjoon-hyun @cloud-fan @maropu Could you take a look at this when you have time?
Test build #110357 has finished for PR 25716 at commit
 * @return Some of microseconds since the epoch if the conversion completed
 *         successfully, otherwise None.
 */
def convertSpecialTimestamp(input: String, zoneId: ZoneId): Option[SQLTimestamp] = {
What's different from convertSpecialDate? I know the output dataType is different, but is the way to handle these special values different, too?
https://github.com/apache/spark/pull/25708/files#diff-da60f07e1826788aaeb07f295fae4b8aR866
Can we share some code between them?
I have extracted common code there https://github.com/apache/spark/pull/25716/files#diff-da60f07e1826788aaeb07f295fae4b8aR864-R890
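A hedged sketch of what such a shared helper could look like (names and signature are illustrative, not the exact code in DateTimeUtils):

```scala
import java.time.ZoneId
import scala.util.Try

object SpecialValues {
  // Splits e.g. "epoch UTC" into the keyword and a resolved zone id,
  // falling back to the session zone when no zone is given. Both the
  // date and the timestamp converters can dispatch on the keyword.
  private val specialValueRe = """(\p{Alpha}+)\p{Blank}*(.*)""".r

  def extractSpecialValue(input: String, sessionZone: ZoneId): Option[(String, ZoneId)] =
    input match {
      case specialValueRe(word, tz) =>
        val zone = if (tz.isEmpty) Some(sessionZone) else Try(ZoneId.of(tz)).toOption
        zone.map(z => (word.toLowerCase, z))
      case _ => None
    }
}
```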
/**
 * Converts notational shorthands that are converted to ordinary timestamps.
 * @param input - a trimmed string
How about checking that the input is trimmed with an assert?
I will add the assert:
assert(input.trim.length == input.length)
private def convertSpecialTimestamp(bytes: Array[Byte], zoneId: ZoneId): Option[SQLTimestamp] = {
Why did you use Array[Byte] instead of UTF8String?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I need a String inside extractSpecialValue, and UTF8String.toString converts UTF8String to String via Array[Byte]. Why should we convert the same string to bytes twice?
Ur, I see.
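A hedged sketch of the two conversion paths being compared (UTF8String.fromString, toString, and getBytes are real APIs):

```scala
import org.apache.spark.unsafe.types.UTF8String

val u: UTF8String = UTF8String.fromString("epoch")
// toString decodes the backing UTF-8 bytes into a java.lang.String,
// so going UTF8String -> String touches the bytes anyway.
val asString: String = u.toString
// Accepting Array[Byte] in convertSpecialTimestamp skips building an
// intermediate UTF8String just to read its bytes back.
val asBytes: Array[Byte] = u.getBytes
```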
var currentSegmentValue = 0
val bytes = s.trim.getBytes
val specialTimestamp = convertSpecialTimestamp(bytes, timeZoneId)
if (specialTimestamp.isDefined) return specialTimestamp
Can we avoid using return here?
Why?
I'm not 100% sure about the bytecode for this though; is there no overhead in using return?
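For comparison, a hedged return-free variant using Option.orElse (the stub helpers stand in for the real DateTimeUtils methods and are assumptions):

```scala
import java.time.ZoneId

// Stubs standing in for the real parsing methods (assumptions).
def convertSpecialTimestamp(bytes: Array[Byte], zoneId: ZoneId): Option[Long] = None
def parseRegularTimestamp(s: String, zoneId: ZoneId): Option[Long] = None

// orElse evaluates its argument lazily, so the regular parsing path only
// runs when no special value matched: same short-circuit, no return.
def stringToTimestamp(s: String, zoneId: ZoneId): Option[Long] =
  convertSpecialTimestamp(s.trim.getBytes, zoneId)
    .orElse(parseRegularTimestamp(s, zoneId))
```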
import java.util.Locale
import java.util.concurrent.TimeUnit.SECONDS

import DateTimeUtils.{convertSpecialTimestamp}
nit: remove the braces: import DateTimeUtils.convertSpecialTimestamp
Instant.now().atZone(zoneId).`with`(LocalTime.MIDNIGHT)
}

private val specialValue = """(EPOCH|NOW|TODAY|TOMORROW|YESTERDAY)\p{Blank}*(.*)""".r
Should the description for the supported special values in this PR be:
- `epoch [zoneId]` - 1970-01-01 00:00:00+00 (Unix system time zero)
- `today [zoneId]` - midnight today
- `yesterday [zoneId]` - midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time
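A quick hedged check of how the quoted regex splits a value (assuming the input is uppercased before matching, as the alternation suggests):

```scala
val specialValue = """(EPOCH|NOW|TODAY|TOMORROW|YESTERDAY)\p{Blank}*(.*)""".r

"EPOCH UTC" match {
  case specialValue(word, zone) => println(s"word=$word zone=$zone") // word=EPOCH zone=UTC
  case _ => println("not a special value")
}
```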
| assert(toTimestamp("Epoch", zoneId).get === 0) | ||
| val now = instantToMicros(LocalDateTime.now(zoneId).atZone(zoneId).toInstant) | ||
| toTimestamp("NOW", zoneId).get should be (now +- tolerance) |
Can you check illegal cases, e.g., `now CET`?
I have already added the test here https://github.com/apache/spark/pull/25716/files#diff-c5655e947ce2dd3748e4cf95ebc32e8aR580
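A hedged sketch of such a negative test, assuming the toTimestamp helper from the snippet above returns an Option:

```scala
// "now" does not accept a zone id, so the conversion should fail.
assert(toTimestamp("now CET", zoneId).isEmpty)
```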
Thank you for pinging me, @MaxGekk. I support this approach.
Hi, @gatorsmile, @gengliangwang, @cloud-fan.
Test build #110403 has finished for PR 25716 at commit
Well, this feature doesn't really conflict with the current Spark behavior. I think we can proceed with it.
Test build #110408 has finished for PR 25716 at commit
@maropu @dongjoon-hyun Can we continue with the PR, or are we waiting for @gengliangwang's #25697?
How about holding this PR until this weekend for @gengliangwang's work? I personally think we don't have any reason to rush to merge this.
I have some performance-related concerns regarding using the config. In the current implementation, the decision is pretty cheap - just comparing the first byte. If we use the config, we will need to retrieve it and compare its value with another string, which can bring visible overhead even when PostgreSQL compatibility mode is turned off, see https://github.com/apache/spark/pull/25716/files#diff-da60f07e1826788aaeb07f295fae4b8aR223 Are you absolutely sure about using this config in the PR?
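A hedged sketch of the cheap first-byte gate described above (the real DateTimeUtils code may differ):

```scala
// Every special value starts with a letter, while ordinary timestamps start
// with a digit or a sign, so one byte comparison rejects the common case
// before any regex matching or config lookup runs.
def maybeSpecialValue(bytes: Array[Byte]): Boolean =
  bytes.nonEmpty && Character.isAlphabetic(bytes(0))
```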
@dongjoon-hyun @maropu Can you merge this PR? Checking the flag could be added in #25697 itself, or in a separate follow-up PR after #25697 is merged, I believe.
retest this please
@MaxGekk so do you plan to hide this change behind the configuration?
@HyukjinKwon Yes, I do. I would do that in a separate PR as soon as the flag is available in the master branch.
LGTM, but please @MaxGekk make sure to add a configuration later. Otherwise we might have to revert before 3.0.
Test build #110860 has finished for PR 25716 at commit
retest this please
Test build #110888 has finished for PR 25716 at commit
Merged to master.
This PR #25834 hides the feature under the SQL config |
… SQL migration guide

### What changes were proposed in this pull request?
Updated the SQL migration guide regarding the recently supported special date and timestamp values, see #25716 and #25708. Closes #25834

### Why are the changes needed?
To let users know about the new feature in Spark 3.0.

### Does this PR introduce any user-facing change?
No

Closes #25948 from MaxGekk/special-values-migration-guide.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…only

### What changes were proposed in this pull request?
In the PR, I propose to support the special datetime values introduced by #25708 and #25716 only in typed literals, and not to recognize them when parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00` (Unix system time zero)
- `today [zoneId]` - midnight today
- `yesterday [zoneId]` - midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time

For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```

Similarly, the following special date values are supported only in typed date literals:
- `epoch [zoneId]` - `1970-01-01`
- `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`
- `yesterday [zoneId]` - the current date - 1
- `tomorrow [zoneId]` - the current date + 1
- `now` - the date of running the current query; it has the same notion as `today`

For example:
```sql
spark-sql> SELECT date 'tomorrow' - date 'yesterday';
2
```

### Why are the changes needed?
In the current implementation, Spark supports the special date/timestamp values in any input string cast to dates/timestamps, which leads to the following problems:
- If executors have different system time, the result is inconsistent and random. Column values depend on where the conversions were performed.
- The special values play the role of distributed non-deterministic functions, though users might think of the values as constants.

### Does this PR introduce _any_ user-facing change?
Yes, but the probability should be small.

### How was this patch tested?
By running existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

Closes #32714 from MaxGekk/remove-datetime-special-values.

Lead-authored-by: Max Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
What changes were proposed in this pull request?
Supported special string values for the TIMESTAMP type. They are simply notational shorthands that will be converted to ordinary timestamp values when read. The following string values are supported:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00` (Unix system time zero)
- `today [zoneId]` - midnight today
- `yesterday [zoneId]` - midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time

For example:
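A minimal hedged illustration through the Scala API (`spark` is an assumed active SparkSession; the printed value depends on the run date):

```scala
// Querying a special timestamp value; 'tomorrow' resolves to midnight of
// the next day in the session time zone at query start.
spark.sql("SELECT timestamp 'tomorrow'").show(truncate = false)
```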
Why are the changes needed?
To maintain feature parity with PostgreSQL, see 8.5.1.4. Special Values
Does this PR introduce any user-facing change?
Previously, the parser failed on the special values with an error. After the changes, the special values are converted to appropriate timestamps.
How was this patch tested?
- Added tests to `TimestampFormatterSuite` to check parsing special values from regular strings.
- Added tests to `DateTimeUtilsSuite` to check parsing those values from `UTF8String`.
- Added new test cases to `timestamp.sql`.