Conversation

@yaooqinn (Member) commented May 27, 2020

What changes were proposed in this pull request?

Currently, date_format, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp, and to_date have different exception-handling behavior when formatting datetime values.

In this PR, we apply the exception-handling behavior of date_format to from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp, and to_date.

In the phase of creating the datetime formatter or formatting values, exceptions will be raised.

e.g.

spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa');
20/05/28 15:25:38 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa')]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-aaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA');
20/05/28 15:26:10 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA')]
java.lang.IllegalArgumentException: Illegal pattern character: A
spark-sql> select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd');
20/05/28 15:23:23 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd')]
java.lang.ArrayIndexOutOfBoundsException: 11
	at java.time.format.DateTimeFormatterBuilder$NumberPrinterParser.format(DateTimeFormatterBuilder.java:2568)

In the phase of parsing, DateTimeParseException, DateTimeException, and ParseException will be suppressed, but SparkUpgradeException will still be raised.

e.g.

spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy	exception
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
20/05/28 15:31:15 ERROR SparkSQLDriver: Failed in [select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz")]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2020-01-27T20:06:11.847-0800' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
spark-sql> set spark.sql.legacy.timeParserPolicy=corrected;
spark.sql.legacy.timeParserPolicy	corrected
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
2020-01-28 12:06:11.847
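The legacy/new divergence above comes down to the underlying JDK APIs: SimpleDateFormat's 'z' also accepts RFC 822 offsets such as "-0800" when parsing, while java.time's 'z' expects a time-zone name. A minimal standalone Java sketch (not part of the PR; the class and method names are made up for illustration):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ParserPolicyDemo {
    static final String INPUT = "2020-01-27T20:06:11.847-0800";
    static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSSz";

    // Legacy path (timeParserPolicy=LEGACY): SimpleDateFormat's 'z'
    // also accepts RFC 822 offsets like "-0800", so parsing succeeds.
    static boolean legacyParses() {
        try {
            new SimpleDateFormat(PATTERN).parse(INPUT);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // New path (Spark 3.0 default): java.time's 'z' expects a zone name
    // (e.g. "PST"), so "-0800" is rejected with DateTimeParseException,
    // which Spark reports as SparkUpgradeException under policy EXCEPTION.
    static boolean newParses() {
        try {
            DateTimeFormatter.ofPattern(PATTERN).parse(INPUT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```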

Why are the changes needed?

Consistency

Does this PR introduce any user-facing change?

Yes. Invalid datetime patterns will cause from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp, and to_date to fail instead of returning NULL.

How was this patch tested?

Added more tests.

@yaooqinn (Member Author)

retest this please

@SparkQA commented May 27, 2020

Test build #123160 has finished for PR 28650 at commit f39dd48.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 27, 2020

Test build #123165 has finished for PR 28650 at commit f39dd48.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author)

retest this please

@yaooqinn (Member Author)

cc @cloud-fan thanks.

@yaooqinn (Member Author)

retest this please

@SparkQA commented May 27, 2020

Test build #123182 has finished for PR 28650 at commit f39dd48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 28, 2020

Test build #123213 has finished for PR 28650 at commit 33fedf8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TimestampFormatterHelper extends TimeZoneAwareExpression

}
// Test escaping of format
GenerateUnsafeProjection.generate(FromUnixTime(Literal(0L), Literal("\"quote"), UTC_OPT) :: Nil)
GenerateUnsafeProjection.generate(FromUnixTime(Literal(0L), Literal("\""), UTC_OPT) :: Nil)
Member Author:

`quote` contains invalid format patterns; removing it does not affect the purpose of this test case.

checkEvaluation(
FromUnixTime(Literal(1000L), Literal.create(null, StringType), timeZoneId),
null)
checkEvaluation(
Member Author:

UnixTimestamp(Literal(date1), Literal.create(null, StringType), timeZoneId),
MICROSECONDS.toSeconds(
DateTimeUtils.daysToMicros(DateTimeUtils.fromJavaDate(date1), tz.toZoneId)))
checkEvaluation(
Member Author:

Literal(date1), Literal.create(null, StringType), timeZoneId),
MICROSECONDS.toSeconds(
DateTimeUtils.daysToMicros(DateTimeUtils.fromJavaDate(date1), zid)))
checkEvaluation(
Member Author:

select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'));
select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy'));

select date_format(null, null);
Member Author:

These tests were added because end-to-end tests for the datetime functions were missing so far.

Member Author:

We can add from_utc_timestamp and to_utc_timestamp here later too.

@SparkQA commented May 28, 2020

Test build #123217 has finished for PR 28650 at commit 82bb963.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author)

retest this please

@yaooqinn yaooqinn changed the title [SPARK-31830][SQL] Consistent error handling for datetime formatting functions [SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions May 28, 2020
zoneId = zoneId,
legacyFormat = SIMPLE_DATE_FORMAT,
needVarLengthSecondFraction = isParsing)
formatter.validatePatternString()
Contributor:

Do we need this? It's already done in TimestampFormatter.apply, AFAIK.

Member Author:

TimestampFormatter.apply only works for the new parser. Adding this here to fail the legacy formatter gives us a narrower scope of influence.

Contributor:

Can we call validatePatternString in TimestampFormatter.apply for the legacy parser as well?

Member Author:

Let me push a commit to verify with Jenkins.

Member Author:

This looks OK through Jenkins.

@SparkQA commented May 28, 2020

Test build #123230 has finished for PR 28650 at commit ce2eff0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL

Why don't we throw SparkUpgradeException in this case?

@yaooqinn (Member Author) commented May 28, 2020

spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL

Why don't we throw SparkUpgradeException in this case?

That logic now belongs to parse() only.

@cloud-fan (Contributor)

So we don't fail when we construct the formatter with yyyyyyyyyyy-MM-dd? Then I think this PR doesn't help either.

@yaooqinn (Member Author) commented May 28, 2020

No, the pattern yyyyyyyyyyy-MM-dd is valid for both versions of the formatters, but calling format() throws an exception in the new one, which was silently suppressed in the FromUnixTime expression. Since this PR no longer suppresses those exceptions, from_unixtime will behave the same as date_format:

spark-sql>  select from_unixtime(1, 'yyyyyyyyyyy-MM-dd'); -- this is before
NULL
spark-sql>  select date_format('now', 'yyyyyyyyyyy-MM-dd'); -- this will be after for `from_unixtime` too
20/05/29 00:14:47 ERROR SparkSQLDriver: Failed in [ select date_format('now', 'yyyyyyyyyyy-MM-dd')]
java.lang.ArrayIndexOutOfBoundsException: 11
	at java.time.format.DateTimeFormatterBuilder$NumberPrinterParser.format(DateTimeFormatterBuilder.java:2568)
	at java.time.format.DateTimeFormatterBuilder$CompositePrinterParser.format(DateTimeFormatterBuilder.java:2190)
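The valid-at-construction-but-fails-at-format behavior can be reproduced with java.time alone. A standalone sketch (the class name is hypothetical; the exact failure mode inside the JDK's NumberPrinterParser may vary by JDK version, so this is hedged as an illustration of the trace above):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class FormatPhaseDemo {
    // Eleven 'y's: the pattern is accepted when the formatter is built...
    static DateTimeFormatter build() {
        return DateTimeFormatter.ofPattern("yyyyyyyyyyy-MM-dd");
    }

    // ...but format() fails at runtime (ArrayIndexOutOfBoundsException in
    // NumberPrinterParser on the JDK shown in the stack trace above).
    // Before this PR, FromUnixTime swallowed such errors and returned NULL;
    // with this PR they surface, matching date_format.
    static boolean formatThrows() {
        try {
            build().format(LocalDate.of(1, 1, 1));
            return false;
        } catch (RuntimeException e) {  // AIOOBE is a RuntimeException
            return true;
        }
    }
}
```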

@yaooqinn (Member Author)

But maybe we should apply SparkUpgradeException to format() as well as parse(), for a better error message for end users.

@SparkQA commented May 28, 2020

Test build #123239 has finished for PR 28650 at commit 0680855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

But maybe we should apply SparkUpgradeException to format() as well as parse(), for a better error message for end users.

Yea, can you open a PR for it? We should merge that to 3.0.

@yaooqinn (Member Author)

OK

@cloud-fan (Contributor)

Since this PR is for master only, let's fix the format() first, in case this PR introduces conflicts.

@SparkQA commented Jun 9, 2020

Test build #123674 has finished for PR 28650 at commit 2e940bd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get)
}

trait TimestampFormatterHelper extends TimeZoneAwareExpression {
Contributor:

Do we have expressions that create DateFormatter?

Member Author:

No datetime functions, only the CSV/JSON ones.

ToUnixTimestamp(Literal("1"), Literal(c.toString)), "3.0")
checkExceptionInExpression[SparkUpgradeException](
UnixTimestamp(Literal("1"), Literal(c.toString)), "3.0")
def checkException[T <: Exception : ClassTag](c: String, onlyParsing: Boolean = false): Unit = {
Contributor:

Do we need the onlyParsing flag? We can get it via Seq('E', 'F', 'q', 'Q').contains(c).

Row(null), Row(null), Row(null), Row(null)))
val invalid = df1.selectExpr(s"to_unix_timestamp(x, 'yyyy-MM-dd bb:HH:ss')")
val e = intercept[IllegalArgumentException](invalid.collect())
assert(e.getMessage.contains('b'))
Contributor:

Can we check for more in the error message?

Member Author:

Just pick the intersection of "Illegal pattern character 'b'" (legacy) and "Unknown pattern letter: b" (new).
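For reference, both JDK formatters reject the 'b' letter at construction with an IllegalArgumentException whose message contains the offending character, which is why the test asserts only on that intersection. A small sketch (the class name is illustrative, not Spark code):

```java
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;

public class PatternErrorDemo {
    static final String PATTERN = "yyyy-MM-dd bb:HH:ss";

    // Legacy formatter: SimpleDateFormat rejects 'b' at construction,
    // with a message like "Illegal pattern character 'b'".
    static String legacyError() {
        try {
            new SimpleDateFormat(PATTERN);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    // New formatter: java.time rejects it too, with a message like
    // "Unknown pattern letter: b". Only the letter itself is common
    // to both messages, hence asserting on "b".
    static String newError() {
        try {
            DateTimeFormatter.ofPattern(PATTERN);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }
}
```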

select to_timestamp("2019 10:10:10", "yyyy hh:mm:ss");

-- Unsupported narrow text style
select date_format(date '2020-05-23', 'GGGGG');
Contributor:

Why do we remove these tests?

Member Author:

datetime-formatting-invalid.sql should be enough to cover these cases.

@SparkQA commented Jun 9, 2020

Test build #123680 has finished for PR 28650 at commit daac8dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[String] = None)
extends BinaryExpression with TimeZoneAwareExpression with ImplicitCastInputTypes {
extends BinaryExpression with TimestampFormatterHelper with ImplicitCastInputTypes
with NullIntolerant {
Contributor:

Shall we add the NullIntolerant to the base trait TimestampFormatterHelper? It seems better to do that in a new PR.

Member Author:

Copy that.

@SparkQA commented Jun 9, 2020

Test build #123687 has finished for PR 28650 at commit 9d578f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2020

Test build #123689 has finished for PR 28650 at commit 8139dfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6a424b9 Jun 9, 2020