[SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters
#27524
Conversation
    df.select(unix_timestamp(col("ss")).cast("timestamp")))
  checkAnswer(df.select(to_timestamp(col("ss"))), Seq(
    Row(ts1), Row(ts2)))
  if (legacyParser) {
I had to handle legacy mode specially here due to the behavior change of to_timestamp.
Unfortunately, SimpleDateFormat doesn't work correctly with the pattern .S. In Spark 2.4, it wasn't visible in the test because to_timestamp truncated results to seconds.
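The mismatch is easy to reproduce outside Spark. This minimal sketch (plain java.text, not the PR's code; the date and values are illustrative) shows that S in SimpleDateFormat means whole milliseconds, so a fractional part like .123456 is read as 123456 milliseconds rather than 123456 microseconds:

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// `S` in SimpleDateFormat is the MILLISECOND field, not a decimal fraction,
// so ".123456" is parsed as 123456 whole milliseconds (over two minutes).
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val parsed = fmt.parse("2020-02-10 00:00:00.123456")
// 2020-02-10 00:00:00 UTC is 1581292800000 ms since the epoch;
// the "fraction" lands 123456 ms on top of it.
println(parsed.getTime - 1581292800000L) // 123456
```

This is why results that look correct after truncation to seconds (as to_timestamp did in 2.4) turn out wrong at sub-second precision.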
@cloud-fan Only here did I have to modify the test to adapt it to the legacy parser.
Test build #118172 has finished for PR 27524 at commit

Test build #118175 has finished for PR 27524 at commit
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@cloud-fan @HyukjinKwon Please look at the draft PR.
@@ -1,2 +1,2 @@
 "good record",1999-08-01
-"bad record",1999-088-01
+"bad record",1999-088_01
I had to change this because FastDateFormat is not as strict, and can successfully parse 1999-088-01.
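For illustration, the same lenient behavior can be reproduced with SimpleDateFormat (used here instead of FastDateFormat only to keep the sketch dependency-free; per the discussion, FastDateFormat accepts such input as well): in the default lenient mode the out-of-range month 088 is normalized rather than rejected, while lenient = false refuses it.

```scala
import java.text.{ParseException, SimpleDateFormat}

// Default (lenient) mode: month 88 is accepted and rolled forward.
val lenient = new SimpleDateFormat("yyyy-MM-dd")
println(lenient.parse("1999-088-01")) // parses successfully

// Strict mode (lenient = false): the same input is rejected.
val strict = new SimpleDateFormat("yyyy-MM-dd")
strict.setLenient(false)
try {
  strict.parse("1999-088-01")
  assert(false, "expected a ParseException")
} catch {
  case _: ParseException => () // rejected, as expected
}
```

This is the reason the "bad record" in the test fixture had to be made bad in a way even a lenient parser cannot rescue.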
Do we run these tests with the legacy formatter?
Yes, I added CSVLegacyTimeParserSuite, which runs the entire CSVSuite with the legacy parser.
@@ -1,2 +1,2 @@
-0,2013-111-11 12:13:14
+0,2013-111_11 12:13:14
2013-111-11 is valid for FastDateFormat
 * Also this class allows to set raw value to the `MILLISECOND` field
 * directly before formatting.
 */
class MicrosCalendar(tz: TimeZone, digitsInFraction: Int)
This is a copy-paste from 2.4
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala
Approach seems okay.

Test build #118185 has finished for PR 27524 at commit

Test build #118213 has finished for PR 27524 at commit
val MAX_LONG_DIGITS = 18

-private val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)
+val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)
POW_10 is needed in the wrapper of FastDateFormat to support parsing/formatting in microsecond precision. Similar changes were made in Spark 2.4.
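As a hedged sketch of that idea (MAX_LONG_DIGITS and POW_10 mirror the diff above, but fractionToMicros is a hypothetical helper, not the PR's actual code): a precomputed power-of-ten table lets the wrapper scale a parsed fraction of a second to microseconds without calling math.pow on every row.

```scala
// Precomputed powers of ten, as in the diff above.
val MAX_LONG_DIGITS = 18
val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)

// Hypothetical helper: a fraction parsed from "123" with 3 digits denotes
// 0.123 s; padding to 6 digits yields 123000 microseconds.
def fractionToMicros(fraction: Long, digitsInFraction: Int): Long =
  fraction * POW_10(6 - digitsInFraction)

println(fractionToMicros(123L, 3))    // 123000
println(fractionToMicros(123456L, 6)) // 123456
```

The table lookup makes the scaling a single multiplication, which matters when parsing every row of a datasource.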
Test build #118215 has finished for PR 27524 at commit

jenkins, retest this, please
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
Are all the tests in this file unaffected by the choice between the new and legacy formatters?
I wrapped the affected tests in:

Seq(false, true).foreach { legacyParser =>
  withSQLConf(SQLConf.LEGACY_TIME_PARSER_ENABLED.key -> legacyParser.toString) {
    // test body
  }
}

They work fine with SimpleDateFormat and lenient = false.
Test build #118219 has finished for PR 27524 at commit

jenkins, retest this, please
Test build #118232 has finished for PR 27524 at commit

thanks, merging to master/3.0!
[SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582.
- `LegacySimpleTimestampFormatter` - uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in the CSV/JSON datasources.
- `SimpleDateFormat` - used in the JDBC datasource and in partition parsing.
- `SimpleDateFormat` in strict mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe the behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c198620)
Signed-off-by: Wenchen Fan <[email protected]>