
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Feb 10, 2020

What changes were proposed in this pull request?

In the PR, I propose to add legacy date/timestamp formatters based on SimpleDateFormat and FastDateFormat:

  • LegacyFastTimestampFormatter - uses FastDateFormat and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582.
  • LegacySimpleTimestampFormatter - uses SimpleDateFormat and supports the lenient mode. When the lenient parameter is set to false, the parser becomes much stricter in checking its input (see the sketch below).

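A minimal sketch (my illustration, not the PR's code) of what the lenient flag changes in java.text.SimpleDateFormat:

    import java.text.SimpleDateFormat
    import java.util.Locale

    // Lenient parsing (the SimpleDateFormat default) silently rolls invalid fields over:
    val lenient = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
    println(lenient.parse("1999-88-01")) // parses: month 88 rolls over into a later year

    // With lenient = false, the same input is rejected:
    val strict = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
    strict.setLenient(false)
    println(strict.parse("1999-08-01")) // OK
    strict.parse("1999-88-01")          // throws java.text.ParseException
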
Why are the changes needed?

Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:

  • DateTimeFormat in the CSV/JSON datasources.
  • SimpleDateFormat - used in the JDBC datasource and in partition parsing.
  • SimpleDateFormat in strong mode (lenient = false), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the date_format, from_unixtime, unix_timestamp and to_unix_timestamp functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when spark.sql.legacy.timeParser.enabled is set to true.

Does this PR introduce any user-facing change?

This shouldn't change behavior with default settings. If spark.sql.legacy.timeParser.enabled is set to true, users should observe the behavior of Spark 2.4.
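
For example, a hedged usage sketch (assumes a running SparkSession named spark):

    // Opt back in to the Spark 2.4 parsing/formatting behavior:
    spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
    // The CSV/JSON datasources and functions such as to_timestamp and
    // unix_timestamp now go through the legacy formatters.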

How was this patch tested?

  • Modified tests in DateExpressionsSuite to check the legacy parser - SimpleDateFormat.
  • Added CSVLegacyTimeParserSuite and JsonLegacyTimeParserSuite to run CSVSuite and JsonSuite with the legacy parser - FastDateFormat.

checkAnswer(df.select(to_timestamp(col("ss"))),
  df.select(unix_timestamp(col("ss")).cast("timestamp")))
checkAnswer(df.select(to_timestamp(col("ss"))), Seq(
  Row(ts1), Row(ts2)))
if (legacyParser) {
Member Author

I had to handle legacy mode specially because of the behavior change of to_timestamp.

Member Author

Unfortunately, SimpleDateFormat doesn't work correctly with the pattern .S. In Spark 2.4, this wasn't visible in the test because to_timestamp truncated results to seconds.
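
A minimal illustration of the .S issue (my sketch, not the test code): SimpleDateFormat treats the fraction as a raw millisecond count rather than a decimal fraction:

    import java.text.SimpleDateFormat

    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S")
    // ".1" is parsed as 1 millisecond, not as 100 milliseconds:
    val ts = fmt.parse("2020-02-10 12:13:14.1")
    println(ts.getTime % 1000) // prints 1, not 100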

Member Author

@cloud-fan Only here did I have to modify the test to adapt it for the legacy parser.

@SparkQA

SparkQA commented Feb 10, 2020

Test build #118172 has finished for PR 27524 at commit 38d90d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait LegacyTimestampFormatter extends TimestampFormatter
  • class LegacyFastDateFormatter(
  • class LegacySimpleDateFormatter(

@SparkQA

SparkQA commented Feb 10, 2020

Test build #118175 has finished for PR 27524 at commit ab1d57f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait LegacyDateFormatter extends DateFormatter
  • class LegacyFastDateFormatter(pattern: String, locale: Locale) extends LegacyDateFormatter
  • class LegacySimpleDateFormatter(pattern: String, locale: Locale) extends LegacyDateFormatter
  • class LegacyFastTimestampFormatter(
  • class LegacySimpleTimestampFormatter(

@MaxGekk
Member Author

MaxGekk commented Feb 10, 2020

@cloud-fan @HyukjinKwon Please take a look at the draft PR.

@@ -1,2 +1,2 @@
"good record",1999-08-01
"bad record",1999-088-01
"bad record",1999-088_01
Member Author

I had to change this because FastDateFormat is not as strict, and can parse 1999-088-01 (see the sketch below).
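
A small sketch of that leniency (my illustration; assumes commons-lang3 on the classpath):

    import org.apache.commons.lang3.time.FastDateFormat

    val fdf = FastDateFormat.getInstance("yyyy-MM-dd")
    // Month 88 is not rejected; the underlying lenient calendar rolls it over:
    println(fdf.parse("1999-088-01"))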

Contributor

Do we run these tests with the legacy formatter?

Member Author

Yes, I added CSVLegacyTimeParserSuite, which runs the entire CSVSuite with the legacy parser.
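
Roughly this shape (a hedged sketch of the pattern, not the exact suite code):

    class CSVLegacyTimeParserSuite extends CSVSuite {
      // Re-runs every test in CSVSuite with the legacy FastDateFormat-based parser.
      override protected def sparkConf: SparkConf =
        super.sparkConf.set(SQLConf.LEGACY_TIME_PARSER_ENABLED.key, "true")
    }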

@@ -1,2 +1,2 @@
-0,2013-111-11 12:13:14
+0,2013-111_11 12:13:14
Member Author
@MaxGekk MaxGekk Feb 10, 2020

2013-111-11 is valid input for FastDateFormat.

 * Also this class allows setting a raw value to the `MILLISECOND` field
 * directly before formatting.
 */
class MicrosCalendar(tz: TimeZone, digitsInFraction: Int)
Member Author

This is a copy-paste from Spark 2.4.

@HyukjinKwon
Member

Approach seems okay.

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118185 has finished for PR 27524 at commit 566170a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118213 has finished for PR 27524 at commit ddf127d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 val MAX_LONG_DIGITS = 18

-private val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)
+val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)
Member Author
@MaxGekk MaxGekk Feb 11, 2020

POW_10 is needed in the wrapper of FastDateFormat to support parsing/formatting with microsecond precision. Similar changes were made in Spark 2.4.
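
The idea, as a hedged sketch (fractionToMicros is my illustrative name, not the PR's):

    val MAX_LONG_DIGITS = 18
    val POW_10: Array[Long] = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)

    // Scale a parsed fraction to microseconds, e.g. ".5" (one digit) -> 500000 us:
    def fractionToMicros(fraction: Long, digits: Int): Long =
      fraction * POW_10(6 - digits)

    println(fractionToMicros(5, 1)) // 500000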

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118215 has finished for PR 27524 at commit 93f3ae1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 11, 2020

jenkins, retest this, please

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}

Contributor

Are all the tests in this file unaffected by the new or legacy formatter?

Member Author

I wrapped the tests that are affected in:

    Seq(false, true).foreach { legacyParser =>
      withSQLConf(SQLConf.LEGACY_TIME_PARSER_ENABLED.key -> legacyParser.toString) {
        // the affected test body runs under both parsers
      }
    }

Member Author

They work fine with SimpleDateFormat and lenient = false.

@MaxGekk MaxGekk changed the title [WIP][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters [SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters Feb 11, 2020
@SparkQA

SparkQA commented Feb 11, 2020

Test build #118219 has finished for PR 27524 at commit 93f3ae1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 11, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118232 has finished for PR 27524 at commit 93f3ae1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in c198620 Feb 12, 2020
cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
[SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582
- `LegacySimpleTimestampFormatter` - uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in the CSV/JSON datasources.
- `SimpleDateFormat` - used in the JDBC datasource and in partition parsing.
- `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c198620)
Signed-off-by: Wenchen Fan <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@MaxGekk MaxGekk deleted the timestamp-formatter-legacy-fallback branch June 5, 2020 19:43