[SPARK-26178][SQL] Use java.time API for parsing timestamps and dates from CSV #23150
Closed
Changes from all commits (32):
74a76c2  New and legacy time parser (MaxGekk)
63cf611  Add config spark.sql.legacy.timeParser.enabled (MaxGekk)
2a2ab83  Fallback legacy parser (MaxGekk)
667bf9f  something (MaxGekk)
227a7bd  Using instances (MaxGekk)
73ee560  Added generator (MaxGekk)
f35f6e1  Refactoring of TimeFormatter (MaxGekk)
1c09b58  Renaming to DateTimeFormatter (MaxGekk)
7b213d5  Added DateFormatter (MaxGekk)
242ba47  Default values in parsing (MaxGekk)
db48ee6  Parse as date type because format for timestamp is not not matched to… (MaxGekk)
e18841b  Fix tests (MaxGekk)
8db0238  CSVSuite passed (MaxGekk)
0b9ed92  Fix imports (MaxGekk)
799ebb3  Revert test back (MaxGekk)
5a22391  Set timeZone (MaxGekk)
f287b77  Removing default for micros because it causes conflicts in parsing (MaxGekk)
52074f7  Set timezone otherwise default is using (MaxGekk)
647b09c  Removing CSVOptions param from CsvInferSchema methods (MaxGekk)
4d6c86b  Use constants (MaxGekk)
6552dcf  Merge remote-tracking branch 'origin/master' into time-parser (MaxGekk)
f3f46c7  Merging followup (MaxGekk)
3f3ca70  Updating the migration guide (MaxGekk)
1dd9ed1  Inlining method's arguments (MaxGekk)
83bf58b  Additional fallback (MaxGekk)
00509d3  Removing unrelated changes (MaxGekk)
9b0570e  Merge remote-tracking branch 'origin/master' into time-parser (MaxGekk)
e9d6bb0  Using floorDiv to take days from seconds (MaxGekk)
1ad1184  A test for roundtrip timestamp parsing (MaxGekk)
f8097b4  Tests for DateTimeFormatter (MaxGekk)
3848795  Fix typo (MaxGekk)
60c0974  Merge remote-tracking branch 'fork/time-parser' into time-parser (MaxGekk)
CSVInferSchema.scala

@@ -22,10 +22,16 @@ import scala.util.control.Exception.allCatch
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.analysis.TypeCoercion
 import org.apache.spark.sql.catalyst.expressions.ExprUtils
-import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.catalyst.util.DateTimeFormatter
 import org.apache.spark.sql.types._

-class CSVInferSchema(options: CSVOptions) extends Serializable {
+class CSVInferSchema(val options: CSVOptions) extends Serializable {

+  @transient
+  private lazy val timeParser = DateTimeFormatter(
+    options.timestampFormat,
+    options.timeZone,
+    options.locale)
+
   private val decimalParser = {
     ExprUtils.getDecimalParser(options.locale)

@@ -154,10 +160,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable {

   private def tryParseTimestamp(field: String): DataType = {
-    // This case infers a custom `dataFormat` is set.
-    if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
-      TimestampType
-    } else if ((allCatch opt DateTimeUtils.stringToTime(field)).isDefined) {
-      // We keep this for backwards compatibility.
+    if ((allCatch opt timeParser.parse(field)).isDefined) {
       TimestampType
     } else {
       tryParseBoolean(field)

Contributor review comment on the changed line `class CSVInferSchema(val options: CSVOptions)` (truncated in this capture): "since we get the …"
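For context, here is a minimal spark-shell sketch of how this inference path is exercised through the public CSV reader. It is not part of the PR; the sample values, pattern, and the assumption that `spark` and its implicits are available are illustrative only.

```scala
import spark.implicits._

// Two timestamp-like strings in a single CSV column.
val input = Seq("2018-12-02T11:22:33", "2018-12-03T00:00:00").toDS()

val df = spark.read
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss") // pattern handed to the formatter
  .option("inferSchema", "true")                      // triggers CSVInferSchema
  .csv(input)

// tryParseTimestamp now goes through timeParser.parse, so a column whose
// values all match the pattern should be inferred as TimestampType.
df.printSchema()
```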
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatter.scala (new file, 179 additions, 0 deletions)
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.util

import java.time._
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.{ChronoField, TemporalQueries}
import java.util.{Locale, TimeZone}

import scala.util.Try

import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.sql.internal.SQLConf

sealed trait DateTimeFormatter {
  def parse(s: String): Long // returns microseconds since epoch
  def format(us: Long): String
}

class Iso8601DateTimeFormatter(
    pattern: String,
    timeZone: TimeZone,
    locale: Locale) extends DateTimeFormatter {
  val formatter = new DateTimeFormatterBuilder()
    .appendPattern(pattern)
    .parseDefaulting(ChronoField.YEAR_OF_ERA, 1970)
    .parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
    .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
    .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
    .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
    .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
    .toFormatter(locale)

  def toInstant(s: String): Instant = {
    val temporalAccessor = formatter.parse(s)
    if (temporalAccessor.query(TemporalQueries.offset()) == null) {
      val localDateTime = LocalDateTime.from(temporalAccessor)
      val zonedDateTime = ZonedDateTime.of(localDateTime, timeZone.toZoneId)
      Instant.from(zonedDateTime)
    } else {
      Instant.from(temporalAccessor)
    }
  }

  private def instantToMicros(instant: Instant): Long = {
    val sec = Math.multiplyExact(instant.getEpochSecond, DateTimeUtils.MICROS_PER_SECOND)
    val result = Math.addExact(sec, instant.getNano / DateTimeUtils.NANOS_PER_MICROS)
    result
  }

  def parse(s: String): Long = instantToMicros(toInstant(s))

  def format(us: Long): String = {
    val secs = Math.floorDiv(us, DateTimeUtils.MICROS_PER_SECOND)
    val mos = Math.floorMod(us, DateTimeUtils.MICROS_PER_SECOND)
    val instant = Instant.ofEpochSecond(secs, mos * DateTimeUtils.NANOS_PER_MICROS)

    formatter.withZone(timeZone.toZoneId).format(instant)
  }
}

class LegacyDateTimeFormatter(
    pattern: String,
    timeZone: TimeZone,
    locale: Locale) extends DateTimeFormatter {
  val format = FastDateFormat.getInstance(pattern, timeZone, locale)

  protected def toMillis(s: String): Long = format.parse(s).getTime

  def parse(s: String): Long = toMillis(s) * DateTimeUtils.MICROS_PER_MILLIS

  def format(us: Long): String = {
    format.format(DateTimeUtils.toJavaTimestamp(us))
  }
}

class LegacyFallbackDateTimeFormatter(
    pattern: String,
    timeZone: TimeZone,
    locale: Locale) extends LegacyDateTimeFormatter(pattern, timeZone, locale) {
  override def toMillis(s: String): Long = {
    Try {super.toMillis(s)}.getOrElse(DateTimeUtils.stringToTime(s).getTime)
  }
}

object DateTimeFormatter {
  def apply(format: String, timeZone: TimeZone, locale: Locale): DateTimeFormatter = {
    if (SQLConf.get.legacyTimeParserEnabled) {
      new LegacyFallbackDateTimeFormatter(format, timeZone, locale)
    } else {
      new Iso8601DateTimeFormatter(format, timeZone, locale)
    }
  }
}

sealed trait DateFormatter {
  def parse(s: String): Int // returns days since epoch
  def format(days: Int): String
}

class Iso8601DateFormatter(
    pattern: String,
    timeZone: TimeZone,
    locale: Locale) extends DateFormatter {

  val dateTimeFormatter = new Iso8601DateTimeFormatter(pattern, timeZone, locale)

  override def parse(s: String): Int = {
    val seconds = dateTimeFormatter.toInstant(s).getEpochSecond
    val days = Math.floorDiv(seconds, DateTimeUtils.SECONDS_PER_DAY)

    days.toInt
  }

  override def format(days: Int): String = {
    val instant = Instant.ofEpochSecond(days * DateTimeUtils.SECONDS_PER_DAY)
    dateTimeFormatter.formatter.withZone(timeZone.toZoneId).format(instant)
  }
}

class LegacyDateFormatter(
    pattern: String,
    timeZone: TimeZone,
    locale: Locale) extends DateFormatter {
  val format = FastDateFormat.getInstance(pattern, timeZone, locale)

  def parse(s: String): Int = {
    val milliseconds = format.parse(s).getTime
    DateTimeUtils.millisToDays(milliseconds)
  }

  def format(days: Int): String = {
    val date = DateTimeUtils.toJavaDate(days)
    format.format(date)
  }
}

class LegacyFallbackDateFormatter(
    pattern: String,
    timeZone: TimeZone,
    locale: Locale) extends LegacyDateFormatter(pattern, timeZone, locale) {
  override def parse(s: String): Int = {
    Try(super.parse(s)).orElse {
      // If it fails to parse, then tries the way used in 2.0 and 1.x for backwards
      // compatibility.
      Try(DateTimeUtils.millisToDays(DateTimeUtils.stringToTime(s).getTime))
    }.getOrElse {
      // In Spark 1.5.0, we store the data as number of days since epoch in string.
      // So, we just convert it to Int.
      s.toInt
    }
  }
}

object DateFormatter {
  def apply(format: String, timeZone: TimeZone, locale: Locale): DateFormatter = {
    if (SQLConf.get.legacyTimeParserEnabled) {
      new LegacyFallbackDateFormatter(format, timeZone, locale)
    } else {
      new Iso8601DateFormatter(format, timeZone, locale)
    }
  }
}
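As a usage sketch (not code from the PR, and assuming the classes above are visible on the classpath), here is roughly how the new formatters behave for a round trip; the pattern, time zone, and sample values are arbitrary:

```scala
import java.util.{Locale, TimeZone}

// Round-trip a timestamp through the java.time-based formatter.
val tsFormatter = new Iso8601DateTimeFormatter(
  "yyyy-MM-dd'T'HH:mm:ss.SSSXXX", TimeZone.getTimeZone("UTC"), Locale.US)
val micros = tsFormatter.parse("2018-12-02T10:11:12.001Z") // microseconds since epoch
val text = tsFormatter.format(micros)                      // back to the same string in UTC

// Dates are handled analogously, but as days since epoch.
val dateFormatter = new Iso8601DateFormatter(
  "yyyy-MM-dd", TimeZone.getTimeZone("UTC"), Locale.US)
val days = dateFormatter.parse("2018-12-02")               // days since 1970-01-01
```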
@MaxGekk, can you check if this legacy configuration works or not?
I checked it as below:
adding @cloud-fan
I'm surprised it doesn't work, as this pattern of using SQLConf appears in many places.
Can you create a ticket for it? Is this only a problem when setting the conf via spark-shell?
Definitely the flag switches behavior, since I used it in a test recently: https://github.com/apache/spark/pull/23196/files#diff-fde14032b0e6ef8086461edf79a27c5dR1454
Yea, I saw the test, but weirdly it doesn't work in the shell. Do you mind checking it the same way I did? Something is weird. I want to be very sure whether it's an issue or something I did wrong myself.
I see the same, but it is interesting: it seems that when an instance of `CSVInferSchema` is created, the SQL configs haven't been set on the executor side yet.
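The exact snippet used for the check above is not preserved in this capture. A hypothetical reconstruction of that kind of spark-shell experiment (sample data and pattern invented for illustration) could look like this:

```scala
// Toggle the legacy parser flag in the session, then infer a CSV schema.
spark.conf.set("spark.sql.legacy.timeParser.enabled", true)

import spark.implicits._
val ds = Seq("10/31/2010 01:30").toDS()

spark.read
  .option("timestampFormat", "MM/dd/yyyy HH:mm")
  .option("inferSchema", "true")
  .csv(ds)
  .printSchema()

// If the session conf is not visible where CSVInferSchema builds its
// DateTimeFormatter (e.g. on executors), toggling the flag has no effect on
// the inferred schema; that is the behavior tracked by SPARK-26384.
```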
Ah, good that I wasn't doing something stupid alone. Let's file a JIRA.
Here is the ticket: https://issues.apache.org/jira/browse/SPARK-26384
We removed this conf in the follow-up PRs?
#23495 is the PR that removed the conf.