-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet #27915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
41 commits
Select commit
Hold shift + click to select a range
ae83fdc
Add rebaseGregorianToJulianMicros()
MaxGekk 4c35ddb
Add rebaseJulianToGregorianMicros()
MaxGekk 74774fc
Add a comment to rebaseJulianToGregorianMicros()
MaxGekk 8d74214
Add comment about calendar
MaxGekk 56ca744
Add round trip test for micros
MaxGekk 0cfeed5
Rename to JULIAN_CUTOVER_MICROS
MaxGekk 78cbd6c
Add rebaseJulianToGregorianDays and rebaseGregorianToJulianDays
MaxGekk 13aad60
Minor fix code style
MaxGekk 96573a9
Add comments for rebaseGregorianToJulianDays()
MaxGekk 36c0400
Add the SQL config spark.sql.legacy.parquet.rebaseDateTime.enabled
MaxGekk f0a2df6
Perform rebase in write
MaxGekk 9e3c201
Perform rebase dates in write
MaxGekk 053861c
Perform rebase dates/timestamps in read
MaxGekk 1624756
Rewrite days rebasing using Java 7 API
MaxGekk e3bbcb5
Rewrite micros rebasing using Java 7 API
MaxGekk d1e6d84
Extract common code
MaxGekk acd33f1
Revert "Extract common code"
MaxGekk 41fc33f
Revert "Rewrite micros rebasing using Java 7 API"
MaxGekk fe9f130
Revert "Rewrite days rebasing using Java 7 API"
MaxGekk c2c53b8
Remove branching by cutover days in rebase functions
MaxGekk 8e94359
Rebasing via system time zone
MaxGekk d6f7e6b
Rebase dates via UTC local date
MaxGekk 81d342a
Check more time zones in days rebasing
MaxGekk 63428ab
More dates/timestamps for testing
MaxGekk a34a9ce
Rename utcCal to cal
MaxGekk e590d36
Test multiple time zones in rebasing timestamps
MaxGekk 262f744
Test reading parquet files written by Spark 2.4
MaxGekk 8947298
Remove .asInstanceOf[DateType#InternalType]
MaxGekk 276d159
Change SQL config description
MaxGekk 167b463
Rebase timestamp INT96
MaxGekk bbc4a1a
Support rebasing in VectorizedColumnReader
MaxGekk a1b34cb
Bug fix in write
MaxGekk d7debb4
Add test for write
MaxGekk 6bebf3b
Read SQL config in place
MaxGekk 67cec02
Remove a gap
MaxGekk 8fa19a6
Remove config settings from ParquetWriteBuilder
MaxGekk a061870
Initialize rebaseDateTime in default constructor in ParquetWriteSupport
MaxGekk ae49cc4
Check INT96 rebasing regardless of SQL config settings
MaxGekk a96392c
Add JIRA id
MaxGekk 5b52735
Test INT96 w/ and w/o vectorized reader
MaxGekk 184fcd8
Merge remote-tracking branch 'remotes/origin/master' into rebase-parq…
MaxGekk File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,7 +21,7 @@ import java.nio.charset.StandardCharsets | |
| import java.sql.{Date, Timestamp} | ||
| import java.time._ | ||
| import java.time.temporal.{ChronoField, ChronoUnit, IsoFields} | ||
| import java.util.{Locale, TimeZone} | ||
| import java.util.{Calendar, Locale, TimeZone} | ||
| import java.util.concurrent.TimeUnit._ | ||
|
|
||
| import scala.util.control.NonFatal | ||
|
|
@@ -148,7 +148,9 @@ object DateTimeUtils { | |
| def fromJulianDay(day: Int, nanoseconds: Long): SQLTimestamp = { | ||
| // use Long to avoid rounding errors | ||
| val seconds = (day - JULIAN_DAY_OF_EPOCH).toLong * SECONDS_PER_DAY | ||
| SECONDS.toMicros(seconds) + NANOSECONDS.toMicros(nanoseconds) | ||
| val micros = SECONDS.toMicros(seconds) + NANOSECONDS.toMicros(nanoseconds) | ||
| val rebased = rebaseJulianToGregorianMicros(micros) | ||
| rebased | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -157,7 +159,7 @@ object DateTimeUtils { | |
| * Note: support timestamp since 4717 BC (without negative nanoseconds, compatible with Hive). | ||
| */ | ||
| def toJulianDay(us: SQLTimestamp): (Int, Long) = { | ||
| val julian_us = us + JULIAN_DAY_OF_EPOCH * MICROS_PER_DAY | ||
| val julian_us = rebaseGregorianToJulianMicros(us) + JULIAN_DAY_OF_EPOCH * MICROS_PER_DAY | ||
| val day = julian_us / MICROS_PER_DAY | ||
| val micros = julian_us % MICROS_PER_DAY | ||
| (day.toInt, MICROSECONDS.toNanos(micros)) | ||
|
|
@@ -936,4 +938,102 @@ object DateTimeUtils { | |
| val days = period.getDays | ||
| new CalendarInterval(months, days, 0) | ||
| } | ||
|
|
||
| /** | ||
| * Converts the given microseconds to a local date-time in UTC time zone in Proleptic Gregorian | ||
| * calendar, interprets the result as a local date-time in Julian calendar in UTC time zone. | ||
| * And takes microseconds since the epoch from the Julian timestamp. | ||
| * | ||
| * @param micros The number of microseconds since the epoch '1970-01-01T00:00:00Z'. | ||
| * @return The rebased microseconds since the epoch in Julian calendar. | ||
| */ | ||
| def rebaseGregorianToJulianMicros(micros: Long): Long = { | ||
| val ldt = microsToInstant(micros).atZone(ZoneId.systemDefault).toLocalDateTime | ||
| val cal = new Calendar.Builder() | ||
| // `gregory` is a hybrid calendar that supports both | ||
| // the Julian and Gregorian calendar systems | ||
| .setCalendarType("gregory") | ||
| .setDate(ldt.getYear, ldt.getMonthValue - 1, ldt.getDayOfMonth) | ||
| .setTimeOfDay(ldt.getHour, ldt.getMinute, ldt.getSecond) | ||
| .build() | ||
| millisToMicros(cal.getTimeInMillis) + ldt.get(ChronoField.MICRO_OF_SECOND) | ||
| } | ||
|
|
||
| /** | ||
| * Converts the given microseconds to a local date-time in UTC time zone in Julian calendar, | ||
| * interprets the result as a local date-time in Proleptic Gregorian calendar in UTC time zone. | ||
| * And takes microseconds since the epoch from the Gregorian timestamp. | ||
| * | ||
| * @param micros The number of microseconds since the epoch '1970-01-01T00:00:00Z'. | ||
| * @return The rebased microseconds since the epoch in Proleptic Gregorian calendar. | ||
| */ | ||
| def rebaseJulianToGregorianMicros(micros: Long): Long = { | ||
| val cal = new Calendar.Builder() | ||
| // `gregory` is a hybrid calendar that supports both | ||
| // the Julian and Gregorian calendar systems | ||
| .setCalendarType("gregory") | ||
| .setInstant(microsToMillis(micros)) | ||
| .build() | ||
| val localDateTime = LocalDateTime.of( | ||
| cal.get(Calendar.YEAR), | ||
| cal.get(Calendar.MONTH) + 1, | ||
| cal.get(Calendar.DAY_OF_MONTH), | ||
| cal.get(Calendar.HOUR_OF_DAY), | ||
| cal.get(Calendar.MINUTE), | ||
| cal.get(Calendar.SECOND), | ||
| (Math.floorMod(micros, MICROS_PER_SECOND) * NANOS_PER_MICROS).toInt) | ||
| instantToMicros(localDateTime.atZone(ZoneId.systemDefault).toInstant) | ||
| } | ||
|
|
||
| /** | ||
| * Converts the given number of days since the epoch day 1970-01-01 to | ||
| * a local date in Julian calendar, interprets the result as a local | ||
| * date in Proleptic Gregorian calendar, and take the number of days | ||
| * since the epoch from the Gregorian date. | ||
| * | ||
| * @param days The number of days since the epoch in Julian calendar. | ||
| * @return The rebased number of days in Gregorian calendar. | ||
| */ | ||
| def rebaseJulianToGregorianDays(days: Int): Int = { | ||
| val utcCal = new Calendar.Builder() | ||
| // `gregory` is a hybrid calendar that supports both | ||
| // the Julian and Gregorian calendar systems | ||
| .setCalendarType("gregory") | ||
| .setTimeZone(TimeZoneUTC) | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can use particular time zone here because the conversion of "logical" days is independent from time zone, actually. |
||
| .setInstant(Math.multiplyExact(days, MILLIS_PER_DAY)) | ||
| .build() | ||
| val localDate = LocalDate.of( | ||
| utcCal.get(Calendar.YEAR), | ||
| utcCal.get(Calendar.MONTH) + 1, | ||
| utcCal.get(Calendar.DAY_OF_MONTH)) | ||
| Math.toIntExact(localDate.toEpochDay) | ||
| } | ||
|
|
||
| /** | ||
| * Rebasing days since the epoch to store the same number of days | ||
| * as by Spark 2.4 and earlier versions. Spark 3.0 switched to | ||
| * Proleptic Gregorian calendar (see SPARK-26651), and as a consequence of that, | ||
| * this affects dates before 1582-10-15. Spark 2.4 and earlier versions use | ||
| * Julian calendar for dates before 1582-10-15. So, the same local date may | ||
| * be mapped to different number of days since the epoch in different calendars. | ||
| * | ||
| * For example: | ||
| * Proleptic Gregorian calendar: 1582-01-01 -> -141714 | ||
| * Julian calendar: 1582-01-01 -> -141704 | ||
| * The code below converts -141714 to -141704. | ||
| * | ||
| * @param days The number of days since the epoch 1970-01-01. It can be negative. | ||
| * @return The rebased number of days since the epoch in Julian calendar. | ||
| */ | ||
| def rebaseGregorianToJulianDays(days: Int): Int = { | ||
| val localDate = LocalDate.ofEpochDay(days) | ||
| val utcCal = new Calendar.Builder() | ||
| // `gregory` is a hybrid calendar that supports both | ||
| // the Julian and Gregorian calendar systems | ||
| .setCalendarType("gregory") | ||
| .setTimeZone(TimeZoneUTC) | ||
| .setDate(localDate.getYear, localDate.getMonthValue - 1, localDate.getDayOfMonth) | ||
| .build() | ||
| Math.toIntExact(Math.floorDiv(utcCal.getTimeInMillis, MILLIS_PER_DAY)) | ||
| } | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I do believe this is a bug fix. We should rebase independently from new SQL config
spark.sql.legacy.parquet.rebaseDateTime.enabledbecause days are stored as Julian days in Parquet INT96 timestamps.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check
https://github.com/apache/spark/pull/27915/files#diff-9bca6e248a070768e9f38ee5929411d8R897-R898
reads INT96 timestamp saved by Spark 2.4.5 via:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. INT96 is a legacy timestamp type in parquet and I'm not surprised if it follows the java 7 semantic.