Conversation

@MaxGekk commented May 8, 2020

What changes were proposed in this pull request?

Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `DateType` specially when the passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianDays()`.
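
For illustration, here is a minimal Scala sketch of that rebasing step (not the actual change, which lives in the Java class `VectorizedColumnReader`; the method name and array-based inputs below are hypothetical, and only `RebaseDateTime.rebaseJulianToGregorianDays()` is the real API named above):

```scala
import org.apache.spark.sql.catalyst.util.RebaseDateTime

// Hypothetical sketch: look up each row's day count in the dictionary and,
// when rebasing is requested, convert it from the hybrid Julian calendar
// to the Proleptic Gregorian calendar before it is put into the column vector.
def decodeDictionaryEncodedDates(
    dictionaryIds: Array[Int],  // per-row dictionary ids (assumed layout)
    dictionary: Array[Int],     // distinct days-since-epoch values (assumed layout)
    rebaseDateTime: Boolean): Array[Int] = {
  dictionaryIds.map { id =>
    val days = dictionary(id)
    if (rebaseDateTime) RebaseDateTime.rebaseJulianToGregorianDays(days) else days
  }
}
```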

Why are the changes needed?

This fixes a bug in loading dates before the cutover day from dictionary-encoded columns in Parquet files. The code below forces dictionary encoding:

```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```

Load the dates back:

```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```

Expected values must be **1001-01-01**, not 1001-01-07.
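
The six-day shift is exactly the gap between the two calendars in that era: the day count written under the hybrid (Julian) calendar for 1001-01-01, when reinterpreted as a Proleptic Gregorian day count, lands on 1001-01-07. A plain JVM snippet (not part of this PR, JDK APIs only) that reproduces the offset:

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Epoch day of 1001-01-01 in the hybrid calendar: java.util.GregorianCalendar
// uses the Julian calendar for dates before the 1582-10-15 cutover.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1001, Calendar.JANUARY, 1)
val hybridDays = Math.floorDiv(cal.getTimeInMillis, 86400000L)

// The same day count reinterpreted as Proleptic Gregorian, which is what
// happens when the stored value is not rebased on read.
println(LocalDate.ofEpochDay(hybridDays))                  // 1001-01-07
println(hybridDays - LocalDate.of(1001, 1, 1).toEpochDay)  // 6
```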

Does this PR introduce any user-facing change?

Yes. After the changes:

```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```

How was this patch tested?

Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading of dictionary-encoded dates.
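
For context, a minimal sketch of the kind of round trip the test now covers (not the actual `ParquetIOSuite` code; it assumes a Spark session with `spark.implicits._` imported and a temporary directory in `path`):

```scala
// Write pre-cutover dates with dictionary encoding forced, read them back,
// and verify the original values survive the round trip.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)

val readBack = spark.read.parquet(path).select($"date".cast("string")).collect()
assert(readBack.forall(_.getString(0) == "1001-01-01"))
```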

@MaxGekk commented May 8, 2020

@cloud-fan @HyukjinKwon Please review this PR.

@SparkQA commented May 8, 2020

Test build #122440 has finished for PR 28479 at commit 0a560b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request May 10, 2020
…nary encoded Parquet columns

Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit ce63bef)
Signed-off-by: HyukjinKwon <[email protected]>
@MaxGekk deleted the fix-datetime-rebase-parquet-dict-enc branch June 5, 2020 19:49