Conversation

@MaxGekk commented May 8, 2020

What changes were proposed in this pull request?

Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `DateType` specially when the passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianDays()`.
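
For illustration, here is a minimal Scala sketch of that rebasing step (not the actual change, which lives in the Java class `VectorizedColumnReader`; the method name and array-based inputs below are hypothetical, and only `RebaseDateTime.rebaseJulianToGregorianDays()` is the real API named above):

```scala
import org.apache.spark.sql.catalyst.util.RebaseDateTime

// Hypothetical sketch: look up each row's day count in the dictionary and,
// when rebasing is requested, convert it from the hybrid Julian calendar
// to the Proleptic Gregorian calendar before it is put into the column vector.
def decodeDictionaryEncodedDates(
    dictionaryIds: Array[Int],  // per-row dictionary ids (assumed layout)
    dictionary: Array[Int],     // distinct days-since-epoch values (assumed layout)
    rebaseDateTime: Boolean): Array[Int] = {
  dictionaryIds.map { id =>
    val days = dictionary(id)
    if (rebaseDateTime) RebaseDateTime.rebaseJulianToGregorianDays(days) else days
  }
}
```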

Why are the changes needed?

This fixes a bug in loading dates before the cutover day from dictionary-encoded columns in Parquet files. The code below forces dictionary encoding:

```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```

Load the dates back:

```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```

Expected values must be **1001-01-01**, not 1001-01-07.
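
The six-day shift is exactly the gap between the two calendars in that era: the day count written under the hybrid (Julian) calendar for 1001-01-01, when reinterpreted as a Proleptic Gregorian day count, lands on 1001-01-07. A plain JVM snippet (not part of this PR, JDK APIs only) that reproduces the offset:

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Epoch day of 1001-01-01 in the hybrid calendar: java.util.GregorianCalendar
// uses the Julian calendar for dates before the 1582-10-15 cutover.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1001, Calendar.JANUARY, 1)
val hybridDays = Math.floorDiv(cal.getTimeInMillis, 86400000L)

// The same day count reinterpreted as Proleptic Gregorian, which is what
// happens when the stored value is not rebased on read.
println(LocalDate.ofEpochDay(hybridDays))                  // 1001-01-07
println(hybridDays - LocalDate.of(1001, 1, 1).toEpochDay)  // 6
```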

Does this PR introduce any user-facing change?

Yes. After the changes:

```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```

How was this patch tested?

Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading of dictionary-encoded dates.
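
For context, a minimal sketch of the kind of round trip the test now covers (not the actual `ParquetIOSuite` code; it assumes a Spark session with `spark.implicits._` imported and a temporary directory in `path`):

```scala
// Write pre-cutover dates with dictionary encoding forced, read them back,
// and verify the original values survive the round trip.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)

val readBack = spark.read.parquet(path).select($"date".cast("string")).collect()
assert(readBack.forall(_.getString(0) == "1001-01-01"))
```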

@MaxGekk commented May 8, 2020

@cloud-fan @HyukjinKwon Please review this PR.

@SparkQA commented May 8, 2020

Test build #122440 has finished for PR 28479 at commit 0a560b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request May 10, 2020
…nary encoded Parquet columns

Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit ce63bef)
Signed-off-by: HyukjinKwon <[email protected]>
@MaxGekk deleted the fix-datetime-rebase-parquet-dict-enc branch June 5, 2020 19:49