[SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `DateType` specially when the passed parameter `rebaseDateTime` is true. In that case, the decoded days are rebased from the hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianDays()`.
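To illustrate the idea, here is a minimal, self-contained sketch of rebasing dictionary-decoded days. It is not the code from this patch: `DictionaryRebaseSketch`, `rebaseDaysSketch` and `decodeDateDictionarySketch` are hypothetical names, the dictionary is modelled as a plain `Array[Int]` of days since the epoch, and only dates in the common era are assumed.

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Illustrative sketch only; the names below are hypothetical and this is
// not the Spark implementation.
object DictionaryRebaseSketch {
  private val MillisPerDay = 24L * 60 * 60 * 1000

  // Reinterprets a day count written under the hybrid Julian + Gregorian
  // calendar as the same year-month-day in the Proleptic Gregorian calendar.
  // Assumption: only common-era (AD) dates are passed in.
  def rebaseDaysSketch(days: Int): Int = {
    // java.util.GregorianCalendar is a hybrid calendar whose default
    // cutover is 1582-10-15; earlier instants are read as Julian dates.
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.setTimeInMillis(days * MillisPerDay)
    val local = LocalDate
      .of(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1, 1)
      // plusDays avoids an exception for days that exist only in the
      // Julian calendar, such as 1000-02-29.
      .plusDays(cal.get(Calendar.DAY_OF_MONTH) - 1L)
    Math.toIntExact(local.toEpochDay)
  }

  // Hypothetical stand-in for decoding a dictionary-encoded date column:
  // each id is looked up in the dictionary and optionally rebased.
  def decodeDateDictionarySketch(
      dictionary: Array[Int],
      ids: Array[Int],
      rebaseDateTime: Boolean): Array[Int] =
    ids.map { id =>
      val days = dictionary(id)
      if (rebaseDateTime) rebaseDaysSketch(days) else days
    }
}
```

In this sketch, a stored value equal to `LocalDate.of(1001, 1, 7).toEpochDay` (the day count of 1001-01-01 in the hybrid calendar) is rebased to `LocalDate.of(1001, 1, 1).toEpochDay`, while days on or after 1582-10-15 pass through unchanged.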
### Why are the changes needed?
This fixes a bug in loading dates before the cutover day (1582-10-15) from dictionary-encoded columns in Parquet files. The code below forces dictionary encoding:
```scala
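// Run in spark-shell; elsewhere, `import spark.implicits._` is needed for `toDF` and `$`.
// `path` is the output directory for the Parquet files.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)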
```
Load the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```
The expected value is **1001-01-01**, the date that was written, not 1001-01-07.
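The six-day shift in the wrong output matches the gap between the Julian and Gregorian calendars in the 11th century. As a rough sketch (the object and method names are hypothetical), the gap can be estimated with the usual secular-difference formula, assuming a common-era year and a date outside January and February of a century year:

```scala
import java.time.LocalDate

// Illustrative sketch only; names are hypothetical.
object CalendarOffsetSketch {
  // Number of days by which a Gregorian date is ahead of the Julian date
  // for the same instant. Assumptions: year >= 1 and the date is not in
  // January or February of a year divisible by 100.
  def gregorianMinusJulianDays(year: Int): Int =
    year / 100 - year / 400 - 2

  // A Julian date's day count, read as Proleptic Gregorian without
  // rebasing, appears shifted forward by that gap.
  def misreadWithoutRebase(julianDate: LocalDate): LocalDate =
    julianDate.plusDays(gregorianMinusJulianDays(julianDate.getYear).toLong)
}
```

For the year 1001 the formula gives 10 - 2 - 2 = 6 days, which is why a stored 1001-01-01 shows up as 1001-01-07 when the day count is not rebased.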
### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```
### How was this patch tested?
Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading of dictionary-encoded dates.
Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>