Conversation

Member

MaxGekk commented Jul 16, 2021

What changes were proposed in this pull request?

In this PR, I propose to propagate the SQL config `spark.sql.parquet.datetimeRebaseModeInRead` and/or the Parquet option `datetimeRebaseMode` to `ParquetFilters`. The `ParquetFilters` class uses these settings when converting date/timestamp instances from datasource filters into the values pushed via `FilterApi` to the `parquet-column` library.
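
As a hypothetical illustration of that conversion (not code from this PR), a date equality filter ends up as a `parquet-column` predicate over the column's epoch-day encoding:

```scala
import org.apache.parquet.filter2.predicate.FilterApi

// Illustration only: a datasource filter such as
// EqualTo("date", Date.valueOf("0001-01-01")) is pushed down as a
// predicate on the INT32 epoch-day representation of the date.
val dateCol = FilterApi.intColumn("date")
val pushed = FilterApi.eq(dateCol, java.lang.Integer.valueOf(-719162))
```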

Before the changes, date/timestamp values expressed as days/microseconds/milliseconds were interpreted as offsets in the Proleptic Gregorian calendar and pushed to the Parquet library as is. That works fine if the timestamp/date values in Parquet files were saved in the `CORRECTED` mode, but in the `LEGACY` mode the filter values might not match the actual values.

After the changes, the timestamp/date values of filters pushed down to the Parquet library, such as `FilterApi.eq(col1, -719162)`, are rebased according to the rebase settings. For example, if the rebase mode is `CORRECTED`, **-719162** is pushed down as is, but if the rebase mode is `LEGACY`, the number of days is rebased to **-719164**. For more context, the description of PR #28067 shows the diffs between the two calendars.
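
A minimal sketch of where these two numbers come from, assuming Spark 3.x's `RebaseDateTime` utility (in `org.apache.spark.sql.catalyst.util`) is on the classpath and a UTC time zone; the `GregorianCalendar` part reproduces the hybrid-calendar value independently:

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.catalyst.util.RebaseDateTime

// Proleptic Gregorian calendar: epoch days of the local date 0001-01-01.
val gregorianDays = Math.toIntExact(LocalDate.of(1, 1, 1).toEpochDay) // -719162

// LEGACY mode rebases the value to the hybrid Julian/Gregorian calendar.
val rebasedDays = RebaseDateTime.rebaseGregorianToJulianDays(gregorianDays) // -719164

// Cross-check with java.util.GregorianCalendar, which is hybrid by default:
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1, 0, 1) // year 1, January (month 0), day 1
val hybridDays = TimeUnit.MILLISECONDS.toDays(cal.getTimeInMillis) // -719164
```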

Why are the changes needed?

The changes fix the bug demonstrated by the following example from SPARK-36034:

```python
>>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
>>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
>>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
+----+
|date|
+----+
+----+
```

The result must have the date value `0001-01-01`.
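
For context, the two read-side settings named at the top of the description are the knobs this fix propagates to `ParquetFilters`. A minimal usage sketch of how each one is supplied (whether it takes effect for a given file depends on the rebase metadata the writer stored):

```scala
// Session-wide SQL config from the description above:
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "LEGACY")
spark.read.parquet("date_written_by_spark3_legacy")

// Or per-read, via the Parquet datasource option also named above:
spark.read
  .option("datetimeRebaseMode", "LEGACY")
  .parquet("date_written_by_spark3_legacy")
```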

Does this PR introduce any user-facing change?

In some sense, yes. Query results can be different in some cases. For the example above:

```scala
scala> spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
scala> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
scala> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show(false)
+----------+
|date      |
+----------+
|0001-01-01|
+----------+
```

How was this patch tested?

By running the modified test suite `ParquetFilterSuite`:

```
$ build/sbt "test:testOnly *ParquetV1FilterSuite"
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit b09b7f7)
(cherry picked from commit ba71172)

…quet

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit b09b7f7)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Closes apache#33375 from MaxGekk/fix-parquet-ts-filter-pushdown-3.1.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit ba71172)
Signed-off-by: Max Gekk <[email protected]>

SparkQA commented Jul 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45653/


SparkQA commented Jul 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45653/

@HyukjinKwon
Member

I made a PR to fix the R tests (#33390), but the AQE test failure seems new.

Member Author

MaxGekk commented Jul 16, 2021

I cannot reproduce the AQE failure locally:

```
$ build/sbt "sql/test:testOnly *.AdaptiveQueryExecSuite"
[info] AdaptiveQueryExecSuite:
...
[info] - multiple joins (1 second, 187 milliseconds)
[info] - multiple joins with aggregate (1 second, 225 milliseconds)
[info] - multiple joins with aggregate 2 (1 second, 298 milliseconds)
...
[info] Run completed in 1 minute, 12 seconds.
[info] Total number of tests run: 33
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 33, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

@HyukjinKwon
Member

I retriggered


SparkQA commented Jul 16, 2021

Test build #141142 has finished for PR 33387 at commit 04732c7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to branch-3.0!

HyukjinKwon pushed a commit that referenced this pull request Jul 16, 2021
…quet

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b09b7f7)
(cherry picked from commit ba71172)

Closes #33387 from MaxGekk/fix-parquet-ts-filter-pushdown-3.0.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
laflechejonathan pushed a commit to palantir/spark that referenced this pull request Sep 27, 2021
…quet (#85)

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b09b7f7)
(cherry picked from commit ba71172)

Closes apache#33387 from MaxGekk/fix-parquet-ts-filter-pushdown-3.0.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

Co-authored-by: Max Gekk <[email protected]>