
Conversation

@johanl-db
Contributor

What changes were proposed in this pull request?

This change adds a check for overflows when creating Parquet row group filters on an INT32 (byte/short/int) parquet type, to avoid incorrectly skipping row groups if the predicate value doesn't fit in an INT. This can happen if the read schema is specified as LONG, e.g. via .schema("col LONG").
While the Parquet readers don't support reading INT32 into a LONG, the overflow can lead to row groups being incorrectly skipped, bypassing the reader altogether and producing incorrect results instead of failing.

Why are the changes needed?

Reading a parquet file containing INT32 values with a read schema specified as LONG can produce incorrect results today:

Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()

will return an empty result. The correct result is either:

  • Failing the query if the parquet reader doesn't support upcasting integers to longs (all parquet readers in Spark today)
  • Returning result [0] if the parquet reader supports that upcast (no readers in Spark support it as of now, but I'm looking into adding this capability).
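
For illustration only (this is not the Spark source, and the names below are made up), a minimal Scala sketch of why the repro above can return an empty result when the predicate value is narrowed to Int while building the INT32 row group filter:

```scala
// Hypothetical sketch: narrowing a LONG predicate value to Int for an INT32 column.
val predicateValue: Long = Long.MaxValue
val narrowed: Int = predicateValue.toInt   // -1 after two's-complement truncation

// A row group filter built from the narrowed value effectively tests `a < -1`,
// so a row group whose min/max statistics are [0, 0] is incorrectly skipped.
assert(narrowed == -1)
```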

Does this PR introduce any user-facing change?

The following:

Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()

produces an (incorrect) empty result before this change. After this change, the read will fail, raising an error about the unsupported conversion from INT to LONG in the parquet reader.

How was this patch tested?

  • Added tests to ParquetFilterSuite to ensure that no row group filter is created when the predicate value overflows or when the value type isn't compatible with the parquet type
  • Added test to ParquetQuerySuite covering the correctness issue described above.

Was this patch authored or co-authored using generative AI tooling?

No

@johanl-db johanl-db force-pushed the SPARK-46092-row-group-skipping-overflow branch from ccb9e73 to a392ef5 Compare November 24, 2023 15:25
@github-actions github-actions bot added the SQL label Nov 24, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-46092] Don't push down Parquet row group filters that overflow [SPARK-46092][SQL] Don't push down Parquet row group filters that overflow Nov 27, 2023
  case ParquetBooleanType => value.isInstanceOf[JBoolean]
  case ParquetIntegerType if value.isInstanceOf[Period] => true
- case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
+ case ParquetByteType | ParquetShortType | ParquetIntegerType => value match {
Member

I think you should add a comment explaining that there are only INT32 and INT64 physical integer types in the Parquet format specification; therefore, we don't check INT8 and/or INT16.

Member

Or, to be conservative, we could even allow only an exact match of the value on Int? cc @cloud-fan and @wangyum

Contributor Author

Added a comment. I also made the condition more restrictive: value must be either a java Byte/Short/Integer/Long. This excludes other Numbers such as Float/Double/BigDecimal that are generally not safe to cast to int and that the initial change didn't block. For example, float NaN gets cast to 0 and would pass the previous check.

We can't match only on Integer because for ParquetByteType and ParquetShortType the value will typically be a Byte or a Short, respectively. Also, this allows upcasting, where a parquet integer type is read with a larger Spark integer type.
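
As a quick illustration of the NaN case mentioned above (not code from the PR):

```scala
// Hypothetical sketch: why Float/Double values are excluded from the new check.
// Narrowing NaN to Int silently yields 0, so a Float.NaN predicate value would
// have passed the old isInstanceOf[Number] test and produced a bogus row group
// filter comparing against 0.
val nanAsInt: Int = Float.NaN.toInt
assert(nanAsInt == 0)
```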

@dongjoon-hyun dongjoon-hyun (Member) left a comment

Hi, @johanl-db. Do you happen to know what causes this? I'm curious whether this is an Apache Spark 3.5.0-only issue or not.

[Screenshot 2023-11-28 at 9:30:59 AM]

cc @sunchao

@sunchao sunchao (Member) left a comment

It's unfortunate that the check for Spark type versus Parquet type happens in ParquetVectorUpdaterFactory, which is after predicate pushdown for row groups. Will a similar issue happen for float to double in certain cases?

@johanl-db
Contributor Author

johanl-db commented Nov 29, 2023

It's unfortunate that the check for Spark type versus Parquet type happens in ParquetVectorUpdaterFactory, which is after predicate pushdown for row groups. Will a similar issue happen for float to double in certain cases?

There's no issue with float to double because we were already strict when deciding whether to build a row group filter: we only accept float values for float and double values for double, so no overflow is possible.

Hi, @johanl-db. Do you happen to know what causes this? I'm curious whether this is an Apache Spark 3.5.0-only issue or not.

When creating row group filters, we accept any value and don't check whether the value actually fits in the target type. If the read schema is LONG, for example, and the parquet type is INT32, you could pass a value that would overflow before this change. We have stricter type checks in the Parquet readers themselves, but by that time it's too late, as the row group may already have been incorrectly skipped and that check won't trigger.

Looking at https://github.com/apache/spark/blame/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L478, this goes back at least to 3.0, and it seems the check was even less strict in earlier versions, so I'd say this behavior has always been there.

// Byte/Short/Int are all stored as INT32 in Parquet so filters are built using type Int.
// We don't create a filter if the value would overflow.
case _: JByte | _: JShort | _: Integer => true
case v: JLong => v.longValue() >= Int.MinValue && v.longValue() <= Int.MaxValue
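
A rough sketch of the behaviour of the JLong arm above (the helper name is made up, not an API from the PR):

```scala
// Hypothetical helper mirroring the JLong case: a Long predicate value only
// yields an INT32 row group filter when it fits in the Int range.
def fitsInInt(v: Long): Boolean = v >= Int.MinValue && v <= Int.MaxValue

assert(fitsInInt(42L))              // a filter can be built safely
assert(!fitsInInt(Long.MaxValue))   // no row group filter is created
```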
Contributor

For simplicity, how about forbidding the Long value directly?

Contributor Author

There are some use cases where this can be useful, assuming the Parquet file contains type INT32:

  • The user specifies a read schema using schema("col LONG").
  • The column in the table schema has type LONG.
    I don't know whether the latter can happen today but it will be possible in the near future for Delta tables as I'm looking into supporting type widening.

We could skip creating row group filters in that case but the logic is simple enough and it's going to be beneficial in the cases above.

Contributor

Thank you for the explanation. I understand the two use cases.
But users may not notice this detail and may be confused by the behavior. I think we could delay the support until Delta tables support filters on long values.
Of course, skipping row group filter creation if the value exceeds the int range looks good too.

Contributor Author

The plan would be to support this in the Delta version that will build on top of the next Spark version, so it would be good to have it here already.

Contributor

Spark has already supported the long type for a few releases, so we can't drop it now or we'd introduce perf regressions. I'm +1 for the change here.

@johanl-db johanl-db requested a review from beliefer December 1, 2023 13:51
@dongjoon-hyun
Member

Merged to master for Apache Spark 4.0.0.

Could you make backporting PRs to make sure that all tests pass in the release branches, @johanl-db?

Thank you, @johanl-db, @wangyum, @HyukjinKwon, @sunchao, @beliefer, @cloud-fan.

johanl-db added a commit to johanl-db/spark that referenced this pull request Dec 4, 2023

…rflow

Closes apache#44006 from johanl-db/SPARK-46092-row-group-skipping-overflow.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
johanl-db added a commit to johanl-db/spark that referenced this pull request Dec 4, 2023

…rflow

Closes apache#44006 from johanl-db/SPARK-46092-row-group-skipping-overflow.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
johanl-db added a commit to johanl-db/spark that referenced this pull request Dec 4, 2023

…rflow

Closes apache#44006 from johanl-db/SPARK-46092-row-group-skipping-overflow.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
johanl-db added a commit to johanl-db/spark that referenced this pull request Dec 4, 2023

…rflow

Closes apache#44006 from johanl-db/SPARK-46092-row-group-skipping-overflow.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@johanl-db
Contributor Author

johanl-db commented Dec 4, 2023

@dongjoon-hyun I created backport PRs for the following branches:

  • branch-3.5: #44154
  • branch-3.4: #44155
  • branch-3.3: #44156

Any other branch I should target?

dongjoon-hyun pushed a commit that referenced this pull request Dec 4, 2023

…t overflow

This is a cherry-pick from #44006 to spark 3.5

Closes #44154 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.5.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Dec 4, 2023

…t overflow

This is a cherry-pick from #44006 to spark 3.4

Closes #44155 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.4.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Dec 4, 2023

…t overflow

This is a cherry-pick from #44006 to spark 3.3

Closes #44156 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.3.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

Thank you, @johanl-db. Those branches are enough, and all PRs are merged now.

asl3 pushed a commit to asl3/spark that referenced this pull request Dec 5, 2023

…rflow

Closes apache#44006 from johanl-db/SPARK-46092-row-group-skipping-overflow.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023

…rflow

Closes apache#44006 from johanl-db/SPARK-46092-row-group-skipping-overflow.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024

…t overflow

This is a cherry-pick from apache#44006 to spark 3.4

Closes apache#44155 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.4.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025

…t overflow (apache#361)

This is a cherry-pick from apache#44006 to spark 3.5

Closes apache#44154 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.5.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Johan Lasperas <[email protected]>