[SPARK-46092][SQL] Don't push down Parquet row group filters that overflow #44006
```diff
@@ -17,7 +17,7 @@
 package org.apache.spark.sql.execution.datasources.parquet

-import java.lang.{Boolean => JBoolean, Double => JDouble, Float => JFloat, Long => JLong}
+import java.lang.{Boolean => JBoolean, Byte => JByte, Double => JDouble, Float => JFloat, Long => JLong, Short => JShort}
 import java.math.{BigDecimal => JBigDecimal}
 import java.nio.charset.StandardCharsets.UTF_8
 import java.sql.{Date, Timestamp}
```
```diff
@@ -613,7 +613,13 @@ class ParquetFilters(
     value == null || (nameToParquetField(name).fieldType match {
       case ParquetBooleanType => value.isInstanceOf[JBoolean]
       case ParquetIntegerType if value.isInstanceOf[Period] => true
-      case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
+      case ParquetByteType | ParquetShortType | ParquetIntegerType => value match {
+        // Byte/Short/Int are all stored as INT32 in Parquet so filters are built using type Int.
+        // We don't create a filter if the value would overflow.
+        case _: JByte | _: JShort | _: Integer => true
+        case v: JLong => v.longValue() >= Int.MinValue && v.longValue() <= Int.MaxValue
```
Contributor

For simplicity, how about forbidding the Long value directly?
Contributor (Author)

There are some use cases where this can be useful, assuming the Parquet file contains type INT32:

We could skip creating row group filters in that case, but the logic is simple enough and it's going to be beneficial in the cases above.
Contributor

Thank you for the explanation. I understand the two use cases.
Contributor (Author)

The plan would be to support this in the Delta version that will build on top of the next Spark version, so it would be good to have it here already.
Contributor

Since Spark has already supported the long type for a few releases, we can't drop it now without causing perf regressions. I'm +1 for the change here.
```diff
+        case _ => false
+      }
       case ParquetLongType => value.isInstanceOf[JLong] || value.isInstanceOf[Duration]
       case ParquetFloatType => value.isInstanceOf[JFloat]
       case ParquetDoubleType => value.isInstanceOf[JDouble]
```
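The guard in the new match arm can be sketched in plain Java. This standalone helper (`fitsInt32` is a hypothetical name, not the Spark API) mirrors the logic: Byte/Short/Integer always fit an INT32 column, a Long is accepted only when it fits in an int, and everything else is rejected:

```java
public class FilterValueCheck {
    // Sketch of the compatibility check for Parquet INT32-backed columns
    // (Byte/Short/Int). A Long literal is accepted only when it fits in an
    // int; otherwise the pushed-down filter value would overflow.
    static boolean fitsInt32(Object value) {
        if (value instanceof Byte || value instanceof Short || value instanceof Integer) {
            return true;
        }
        if (value instanceof Long) {
            long v = (Long) value;
            return v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE;
        }
        // Float/Double/BigDecimal etc. are rejected: narrowing them to int is lossy.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(fitsInt32(42));             // true
        System.out.println(fitsInt32(5_000_000_000L)); // false: overflows int
        System.out.println(fitsInt32(Float.NaN));      // false
    }
}
```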
I think you should add a comment explaining that there are only INT32 and INT64 physical types in the Parquet format specification; therefore, we don't check INT8 and/or INT16.
Or, we could even allow only an exact match of the value on Int to be conservative? cc @cloud-fan and @wangyum
Author

Added a comment. I also made the condition more restrictive: the value must be a Java Byte/Short/Integer/Long. This excludes other Numbers such as Float/Double/BigDecimal that are generally not safe to cast to int and that the initial change didn't block. For example, float NaN gets cast to 0 and would pass the previous check.

We can't match only on Integer because for ParquetByteType and ParquetShortType the value will typically be a Byte or a Short, respectively. This also allows upcasting, where a Parquet integer type is read with a larger Spark integer type.
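The NaN behavior described above is easy to verify. This small standalone snippet (not part of the PR) shows why a plain Number-based check was unsafe on the JVM: narrowing conversions silently map NaN to 0 and wrap out-of-range longs:

```java
public class NarrowingDemo {
    public static void main(String[] args) {
        // JVM narrowing conversion: NaN becomes 0, so a filter built from
        // Float.NaN would silently compare the INT32 column against 0.
        float nan = Float.NaN;
        int nanAsInt = (int) nan;
        System.out.println(nanAsInt); // prints 0

        // Out-of-range longs wrap around when narrowed to int, which is why
        // the new check requires the long to fit in [Int.MinValue, Int.MaxValue].
        long big = 5_000_000_000L;
        System.out.println((int) big); // prints 705032704
    }
}
```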