[SPARK-27699][SQL] Partially push down disjunctive predicates in Parquet/ORC #24598
```diff
@@ -527,11 +527,22 @@ private[parquet] class ParquetFilters(
         }

       case sources.Or(lhs, rhs) =>
+        // The Or predicate is convertible when both of its children can be pushed down.
+        // That is to say, if one/both of the children can be partially pushed down, the Or
+        // predicate can be partially pushed down as well.
+        //
+        // Here is an example used to explain the reason.
+        // Let's say we have
+        // (a1 AND a2) OR (b1 AND b2),
+        // a1 and b1 is convertible, while a2 and b2 is not.
+        // The predicate can be converted as
+        // (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
+        // As per the logical in And predicate, we can push down (a1 OR b1).
         for {
           lhsFilter <-
-            createFilterHelper(nameToParquetField, lhs, canPartialPushDownConjuncts = false)
+            createFilterHelper(nameToParquetField, lhs, canPartialPushDownConjuncts)
           rhsFilter <-
-            createFilterHelper(nameToParquetField, rhs, canPartialPushDownConjuncts = false)
+            createFilterHelper(nameToParquetField, rhs, canPartialPushDownConjuncts)
         } yield FilterApi.or(lhsFilter, rhsFilter)

       case sources.Not(pred) =>
```
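The comment block added above carries the key soundness argument: under `Or`, each child may itself be only partially convertible, and the pushed filter is then a superset of the original predicate, which Spark re-checks after the scan. A simplified, self-contained sketch of that rule (this is not the actual `ParquetFilters` code; `convertibleLeaf` is a hypothetical stand-in for "the data source can evaluate this leaf filter"):

```scala
import org.apache.spark.sql.sources._

// A simplified sketch of the partial push-down rule described in the diff above.
// NOT Spark's ParquetFilters code; `convertibleLeaf` is a hypothetical stand-in
// for "the data source can evaluate this leaf filter".
object PartialPushDownSketch {
  def convert(
      filter: Filter,
      canPartialPushDown: Boolean,
      convertibleLeaf: Filter => Boolean): Option[Filter] = filter match {
    case And(lhs, rhs) =>
      val l = convert(lhs, canPartialPushDown, convertibleLeaf)
      val r = convert(rhs, canPartialPushDown, convertibleLeaf)
      (l, r) match {
        case (Some(lf), Some(rf)) => Some(And(lf, rf))
        // Dropping one conjunct is safe: the pushed filter becomes a superset of
        // the original predicate, and Spark re-evaluates the full predicate later.
        case (Some(lf), None) if canPartialPushDown => Some(lf)
        case (None, Some(rf)) if canPartialPushDown => Some(rf)
        case _ => None
      }
    case Or(lhs, rhs) =>
      // Both children must convert, but each child may itself be partial:
      // (a1 AND a2) OR (b1 AND b2) can be pushed as (a1 OR b1), one conjunct of
      // the CNF expansion shown in the code comment above.
      for {
        lf <- convert(lhs, canPartialPushDown, convertibleLeaf)
        rf <- convert(rhs, canPartialPushDown, convertibleLeaf)
      } yield Or(lf, rf)
    case Not(child) =>
      // A partially converted child is a superset of the original, and negating a
      // superset could wrongly drop rows, so partial push-down is disabled under Not.
      convert(child, canPartialPushDown = false, convertibleLeaf = convertibleLeaf).map(Not(_))
    case leaf =>
      if (convertibleLeaf(leaf)) Some(leaf) else None
  }
}
```

The invariant throughout is that whatever gets pushed is implied by the original predicate, so pushing less than the full predicate can only let extra rows through, never drop rows that should survive.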
```diff
@@ -44,27 +44,35 @@ object DataSourceV2Strategy extends Strategy with PredicateHelper {
       filters: Seq[Expression]): (Seq[Expression], Seq[Expression]) = {
     scanBuilder match {
       case r: SupportsPushDownFilters =>
-        // A map from translated data source filters to original catalyst filter expressions.
+        // A map from translated data source leaf node filters to original catalyst filter
+        // expressions. For a `And`/`Or` predicate, it is possible that the predicate is partially
+        // pushed down. This map can be used to construct a catalyst filter expression from the
+        // input filter, or a superset(partial push down filter) of the input filter.
         val translatedFilterToExpr = mutable.HashMap.empty[sources.Filter, Expression]
+        val translatedFilters = mutable.ArrayBuffer.empty[sources.Filter]
         // Catalyst filter expression that can't be translated to data source filters.
         val untranslatableExprs = mutable.ArrayBuffer.empty[Expression]

         for (filterExpr <- filters) {
-          val translated = DataSourceStrategy.translateFilter(filterExpr)
-          if (translated.isDefined) {
-            translatedFilterToExpr(translated.get) = filterExpr
-          } else {
+          val translated =
+            DataSourceStrategy.translateFilterWithMapping(filterExpr, Some(translatedFilterToExpr))
+          if (translated.isEmpty) {
             untranslatableExprs += filterExpr
+          } else {
+            translatedFilters += translated.get
           }
         }

         // Data source filters that need to be evaluated again after scanning. which means
         // the data source cannot guarantee the rows returned can pass these filters.
         // As a result we must return it so Spark can plan an extra filter operator.
-        val postScanFilters = r.pushFilters(translatedFilterToExpr.keys.toArray)
-          .map(translatedFilterToExpr)
+        val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
+          DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
+        }
         // The filters which are marked as pushed to this data source
-        val pushedFilters = r.pushedFilters().map(translatedFilterToExpr)
+        val pushedFilters = r.pushedFilters().map { filter =>
+          DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
+        }
         (pushedFilters, untranslatableExprs ++ postScanFilters)

       case _ => (Nil, filters)
```

Review thread on the `translateFilterWithMapping` change:

- Author: With partial filter push down in the `Or` operator, the result of `pushedFilters()` might not exist in the mapping `translatedFilterToExpr`. To fix it, this PR changes `translatedFilterToExpr` so that it maps translated leaf `sources.Filter`s to the original catalyst expressions, and later rebuilds the whole expression from a pushed filter using that mapping.
- Reviewer: Yep. Actually, when I tested your PR before, I also noticed that. Thank you for making this PR work, @gengliangwang!
- Reviewer: BTW, it would be great to add a real […]
- Author: Thanks for the testing. Appreciate it!
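The diff calls `DataSourceStrategy.rebuildExpressionFromFilter`, which is not shown in this excerpt. Assuming `translatedFilterToExpr` maps every translated leaf filter back to its original catalyst expression, a plausible sketch of what such a rebuild has to do is:

```scala
import scala.collection.mutable

import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.sources

// Sketch only: the real helper lives in DataSourceStrategy and is not shown in
// this diff. It walks a pushed data source filter and maps every leaf back to
// the catalyst expression it was translated from, recombining the pieces with
// catalyst And/Or/Not on the way back up.
object FilterRebuildSketch {
  def rebuildExpressionFromFilter(
      filter: sources.Filter,
      translatedFilterToExpr: mutable.HashMap[sources.Filter, Expression]): Expression = {
    filter match {
      case sources.And(left, right) =>
        expressions.And(
          rebuildExpressionFromFilter(left, translatedFilterToExpr),
          rebuildExpressionFromFilter(right, translatedFilterToExpr))
      case sources.Or(left, right) =>
        expressions.Or(
          rebuildExpressionFromFilter(left, translatedFilterToExpr),
          rebuildExpressionFromFilter(right, translatedFilterToExpr))
      case sources.Not(pred) =>
        expressions.Not(rebuildExpressionFromFilter(pred, translatedFilterToExpr))
      case leaf =>
        // Every leaf reaching this point was recorded while the filter was being
        // translated, so the lookup is expected to succeed.
        translatedFilterToExpr(leaf)
    }
  }
}
```

With leaf-level keys, a partially pushed filter such as `sources.Or(sources.LessThan("id", 2), sources.GreaterThan("id", 10))` can still be mapped back to a catalyst expression even though that exact composite filter was never a key in the map.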
```diff
@@ -26,7 +26,8 @@ import org.apache.spark.{AccumulatorSuite, SparkException}
 import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
 import org.apache.spark.sql.catalyst.util.StringUtils
 import org.apache.spark.sql.execution.aggregate.{HashAggregateExec, SortAggregateExec}
-import org.apache.spark.sql.execution.datasources.FilePartition
+import org.apache.spark.sql.execution.datasources.v2.BatchScanExec
+import org.apache.spark.sql.execution.datasources.v2.orc.OrcScan
 import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, CartesianProductExec, SortMergeJoinExec}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.internal.SQLConf
@@ -2978,6 +2979,31 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }

+  test("SPARK-27699 Validate pushed down filters") {
+    def checkPushedFilters(df: DataFrame, filters: Array[sources.Filter]): Unit = {
+      val scan = df.queryExecution.sparkPlan
+        .find(_.isInstanceOf[BatchScanExec]).get.asInstanceOf[BatchScanExec]
+        .scan
+      assert(scan.isInstanceOf[OrcScan])
+      assert(scan.asInstanceOf[OrcScan].pushedFilters === filters)
+    }
+    withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "") {
+      withTempPath { dir =>
+        spark.range(10).map(i => (i, i.toString)).toDF("id", "s").write.orc(dir.getCanonicalPath)
+        val df = spark.read.orc(dir.getCanonicalPath)
+        checkPushedFilters(
+          df.where(('id < 2 and 's.contains("foo")) or ('id > 10 and 's.contains("bar"))),
+          Array(sources.Or(sources.LessThan("id", 2), sources.GreaterThan("id", 10))))
+        checkPushedFilters(
+          df.where('s.contains("foo") or ('id > 10 and 's.contains("bar"))),
+          Array.empty)
+        checkPushedFilters(
+          df.where('id < 2 and not('id > 10 and 's.contains("bar"))),
+          Array(sources.IsNotNull("id"), sources.LessThan("id", 2)))
+      }
+    }
+  }
+
   test("SPARK-26709: OptimizeMetadataOnlyQuery does not handle empty records correctly") {
     Seq(true, false).foreach { enableOptimizeMetadataOnlyQuery =>
       withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> enableOptimizeMetadataOnlyQuery.toString) {
```

Review thread on the new test:

- Reviewer: It would be better to have Parquet testing because it's the default format in Apache Spark. But yes, I got it: you want to use a DSv2 way for testing.
- Author: We can have a follow-up after Parquet V2 is merged.
General review discussion:

- Reviewer: Could you add more explanation of why this PR needs `translateFilterWithMapping` and `translateLeafNodeFilter`? Is this refactoring inevitable?
- Author: This is inevitable. See my comments below or the ending part of the PR description.
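To make that answer concrete: the split separates the unchanged leaf translation from a recursive wrapper that records each translated leaf, which is exactly what `rebuildExpressionFromFilter` later needs to invert a partially pushed filter. A hedged sketch of that shape (the real helpers live in `DataSourceStrategy`; the leaf translation body is elided here because the PR does not change it):

```scala
import scala.collection.mutable

import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.sources

// Sketch only: a plausible shape for the two helpers discussed above, not the
// code from DataSourceStrategy itself.
object TranslateWithMappingSketch {
  // Stand-in for the pre-existing leaf translation (comparisons, IsNull,
  // StartsWith, ...). Its body is elided because the PR does not change it.
  def translateLeafNodeFilter(predicate: Expression): Option[sources.Filter] = ???

  // Recursive translation that additionally records every translated leaf in the
  // map, so that rebuildExpressionFromFilter can later invert a (partially)
  // pushed-down filter.
  def translateFilterWithMapping(
      predicate: Expression,
      translatedFilterToExpr: Option[mutable.HashMap[sources.Filter, Expression]])
    : Option[sources.Filter] = predicate match {
    case expressions.And(left, right) =>
      for {
        l <- translateFilterWithMapping(left, translatedFilterToExpr)
        r <- translateFilterWithMapping(right, translatedFilterToExpr)
      } yield sources.And(l, r)
    case expressions.Or(left, right) =>
      for {
        l <- translateFilterWithMapping(left, translatedFilterToExpr)
        r <- translateFilterWithMapping(right, translatedFilterToExpr)
      } yield sources.Or(l, r)
    case expressions.Not(child) =>
      translateFilterWithMapping(child, translatedFilterToExpr).map(sources.Not(_))
    case other =>
      // Translate the leaf and remember which catalyst expression it came from.
      val translated = translateLeafNodeFilter(other)
      for {
        leaf <- translated
        mapping <- translatedFilterToExpr
      } mapping(leaf) = other
      translated
  }
}
```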