[SPARK-19712][SQL] Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc. #23750
@@ -0,0 +1,197 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule
/**
 * This rule is a variant of [[PushDownPredicate]] which can handle
 * pushing down Left Semi and Left Anti joins below the following operators.
 *  1) Project
 *  2) Window
 *  3) Union
 *  4) Aggregate
 *  5) Other permissible unary operators. Please see [[PushDownPredicate.canPushThrough]].
 */
object PushDownLeftSemiAntiJoin extends Rule[LogicalPlan] with PredicateHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // LeftSemi/LeftAnti over Project
    case Join(p @ Project(pList, gChild), rightOp, LeftSemiOrAnti(joinType), joinCond, hint)
        if pList.forall(_.deterministic) &&
          !pList.exists(ScalarSubquery.hasCorrelatedScalarSubquery) &&
          canPushThroughCondition(Seq(gChild), joinCond, rightOp) =>
      if (joinCond.isEmpty) {
        // No join condition, just push down the Join below Project
        p.copy(child = Join(gChild, rightOp, joinType, joinCond, hint))
      } else {
        val aliasMap = PushDownPredicate.getAliasMap(p)
        val newJoinCond = if (aliasMap.nonEmpty) {
          Option(replaceAlias(joinCond.get, aliasMap))
        } else {
          joinCond
        }
        p.copy(child = Join(gChild, rightOp, joinType, newJoinCond, hint))
      }
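As a rough illustration of what this branch does, here is a minimal Python model (not Spark code; the relations and column names `t1`, `t2`, `a`, `c`, and the alias `x` are invented): a left-semi join can be evaluated below a deterministic projection as long as aliases in the condition are rewritten back to the underlying attributes.

```python
# Toy model: relations are lists of dicts; all names are invented.

def semi_join(left, right, cond):
    # Keep each left row that matches at least one right row.
    return [l for l in left if any(cond(l, r) for r in right)]

def project(rows, exprs):
    # exprs maps an output name to a deterministic function of the input row.
    return [{name: f(row) for name, f in exprs.items()} for row in rows]

t1 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
t2 = [{"c": 1}, {"c": 3}]

# Alias "x" is t1.a renamed by the projection.
exprs = {"x": lambda row: row["a"]}

# Join(Project(t1), t2) on x = c ...
plan_a = semi_join(project(t1, exprs), t2, lambda l, r: l["x"] == r["c"])

# ... equals Project(Join(t1, t2)) after rewriting x back to a.
plan_b = project(semi_join(t1, t2, lambda l, r: l["a"] == r["c"]), exprs)

assert plan_a == plan_b  # both: [{"x": 1}, {"x": 3}]
```

The alias rewriting mirrors what `replaceAlias(joinCond.get, aliasMap)` does in the Scala code above.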

    // LeftSemi/LeftAnti over Aggregate
    case join @ Join(agg: Aggregate, rightOp, LeftSemiOrAnti(joinType), joinCond, hint)
        if agg.aggregateExpressions.forall(_.deterministic) && agg.groupingExpressions.nonEmpty &&
          !agg.aggregateExpressions.exists(ScalarSubquery.hasCorrelatedScalarSubquery) =>
      if (joinCond.isEmpty) {
        // No join condition, just push down Join below Aggregate
        agg.copy(child = Join(agg.child, rightOp, joinType, joinCond, hint))
Contributor: cc @maryannxue shall we keep the join hint when we push a join down through an operator?
      } else {
        val aliasMap = PushDownPredicate.getAliasMap(agg)

        // For each join condition, expand the alias and check if the condition can be evaluated
        // using attributes produced by the aggregate operator's child operator.
        val (pushDown, stayUp) = splitConjunctivePredicates(joinCond.get).partition { cond =>
          val replaced = replaceAlias(cond, aliasMap)
          cond.references.nonEmpty &&
|
Member: I left the same comment in the previous review, but I still have a question here: is it OK to push down non-deterministic expressions?

Contributor (Author): @maropu To my knowledge, join conditions cannot contain non-deterministic expressions; this is ensured in checkAnalysis.

Member: Oh, I see. Thanks!

Contributor: Why can't we push down constant join conditions?

Contributor (Author): @cloud-fan Hmm, that's how the original logic was. Come to think of it, wouldn't we have close to a 0% chance of ever seeing a join condition with constants only? :-)
            replaced.references.subsetOf(agg.child.outputSet ++ rightOp.outputSet)
        }

        // Check if the remaining predicates do not contain columns from the right
        // hand side of the join. Since the remaining predicates will be kept
        // as a filter over aggregate, this check is necessary after the left semi
        // or left anti join is moved below aggregate. The reason is, for this kind
        // of join, we only output from the left leg of the join.
        val rightOpColumns = AttributeSet(stayUp.toSet).intersect(rightOp.outputSet)

        if (pushDown.nonEmpty && rightOpColumns.isEmpty) {
          val pushDownPredicate = pushDown.reduce(And)
          val replaced = replaceAlias(pushDownPredicate, aliasMap)
          val newAgg = agg.copy(child = Join(agg.child, rightOp, joinType, Option(replaced), hint))
          // If there is no more filter to stay up, just return the Aggregate over Join.
          // Otherwise, create "Filter(stayUp) <- Aggregate <- Join(pushDownPredicate)".
|
Contributor: Left-anti join outputs records that do NOT satisfy the join condition, so pushing down only part of the condition changes which records get excluded. This seems problematic: previously we got the result satisfying the full condition.

Contributor: It seems that we can't push down a partial left-anti join condition.

Contributor (Author): @cloud-fan Thank you, I understand it better now. Let me test this out a bit and plan a follow-up.
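The concern can be seen with a tiny Python model (invented data, not Spark code): for a left-anti join, evaluating only one conjunct of the condition excludes a different set of rows than the full condition would, so splitting the condition and pushing part of it down is unsound.

```python
# Toy model: relations are lists of dicts; all names are invented.

def anti_join(left, right, cond):
    # Keep each left row that matches NO right row.
    return [l for l in left if not any(cond(l, r) for r in right)]

t1 = [{"a": 1}, {"a": 2}]
t2 = [{"b": 1, "c": 99}]

# Full condition: a = b AND c = 0. No t2 row has c = 0, so nothing
# matches and the anti join keeps both rows of t1.
full = anti_join(t1, t2, lambda l, r: l["a"] == r["b"] and r["c"] == 0)
assert full == [{"a": 1}, {"a": 2}]

# Evaluating only the a = b conjunct (as a pushed-down partial anti join
# would) incorrectly drops the a = 1 row.
partial = anti_join(t1, t2, lambda l, r: l["a"] == r["b"])
assert partial == [{"a": 2}]
assert partial != full  # splitting an anti-join condition changes the result
```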
          if (stayUp.isEmpty) newAgg else Filter(stayUp.reduce(And), newAgg)
        } else {
          // The join condition is not a subset of the Aggregate's GROUP BY columns,
          // no push down.
          join
        }
      }

    // LeftSemi/LeftAnti over Window
|
Member: ditto (same question about non-deterministic expressions).
    case join @ Join(w: Window, rightOp, LeftSemiOrAnti(joinType), joinCond, hint)
        if w.partitionSpec.forall(_.isInstanceOf[AttributeReference]) =>
|
Contributor: Can the window expressions here contain, e.g., correlated subqueries?

Contributor (Author): @cloud-fan No, window expressions can't contain correlated subqueries.
      if (joinCond.isEmpty) {
        // No join condition, just push down Join below Window
        w.copy(child = Join(w.child, rightOp, joinType, joinCond, hint))
      } else {
        val partitionAttrs = AttributeSet(w.partitionSpec.flatMap(_.references)) ++
          rightOp.outputSet

        val (pushDown, stayUp) = splitConjunctivePredicates(joinCond.get).partition { cond =>
|
Member: ditto: is it OK to push down non-deterministic expressions?

Contributor (Author): @maropu Please see my answer above.
          cond.references.subsetOf(partitionAttrs)
        }

        // Check if the remaining predicates do not contain columns from the right
        // hand side of the join. Since the remaining predicates will be kept
        // as a filter over window, this check is necessary after the left semi
        // or left anti join is moved below window. The reason is, for this kind
        // of join, we only output from the left leg of the join.
        val rightOpColumns = AttributeSet(stayUp.toSet).intersect(rightOp.outputSet)

        if (pushDown.nonEmpty && rightOpColumns.isEmpty) {
          val predicate = pushDown.reduce(And)
          val newPlan = w.copy(child = Join(w.child, rightOp, joinType, Option(predicate), hint))
          if (stayUp.isEmpty) newPlan else Filter(stayUp.reduce(And), newPlan)
|
Contributor: @dilipbiswal does this hold for left anti joins? If a predicate is part of the condition, then it means matching rows should be filtered out, not retained, right?

Contributor (Author): @hvanhovell Thanks for reviewing. Can you please help illustrate the problem with an example? If the join was in filter form (in a subquery expression), we do push it down, right? We don't distinguish between semi and anti joins there.
        } else {
          // The join condition is not a subset of the Window's PARTITION BY clause,
          // no push down.
          join
        }
      }
    // LeftSemi/LeftAnti over Union
    case join @ Join(union: Union, rightOp, LeftSemiOrAnti(joinType), joinCond, hint)
        if canPushThroughCondition(union.children, joinCond, rightOp) =>
      if (joinCond.isEmpty) {
        // Push down the Join below Union
        val newGrandChildren = union.children.map { Join(_, rightOp, joinType, joinCond, hint) }
        union.withNewChildren(newGrandChildren)
      } else {
        val output = union.output
        val newGrandChildren = union.children.map { grandchild =>
          val newCond = joinCond.get transform {
            case e if output.exists(_.semanticEquals(e)) =>
              grandchild.output(output.indexWhere(_.semanticEquals(e)))
          }
          assert(newCond.references.subsetOf(grandchild.outputSet ++ rightOp.outputSet))
          Join(grandchild, rightOp, joinType, Option(newCond), hint)
        }
        union.withNewChildren(newGrandChildren)
      }
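The Union branch distributes the join over the union's children, remapping the condition's attributes to each child's output. A small Python model of the underlying equivalence (invented data, not Spark code):

```python
# Toy model: relations are lists of dicts; all names are invented.

def semi_join(left, right, cond):
    # Keep each left row that matches at least one right row.
    return [l for l in left if any(cond(l, r) for r in right)]

child1 = [{"a": 1}, {"a": 2}]
child2 = [{"a": 3}, {"a": 4}]
right = [{"b": 2}, {"b": 3}]
cond = lambda l, r: l["a"] == r["b"]

# Join(Union(child1, child2), right) ...
joined_union = semi_join(child1 + child2, right, cond)

# ... equals Union(Join(child1, right), Join(child2, right)).
union_joined = semi_join(child1, right, cond) + semi_join(child2, right, cond)

assert joined_union == union_joined  # both: [{"a": 2}, {"a": 3}]
```

In the real rule, the `transform` over `joinCond.get` plays the role of pointing each condition at the corresponding child's output attributes.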

    // LeftSemi/LeftAnti over UnaryNode
    case join @ Join(u: UnaryNode, rightOp, LeftSemiOrAnti(joinType), joinCond, hint)
        if PushDownPredicate.canPushThrough(u) && u.expressions.forall(_.deterministic) =>
      pushDownJoin(join, u.child) { joinCond =>
        u.withNewChildren(Seq(Join(u.child, rightOp, joinType, joinCond, hint)))
      }
  }
  /**
   * Check if we can safely push a join through a project or union by making sure that attributes
   * referred to in the join condition do not contain the same attributes as the plan they are
   * moved into. This can happen when both sides of the join refer to the same source (self join).
   * This function makes sure that the join condition refers to attributes that are not ambiguous
   * (i.e. not present in both legs of the join), or else the resultant plan will be invalid.
   */
  private def canPushThroughCondition(
      plans: Seq[LogicalPlan],
      condition: Option[Expression],
      rightOp: LogicalPlan): Boolean = {
    val attributes = AttributeSet(plans.flatMap(_.output))
    if (condition.isDefined) {
      val matched = condition.get.references.intersect(rightOp.outputSet).intersect(attributes)
|
Contributor: should this be ...?

Contributor: nvm, the self-join is problematic only if we can't rewrite the join condition correctly.

Contributor: We should mention it in the method doc, so that other reviewers won't get confused.

Contributor (Author): @cloud-fan I had this in the function prologue. Did you want it improved?
      matched.isEmpty
    } else {
      true
    }
  }

  private def pushDownJoin(
      join: Join,
      grandchild: LogicalPlan)(insertJoin: Option[Expression] => LogicalPlan): LogicalPlan = {
    if (join.condition.isEmpty) {
      insertJoin(None)
    } else {
      val (pushDown, stayUp) = splitConjunctivePredicates(join.condition.get)
        .partition { _.references.subsetOf(grandchild.outputSet ++ join.right.outputSet) }

      val rightOpColumns = AttributeSet(stayUp.toSet).intersect(join.right.outputSet)
      if (pushDown.nonEmpty && rightOpColumns.isEmpty) {
        val newChild = insertJoin(Option(pushDown.reduceLeft(And)))
        if (stayUp.nonEmpty) {
          Filter(stayUp.reduceLeft(And), newChild)
|
Contributor: wait, can we safely do this? What if the ...

Contributor: I think we need to create a join instead of a Filter here, if ...

Contributor: or we shouldn't do the pushdown if ...

Contributor (Author): @cloud-fan You are right, I missed this case.
        } else {
          newChild
        }
      } else {
        join
      }
    }
  }
}
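To see why the Aggregate branch only pushes conditions expressible over the grouping columns, here is a small Python model (invented names, not Spark code): when the semi-join condition references only the grouping column, the join removes whole groups at a time, so aggregating before or after the join gives the same result.

```python
# Toy model: relations are lists of dicts; all names are invented.

def semi_join(left, right, cond):
    # Keep each left row that matches at least one right row.
    return [l for l in left if any(cond(l, r) for r in right)]

def group_sum(rows, key, val):
    # GROUP BY `key` with SUM(`val`): a stand-in for Spark's Aggregate.
    out = {}
    for row in rows:
        out[row[key]] = out.get(row[key], 0) + row[val]
    return [{key: k, "sum": s} for k, s in out.items()]

t1 = [{"g": 1, "v": 10}, {"g": 1, "v": 5}, {"g": 2, "v": 7}]
t2 = [{"g": 1}]
cond = lambda l, r: l["g"] == r["g"]

# Join(Aggregate(t1), t2) on the grouping column g ...
above = semi_join(group_sum(t1, "g", "v"), t2, cond)

# ... equals Aggregate(Join(t1, t2)): the semi join drops group g = 2
# wholesale, leaving the surviving group's sum unchanged.
below = group_sum(semi_join(t1, t2, cond), "g", "v")

assert above == below  # both: [{"g": 1, "sum": 15}]
```

If the condition referenced a non-grouping column instead, the join below the aggregate could remove individual rows within a group and change the aggregated value, which is why the rule falls back to no pushdown in that case.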
|
|
Contributor: Just for curiosity, does a left anti/semi join always have a condition?

Contributor (Author): @cloud-fan Ha, I had the same question a couple of days back, so I quickly tried it. We end up getting all the rows from t1 (if I remember correctly).
Contributor: For left-anti join, it returns no result. That makes me think we should always push down the join if the condition is empty: for left-semi join it's just a no-op, and for left-anti join it helps a lot. You already did it in the rule, except https://github.com/apache/spark/pull/23750/files#diff-44d3a3f876bcf811fdbf71fce1f7072aR192

A new optimizer rule could be: turn a left-semi join into its left child if the join condition is empty, and turn a left-anti join into an empty relation if the join condition is empty.
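The proposed rewrite can be modeled in a few lines of Python (toy relations; note the rewrite as stated assumes a non-empty right side, which is what the discussed queries exercised — with an empty right side the results flip):

```python
# Toy model: relations are lists of dicts; all names are invented.

def semi_join(left, right, cond=None):
    match = cond if cond is not None else (lambda l, r: True)
    return [l for l in left if any(match(l, r) for r in right)]

def anti_join(left, right, cond=None):
    match = cond if cond is not None else (lambda l, r: True)
    return [l for l in left if not any(match(l, r) for r in right)]

t1 = [{"a": 1}, {"a": 2}]
t2 = [{"b": 9}]

# With no condition and a NON-EMPTY right side:
# left-semi keeps every left row, left-anti keeps none.
assert semi_join(t1, t2) == t1
assert anti_join(t1, t2) == []

# With an EMPTY right side the picture flips, so a real rewrite rule
# would need to account for whether the right side can be empty.
assert semi_join(t1, []) == []
assert anti_join(t1, []) == t1
```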
Contributor (Author): @cloud-fan When I click on the link it shows me so many diffs. Were you referring me to a few specific lines when you said "except"?

Secondly, what can I say? When I tried the left-semi join with an empty join condition, I wrote this in my notes: "Explore if there is any optimization opportunity when there is an empty join condition. Is the join necessary.. need to study more" :-)
Contributor: For the second point, it's just an orthogonal optimizer rule; you are welcome to do it in another PR.
Contributor (Author): @cloud-fan Sure wenchen, I will do it.