[SPARK-12613] [SQL] Outer Join Elimination by Parent Join Condition #10566
Test build #48631 has finished for PR 10566 at commit
btw i created this: https://issues.apache.org/jira/browse/SPARK-12616 seems like something you can do?
Sure, I can make a try! Thank you!
As I commented on the other PR, I think we should have a more general way to infer null propagation / filtering. Maybe you can discuss with @sameeragarwal and then update these PRs after his machinery is available.
Sure, will do. Thank you!
@gatorsmile similar to #10566, I think we should now be just able to apply this optimization rule more generally along the lines of:
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _)) =>
    Filter(condition, buildNewJoin(f, j))
  // Case 1: when parent join is Inner|LeftSemi|LeftOuter and the child join is on the right side
  case pj @ Join(pLeft, j @ Join(left, right, RightOuter | LeftOuter | FullOuter, condition), Inner | LeftSemi | LeftOuter, Some(pJoinCond)) =>
    Join(pLeft, buildNewJoin(pj, j), pj.joinType, Some(pJoinCond))
  // Case 2: when parent join is Inner|LeftSemi|RightOuter and the child join is on the left side
  case pj @ Join(j @ Join(left, right, RightOuter | LeftOuter | FullOuter, condition), pRight, Inner | LeftSemi | RightOuter, Some(pJoinCond)) =>
    Join(buildNewJoin(pj, j), pRight, pj.joinType, Some(pJoinCond))
}
Thanks!
Sure, will do the changes. Thank you!
@sameeragarwal Unfortunately, they are unable to share the same buildNewJoin function.
For example, if the parent join is full outer, the parent join will not have any IsNotNull constraint. In the current constraint propagation, its constraint set is Set.empty[Expression]. However, the join condition of this parent join can still be used for outer join elimination of the child join.
Let me do the outer join elimination by Filter first. That one can directly use the existing infrastructure of constraint propagation. #10567 Thanks!
Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: #10566
Author: gatorsmile
Author: xiaoli
Author: Xiao Li
Closes #10630 from gatorsmile/IntersectBySemiJoin.
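As a rough illustration of the rewrite described in that commit message, here is a minimal sketch using plain Scala collections in place of Spark plans. The object and method names (`IntersectAsSemiJoin`, `leftSemiJoin`, `intersect`) are hypothetical and are not Spark APIs. The sketch also ignores SQL NULL handling: INTERSECT treats NULLs as equal, which ordinary value equality on non-null data does not need to model.

```scala
// Hypothetical, Spark-independent sketch: INTERSECT as DISTINCT over a left-semi join.
object IntersectAsSemiJoin {
  // Left-semi join on whole-row equality: keep each left row that has
  // at least one equal row on the right; no right-side columns are emitted.
  def leftSemiJoin[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val rightRows = right.toSet
    left.filter(rightRows.contains)
  }

  // Set-semantics INTERSECT: de-duplicate the semi-join output.
  def intersect[A](left: Seq[A], right: Seq[A]): Seq[A] =
    leftSemiJoin(left, right).distinct
}
```

For example, `IntersectAsSemiJoin.intersect(Seq(1, 2, 2, 3), Seq(2, 3, 3, 4))` keeps the rows 2 and 3 once each.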
Do you want to update this now?
Will do it tonight. Thanks!
First, will add test cases to
Second, the current fix does not cover all the possible cases. I need to get the inputs from you about the issues this PR is facing:
val df = Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str").as("a")
val df2 = Seq((1, 2, "1"), (5, 6, "5")).toDF("int", "int2", "str").as("b")
val df3 = Seq((1, 3, "1"), (4, 6, "5")).toDF("int", "int2", "str").as("c")
// Full -> Left
val full2Left = df.join(df2, $"a.int" === $"b.int", "full")
  .join(df3, $"c.int" === $"a.int", "right").select($"a.*", $"b.*", $"c.*")
In the above case, the parent join condition
However, the parent join condition
Does that look good to you? Thanks! : )
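To make the "Full -> Left" case above concrete, the following sketch replays the same three data sets with plain Scala collections instead of DataFrames. Everything here is an assumption made for illustration: `Option` stands in for a NULL-extended side, and the helper names (`leftOuter`, `fullOuter`, `rightOuterParent`) are hypothetical. It checks only this one data set, so it is a sanity check rather than a proof of the rule.

```scala
// Hypothetical, Spark-independent replay of the example: the child join is on
// a.int === b.int, the parent is a right outer join on c.int === a.int.
object FullToLeftCheck {
  type Row = (Int, Int, String)
  type Pair = (Option[Row], Option[Row]) // None models a NULL-extended side

  val a: Seq[Row] = Seq((1, 2, "1"), (3, 4, "3"))
  val b: Seq[Row] = Seq((1, 2, "1"), (5, 6, "5"))
  val c: Seq[Row] = Seq((1, 3, "1"), (4, 6, "5"))

  // Child left outer join: every row of l survives, padded with None if unmatched.
  def leftOuter(l: Seq[Row], r: Seq[Row]): Seq[Pair] =
    l.flatMap { x =>
      val matches = r.filter(_._1 == x._1)
      if (matches.isEmpty) Seq((Option(x), Option.empty[Row]))
      else matches.map(y => (Option(x), Option(y)))
    }

  // Child full outer join: the left outer result plus unmatched rows of r.
  def fullOuter(l: Seq[Row], r: Seq[Row]): Seq[Pair] =
    leftOuter(l, r) ++
      r.filterNot(y => l.exists(_._1 == y._1)).map(y => (Option.empty[Row], Option(y)))

  // Parent right outer join: a child row matches only when its a-side is
  // present, because a NULL a.int can never satisfy c.int === a.int.
  def rightOuterParent(child: Seq[Pair], r: Seq[Row]): Seq[(Option[Row], Option[Row], Row)] =
    r.flatMap { z =>
      val matches = child.filter { case (x, _) => x.exists(_._1 == z._1) }
      if (matches.isEmpty) Seq((Option.empty[Row], Option.empty[Row], z))
      else matches.map { case (x, y) => (x, y, z) }
    }

  val withFull = rightOuterParent(fullOuter(a, b), c)
  val withLeft = rightOuterParent(leftOuter(a, b), c)
}
```

On this data, `FullToLeftCheck.withFull` and `FullToLeftCheck.withLeft` are equal: the only row that the full outer child adds is the one where the `a` side is NULL, and the parent condition on `a.int` discards it.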
Test build #51943 has finished for PR 10566 at commit
Why isn't the constraint present? We should fix that instead of inventing another unrelated way to reason about nullability.
The existing constraint propagation is bottom up. The join conditions of full-outer joins will not filter out NULL in the outputs of this Join. Here, it is top down. The join conditions of full-outer joins can filter out the NULL of the child outer joins. Will open a separate PR for top-down constraint propagation. Thanks for your suggestions!
: ) Basically, top-down constraint propagation has been done in the optimizer rules:
Plan to add a new rule in optimizer for NULL constraints pushdown.
After more thinking, in my opinion, the best way is to add extra
Let me first create a PR to do Filter removal/cleaning.
Update: #11406 is created.
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. We can also continue the discussion on the JIRA ticket.
This PR is another enhancement to Optimizer. It does not conflict with the other PRs (#10567 and #10551).
Given an outer join (OJ) that is involved in another join (called the parent join, PJ), when the join type of PJ is inner, left-semi, left-outer, or right-outer, check whether the join condition of the PJ satisfies the following two conditions:
If such join predicates exist, execute the elimination rules:
- full outer -> inner if both sides of OJ have such predicates
- left outer -> inner if the right side of OJ has such predicates
- right outer -> inner if the left side of OJ has such predicates
- full outer -> left outer if only the left side of OJ has such predicates
- full outer -> right outer if only the right side of OJ has such predicates
If applicable, this can greatly improve performance, since an outer join is much slower than an inner join, and a full outer join is much slower than a left/right outer join.
BTW, since the rule is different from the rule in #10567, I did not merge them into one rule, to simplify the code review.
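The elimination rules in the description above amount to a small decision table. A minimal sketch of that table follows, assuming a self-contained model rather than Spark's own classes: the `JoinType` hierarchy here is redefined locally, and `OuterJoinElimination.eliminate` with its two boolean flags is a hypothetical helper, not the PR's actual code. The flags stand for "the parent join condition contains a qualifying predicate on this side of the OJ"; how such predicates are detected is left out.

```scala
// Hypothetical, Spark-independent model of the elimination rules.
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

object OuterJoinElimination {
  // leftHasPredicate / rightHasPredicate: whether the parent join condition
  // carries a qualifying predicate on the left / right side of the outer join.
  def eliminate(ojType: JoinType, leftHasPredicate: Boolean, rightHasPredicate: Boolean): JoinType =
    (ojType, leftHasPredicate, rightHasPredicate) match {
      case (FullOuter, true, true)  => Inner      // both sides have such predicates
      case (FullOuter, true, false) => LeftOuter  // only the left side
      case (FullOuter, false, true) => RightOuter // only the right side
      case (LeftOuter, _, true)     => Inner      // right side has such predicates
      case (RightOuter, true, _)    => Inner      // left side has such predicates
      case (other, _, _)            => other      // nothing to eliminate
    }
}
```

For instance, `OuterJoinElimination.eliminate(FullOuter, leftHasPredicate = true, rightHasPredicate = false)` yields `LeftOuter`, matching the "full outer -> left outer" rule.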