[SPARK-12602] [SQL] Join Reordering: Pushing Inner Join Through Left/Right Outer Join #10551
Test build #48575 has finished for PR 10551 at commit
retest this please
Test build #48576 has finished for PR 10551 at commit
Test build #48577 has finished for PR 10551 at commit
Test build #48578 has finished for PR 10551 at commit
Test build #48588 has finished for PR 10551 at commit
Test build #48593 has finished for PR 10551 at commit
Mind closing this one as well?
Let me close it. Thanks!
This PR pushes `Inner Join` through `Left/Right Outer Join`.
The basic idea is built on the associativity property of outer and inner joins:
- R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
- R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = (R1 inner R3 on p13) left R2 on p23
- (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
- (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = (R2 inner R3 on p23) left R1 on p12
In this PR, the reordering can reduce the number of processed rows, because an `Inner Join` never produces more rows than a `Left/Right Outer Join` over the same inputs and predicate. The join predicates of the `Left/Right Outer Join` never filter out rows from its preserved side, so running the `Inner Join` first shrinks the input to the outer join. This PR can improve query performance in most cases, especially when the join predicates of the `Inner Join` are highly selective.
When cost-based optimization is available, we can switch the order of tables in each join type based on their costs. The order of the joined tables in an inner join does not affect the results, and a right outer join can be rewritten as a left outer join. This part is out of scope here.
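The identities listed above can be sanity-checked on small in-memory relations. The following is a minimal sketch in plain Scala, not Catalyst code: the object `JoinReorderCheck` and all of its helpers (`rel`, `merge`, `slotsEqual`, `inner`, `left`, `right`, `bag`) are hypothetical names introduced only for illustration. It assumes each base relation has a single integer column, that every predicate is a null-rejecting equality between two relations, and that results are compared as bags (duplicates count).

```scala
object JoinReorderCheck {
  // A row has one nullable slot per base relation: (R1, R2, R3).
  type Row = Vector[Option[Int]]
  type Rel = Seq[Row]
  type Pred = Row => Boolean

  // Lift plain values into a relation that fills only slot `i`.
  def rel(i: Int, values: Int*): Rel =
    values.map(v => Vector.fill(3)(Option.empty[Int]).updated(i, Some(v)))

  // Combine the two sides of a join; their filled slots are disjoint.
  def merge(l: Row, r: Row): Row = l.zip(r).map { case (a, b) => a.orElse(b) }

  // Null-rejecting equality between slots i and j, like SQL `=`.
  def slotsEqual(i: Int, j: Int): Pred =
    row => row(i).isDefined && row(i) == row(j)

  def inner(l: Rel, r: Rel, p: Pred): Rel =
    l.flatMap(a => r.map(merge(a, _)).filter(p))

  // Left outer join: an unmatched left row is kept, right slots stay None.
  def left(l: Rel, r: Rel, p: Pred): Rel =
    l.flatMap { a =>
      val matches = r.map(merge(a, _)).filter(p)
      if (matches.isEmpty) Seq(a) else matches
    }

  // A right outer join is a left outer join with the sides swapped.
  def right(l: Rel, r: Rel, p: Pred): Rel = left(r, l, p)

  // Bag (multiset) view of a relation, so row order is ignored.
  def bag(r: Rel): Map[Row, Int] =
    r.groupBy(identity).map { case (k, v) => k -> v.size }

  val p12: Pred = slotsEqual(0, 1)
  val p23: Pred = slotsEqual(1, 2)
  val p13: Pred = slotsEqual(0, 2)

  // Checks the left-most and right-most form of each of the four identities.
  def identitiesHold(r1: Rel, r2: Rel, r3: Rel): Boolean = {
    val i1 = bag(inner(r1, left(r2, r3, p23), p12)) ==
      bag(left(inner(r1, r2, p12), r3, p23))
    val i2 = bag(inner(r1, right(r2, r3, p23), p13)) ==
      bag(left(inner(r1, r3, p13), r2, p23))
    val i3 = bag(inner(left(r1, r2, p12), r3, p13)) ==
      bag(left(inner(r1, r3, p13), r2, p12))
    val i4 = bag(inner(right(r1, r2, p12), r3, p23)) ==
      bag(left(inner(r2, r3, p23), r1, p12))
    i1 && i2 && i3 && i4
  }

  def main(args: Array[String]): Unit = {
    val r1 = rel(0, 1, 2, 2, 3)
    val r2 = rel(1, 2, 3, 4)
    val r3 = rel(2, 2, 3, 5, 5)
    println(s"identities hold: ${identitiesHold(r1, r2, r3)}")
    println(s"inner rows: ${inner(r1, r2, p12).size}, left outer rows: ${left(r1, r2, p12).size}")
  }
}
```

The sample data deliberately contains duplicates and unmatched values so that both the null-padding path and the bag comparison are exercised. A check like this only tests the identities on specific inputs; it is not a proof, and it says nothing about predicates that are not null-rejecting.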
For example, given the following eligible query:
df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === $"b.int", "inner")
Before the fix, the logical plan looks like
After the fix, the logical plan looks like