Skip to content

Conversation

@gatorsmile
Copy link
Member

This PR is to push Inner Join through Left/Right Outer Join.

The basic idea is built on the associativity property of outer and inner joins:

  • R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
  • R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = (R1 inner R3 on p13) left R2 on p23
  • (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
  • (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = (R2 inner R3 on p23) left R1 on p12

In this PR, the reordering can reduce the number of processed rows since the Inner Join always can generate less (or equivalent) rows than Left/Right Outer Join. The join predicates of Left/Right Outer Join will not affect the number of returned rows. This PR can improve the query performance in most cases, especially when the join predicates of Inner Join are highly selective.

When cost-based optimization is available, we can switch the order of tables in each join type based on their costs. The order of joined tables in the inner join does not affect the results and the right outer join can be changed to the left outer join. This part is out of scope here.

For example, given the following eligible query:
df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === $"b.int", "inner")

Before the fix, the logical plan is like

Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]

After the fix, the logical plan is like

Join LeftOuter, Some((int#3 = int#9))
:- Join Inner, Some((int#15 = int#9))
:  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
:  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
+- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]

@SparkQA
Copy link

SparkQA commented Jan 2, 2016

Test build #48575 has finished for PR 10551 at commit d9b5411.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 2, 2016

Test build #48576 has finished for PR 10551 at commit d9b5411.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 2, 2016

Test build #48577 has finished for PR 10551 at commit 5f67a74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 2, 2016

Test build #48578 has finished for PR 10551 at commit 5f67a74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 2, 2016

Test build #48588 has finished for PR 10551 at commit 70336a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 3, 2016

Test build #48593 has finished for PR 10551 at commit 7909d4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Jan 12, 2016

Mind closing this one as well?

@gatorsmile
Copy link
Member Author

Let me close it. Thanks!

@gatorsmile gatorsmile closed this Jan 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants