-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-10484][SQL] Optimize the cartesian join with broadcast join for some cases #8652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #42126 has finished for PR 8652 at commit
|
|
Looks this is a part of #7417 |
|
Hi @chenghao-intel , |
|
After optimized patch , we can see "CartesianProduct" optimized to "BroadcastNestedLoopJoin" from physical plan for cross join. The benchmark result showed ~42% performance gain(15m1s vs. 26m37s). == Physical Plan == |
|
Hi @yhuai , I saw the PR is ready for some time. could you help to review this PR. Hopefully it can be fixed in 1.5.1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will be the joinType at here. Why the exception at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoin.scala is not triggered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not so sure your mean about the exception at BroadcastNestedLoopJoin.scala, actually I also add the code change BroadcastNestedLoopJoin.scala below, to support the INNER join, did that your mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, sorry. I missed that line. For a CROSS JOIN, we will use INNER as the join type and the condition should be None. Can explicitly match these? I mean to let others easy to understand the code, case logical.Join(CanBroadcast(left), right, JoinType.Inner, None) will be better, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we are trying to support the Cartesian join by using the Broadcast join, so it will support all of the join types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cartesian means that there is no join condition, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the join type is CROSS_JOIN. At here we just put INNER as a placeholder because the type does not really matter.
|
Test build #43763 has finished for PR 8652 at commit
|
07144ba to
26818c7
Compare
|
@yhuai, after some investigation, I think the we still can optimize the left/right/full outer join in broadcasting way if they don't have condition, but LeftSemi are not included, and it should be a bug in previous implementation. And I also add more unit test by comparing with Hive. |
|
Test build #43775 has finished for PR 8652 at commit
|
|
retest this please |
|
Test build #43779 has finished for PR 8652 at commit
|
26818c7 to
4ee1797
Compare
|
Test build #43820 has finished for PR 8652 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation is off here.
|
Thank you @liancheng updated! |
|
Test build #43906 has finished for PR 8652 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may miss some context at here. This is a strategy for CartesianProduct. We have a strategy BroadcastNestedLoopJoin to handle other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I noticed that also, however, without these changes, the cases I added will transform into the operator CartesianProduct instead of BroadcastNestedLoopJoin, as the strategy BroadcastNestedLoopJoin is the rule after CartesianProduct.
Besides, explicitly providing the rules for the optimization, probably will be helpful for people to understand how the logic behind.
PS: The rule BroadcastNestedLoopJoin has to be the last gate, as it's supposed to handle all kinds of joins.
|
/cc @harsha2010, this PR may interest you given your similar broadcast cartesian join optimizations in Magellan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this comment related to this case? Also, I still do not think these two rules should be in this Strategy because the name of this strategy is CartesianProduct, but these first two cases are not for CartesianProduct. Actually, can we combine CartesianProduct and BroadcastNestedLoopJoin strategies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why condition needs to be None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the comment is stale.
If we restrict the outer join condition as None here, then it's more like a CartesianProduct, that's why I put the rule in the CartesianProduct, and more importantly, we'd like to take those 2 rules as higher priority than the rule in Line 292.
I am totally agree with you to combine the CartesianProduct and BroadcastNestedLoopJoin, as the later just a special case of former.
Will update the code soon.
fab8923 to
564abd3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a proper name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestion for the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about NonEquiJoinSelection? w.r.t EquiJoinSelection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BroadcastNestedLoopJoin actually supports equi-join, thus we never run into this case, as we have more optimal solution for it in previous rules.
After a double think, I am a little hesitate to combine the rules in CartesianProduct and BroadcastNestedLoopJoin, as the later is supposed to be the last gate for JOIN, and works for all kinds of JOIN type w/ or w/o join condition, the others can be considered as the optimization compared to it.
I am going to revert the code change if @yhuai is not strongly opposite to it. Or we can refactor the JOIN strategy after this PR been merged.
What do you think @yhuai ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if BroadcastNestedLoopJoin is a good last gate. It is possible that we join two large tables and we cannot really use broadcast join, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, it is fine to revert it since we will not make that case worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these two rules using CartesianProduct should go to the bottom.
|
Test build #44329 has finished for PR 8652 at commit
|
|
Test build #44331 has finished for PR 8652 at commit
|
293e5ff to
975eb46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The renaming is intended, as some of the users will extends the SQLContext, they may not ware the changing here.
|
Thank you @yhuai , I've just split the |
|
Test build #44341 has finished for PR 8652 at commit
|
|
Test build #44410 has finished for PR 8652 at commit
|
|
@yhuai Any more comment on this? |
|
LGTM. Merging to master. |
In some cases, we can broadcast the smaller relation in cartesian join, which improve the performance significantly.