-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-4485][SQL]Add BroadcastHashOuterJoin #3362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
Semantically, I don't think the outer join can be done via broadcast the small table in some cases. For example: But if it's the |
|
@chenghao-intel ,you are right , the |
|
Oh, Thanks for clarification. I read the code again and it does work for this particular optimization. there is some duplicated code with |
|
By the way, |
|
test this please |
|
Test build #23699 has started for PR 3362 at commit
|
|
Test build #23699 has finished for PR 3362 at commit
|
|
Test FAILed. |
|
@liancheng Add testsuite. |
b565a3c to
9a28591
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should throw a exception here, because other types of outer join should not use BroadcastHashOuterJoin
|
it is very good! |
|
ok to test |
|
The benchmark figures are pretty impressive! Sorry for the delay, 1.1.1 and 1.2.0 releases plus Thanksgiving slow down PR reviewing/merging a lot, will come back to this soon later this week or early next week. |
|
Test build #23928 has started for PR 3362 at commit
|
|
Test build #23928 has finished for PR 3362 at commit
|
|
Test PASSed. |
|
Sorry for the delay! The benchmark results are very impressive so I think we should definitely add this feature! Here is what I think needs to be done:
Since we are trying to keep the PR queue limited to active PRs, I propose we close this issue for now and reopen it as soon as the new version is ready. Looking forward to this improved join implementation :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are never planning HashOuterJoin? We need to make sure we don't lose test coverage here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is always MetastoreRelation
JIRA issue: SPARK-4485
We are planning to create a
BroadcastHashouterJointo implement the broadcast join forleft outer joinandright outer joinIn left outer join :
If the size of data from right side is smaller than the user-settable threshold
AUTO_BROADCASTJOIN_THRESHOLD,the planner would mark it as the broadcast relation and mark the other relation as the stream side. The broadcast table will be broadcasted to all of the executors involved in the join, as aorg.apache.spark.broadcast.Broadcast object. It will use joins.BroadcastHashouterJoin.,else it will usejoins.HashOuterJoin.
In right outer join:
If the size of data from left side is smaller than the user-settable threshold
AUTO_BROADCASTJOIN_THRESHOLD,the planner would mark it as the broadcast relation and mark the other relation as the stream side. The broadcast table will be broadcasted to all of the executors involved in the join, asaorg.apache.spark.broadcast.Broadcastobject. It will usejoins.BroadcastHashouterJoin. else it will usejoins.HashOuterJoin.The benchmark suggests these made the optimized version 7x faster when
left outer joinandright outer joinThe micro benchmark load
data1/kv3.txtinto a normal Hive table.Benchmark code: