[SPARK-12656] [SQL] Implement Intersect with Left-semi Join #10630
|
@rxin Please review the implementation. Thank you! |
|
Which mainstream RDBMS is that? |
LeftSemi -> LeftSemiJoin or just SemiJoin
yeah. Forgot to specify the join type
|
MS SQL Server did that |
This should go into one of the optimizer unit test suites, not here.
ok, will add a new test suite for it.
|
LGTM. cc @cloud-fan to take a look too. |
|
Test build #48900 has finished for PR 10630 at commit
|
use transformUp?
cc @yhuai
actually nvm.
nit: can we put it in one line?
|
When resolving the conflicts, I realized the multi-children Let me know if we need to open a separate PR to do it now. So far, unlike |
|
Test build #49936 has finished for PR 10630 at commit
|
|
I don't think it's a problem for there to be conflicting attribute ids for set operations, because only one child's attribute references need to be propagated up (unlike with a join). |
|
Yeah, agree! Thank you! |
Seems we need to remove this.
yeah, sure, will do.
|
Test build #50176 has finished for PR 10630 at commit
|
remove this
will do.
|
Test build #50257 has finished for PR 10630 at commit
|
|
Test build #50313 has finished for PR 10630 at commit
|
now we can keep this message as it only checks join :)
Can users observe the error, or can it be considered an internal error? BTW, we are about to convert it to an internal error in the PR: #41476
|
LGTM. We can merge it first, and @gatorsmile can address the remaining comments in a follow-up PR. |
|
This is not that big. Let's just do it together here. |
|
Thank you! Just cleaned up the code. :) |
|
LGTM, pending test |
|
Test build #50368 has finished for PR 10630 at commit
|
|
Thanks - I'm going to merge this. |
Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
After a search, I found that one mainstream RDBMS does the same: in its query explain output, Intersect is replaced by a left-semi join. A left-semi join could also help outer-join elimination in the optimizer, as shown in the PR: #10566
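To illustrate the rewrite described above, here is a minimal plain-Python sketch of the semantics, not Spark or Catalyst code. It models rows as tuples with `None` standing in for SQL NULL, and assumes the join condition compares every column pair with null-safe equality, since SQL set operations treat two NULLs as equal. The names `null_safe_equal`, `left_semi_join`, `distinct`, and `intersect` are hypothetical helpers introduced only for this example.

```python
from typing import List, Optional, Sequence, Tuple

# A row is a tuple of column values; None models SQL NULL.
Row = Tuple[Optional[int], ...]


def null_safe_equal(left: Row, right: Row) -> bool:
    """Column-by-column null-safe equality: NULL matches NULL, never a value."""
    if len(left) != len(right):
        return False
    return all(
        (l is None and r is None) or (l is not None and r is not None and l == r)
        for l, r in zip(left, right)
    )


def left_semi_join(left: Sequence[Row], right: Sequence[Row]) -> List[Row]:
    """Keep each left row that has at least one matching right row.

    Only left columns are returned, and duplicate left rows are preserved.
    """
    return [l for l in left if any(null_safe_equal(l, r) for r in right)]


def distinct(rows: Sequence[Row]) -> List[Row]:
    """Drop duplicate rows, keeping first-seen order."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def intersect(left: Sequence[Row], right: Sequence[Row]) -> List[Row]:
    """INTERSECT modeled as DISTINCT over a left-semi join."""
    return distinct(left_semi_join(left, right))


left_rows = [(1, None), (1, None), (2, 3), (4, 5)]
right_rows = [(1, None), (2, 3), (2, 3), (6, 7)]
print(intersect(left_rows, right_rows))  # prints [(1, None), (2, 3)]
```

The semi-join alone would return `(1, None)` twice, because duplicates on the left side survive the join. The `distinct` step is what restores set semantics, which is why the description says "semi-join with distinct" rather than a semi-join by itself.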