[SPARK-9372] [SQL] For joins, insert IS NOT NULL filters to children. #10209
|
Test build #47375 has finished for PR 10209 at commit
|
Some join types and conditions imply that rows with NULL join keys can never match, so those rows can be filtered out by the children. This patch does this for inner joins and introduces a mechanism to generate predicates. The hard part is keeping the transformation stable: we want to avoid generating a filter in the join, having it pushed down, and then having the join regenerate the same filter. This patch solves this by having the join remember the predicates it has generated. The mechanism should be general enough to infer other predicates; for example, "a join b where a.id = b.id AND a.id = 10" could also use it to generate the predicate "b.id = 10".
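To make the stability concern concrete, here is a minimal sketch in plain Python. It is a toy model, not Catalyst code: `Relation`, `Filter`, `InnerJoin`, the `generated` field, and `insert_not_null_filters` are all hypothetical names invented for illustration. The join records which IS NOT NULL predicates it has already emitted, so re-running the rule after another rule has moved the filters away does not produce them again.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Filter:
    not_null: frozenset  # column names required to be non-NULL
    child: object


@dataclass(frozen=True)
class InnerJoin:
    left: object
    right: object
    left_key: str
    right_key: str
    generated: frozenset = frozenset()  # predicates this join already emitted


def insert_not_null_filters(join: InnerJoin) -> InnerJoin:
    """Apply the (hypothetical) rule once to an inner equi-join."""
    missing = {join.left_key, join.right_key} - join.generated
    if not missing:
        return join  # everything was generated before; do nothing
    left, right = join.left, join.right
    if join.left_key in missing:
        left = Filter(frozenset({join.left_key}), left)
    if join.right_key in missing:
        right = Filter(frozenset({join.right_key}), right)
    return InnerJoin(left, right, join.left_key, join.right_key,
                     join.generated | missing)


plan = InnerJoin(Relation("a"), Relation("b"), "a.id", "b.id")
once = insert_not_null_filters(plan)

# Simulate another rule pushing the filters further down, out of sight of the join.
moved = replace(once, left=Relation("a"), right=Relation("b"))
again = insert_not_null_filters(moved)
```

In this sketch `again` is identical to `moved`: the filters are not regenerated even though they no longer sit directly under the join, which is the fixed-point behavior the description asks for.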
|
Test build #47488 has finished for PR 10209 at commit
|
nit: we can use if !j.selfJoinResolved
|
This looks very cool!
This is semi-public API, because I think some advanced projects do dig into Catalyst, and we've never changed the signature of something as basic as Join before. Could we do this instead by fixing nullability propagation and only inserting the filter if the attribute is nullable?
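A rough sketch of that alternative, again as a toy Python model rather than real Catalyst classes (`Attribute`, `Scan`, `NotNullFilter`, and `filter_if_nullable` are hypothetical): the filter marks its column as non-nullable in its output, so the rule has nothing left to do on a second pass and the Join signature stays untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    nullable: bool


@dataclass(frozen=True)
class Scan:
    output: tuple  # tuple of Attribute


@dataclass(frozen=True)
class NotNullFilter:
    column: str
    child: object

    @property
    def output(self):
        # The filtered column can no longer be NULL; other columns are unchanged.
        return tuple(
            Attribute(a.name, a.nullable and a.name != self.column)
            for a in self.child.output
        )


def filter_if_nullable(child, key):
    """Insert an IS NOT NULL filter only when the key may still be NULL."""
    attr = next(a for a in child.output if a.name == key)
    return NotNullFilter(key, child) if attr.nullable else child


scan = Scan((Attribute("key", True), Attribute("value", True)))
once = filter_if_nullable(scan, "key")
twice = filter_if_nullable(once, "key")  # no second filter is added
```

This only stays stable if every operator between the filter and the join propagates nullability correctly, which is the "fixing nullability propagation" part of the suggestion.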
I started down that path, but can you think of a way to make it handle the more general case of predicate propagation? For example:
t1.key join t2.key where t1.key = t2.key and t1.key = 5.
How do we generate the predicate t2.key = 5? How do we make this more general?
In an earlier version of catalyst we also had equivalence classes propagate up the logical plans. Would that give you enough information?
Equivalence classes are one thing; I think we can compute those without a problem. The issue is how to remember that t2.key = 5 was already generated so that we don't generate it again. The trick of setting nullable doesn't work here. We could maintain value constraints (of which nullability is a special case).
Couldn't Literal(5) be in the equivalence class and we could check for that?
That said, I also like the idea of more general value constraints.
Literal(5) would work for equivalence but we want to track more than equality. If it was t1.key join t2.key where t1.key = t2.key and t1.key > 5, we'd similarly want to add t2.key > 5.
Are you suggesting we don't change the operator and walk the tree bottom up to collect these constraints? This seems extremely expensive to do.
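For reference, the inference being discussed (copying `= 5` or `> 5` from one join key to the other) can be sketched as follows. This is an illustrative Python model with made-up names (`infer_predicates`, the tuple encoding of predicates), not anything from the patch. It does a single pass over the equalities rather than a full transitive closure, and it takes an `already_generated` set to model the "remember what was generated" requirement.

```python
def infer_predicates(equalities, predicates, already_generated=frozenset()):
    """Copy single-column comparisons across join-key equalities.

    equalities: set of (column, column) pairs, e.g. ("t1.key", "t2.key")
    predicates: set of (column, op, literal), e.g. ("t1.key", ">", 5)
    Returns only predicates that are neither present nor previously generated.
    """
    inferred = set()
    for left, right in equalities:
        for column, op, value in predicates:
            if column == left:
                inferred.add((right, op, value))
            elif column == right:
                inferred.add((left, op, value))
    return inferred - set(predicates) - set(already_generated)


equalities = {("t1.key", "t2.key")}

from_equality = infer_predicates(equalities, {("t1.key", "=", 5)})
from_range = infer_predicates(equalities, {("t1.key", ">", 5)})

# A second run that is told what was generated before yields nothing new.
second_run = infer_predicates(
    equalities, {("t1.key", ">", 5)}, already_generated=from_range
)
```

Because the operator is carried along as data, the same pass handles equality and range comparisons alike, which is the reason an equivalence class containing only `Literal(5)` would not be enough.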
I was thinking operators would propagate the set of constraints up from their children (possibly augmenting or clearing as appropriate) and we'd save it in a lazy val.
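Roughly, that idea could look like the sketch below. It is a Python stand-in for a Scala `lazy val`, using `functools.cached_property`; the class names and the `propagate` hook are assumptions made for illustration, not Catalyst APIs. Each operator computes its constraint set once from its children's sets, so there is no repeated bottom-up walk of the tree.

```python
from functools import cached_property


class Plan:
    def __init__(self, *children):
        self.children = children

    def propagate(self, inherited):
        """Hook for subclasses: augment or clear the children's constraints."""
        return inherited

    @cached_property
    def constraints(self):
        # Computed on first access only, then stored on the instance.
        inherited = frozenset().union(*(c.constraints for c in self.children))
        return self.propagate(inherited)


class Scan(Plan):
    pass


class InnerJoin(Plan):
    pass  # passes both children's constraints through unchanged


class Filter(Plan):
    def __init__(self, predicate, child):
        super().__init__(child)
        self.predicate = predicate

    def propagate(self, inherited):
        return inherited | {self.predicate}  # augment


class Opaque(Plan):
    """Stand-in for an operator whose output says nothing about its input."""

    def propagate(self, inherited):
        return frozenset()  # clear


join = InnerJoin(Filter(("t1.key", ">", 5), Scan()), Scan())
first = join.constraints
second = join.constraints  # served from the cache, not recomputed
```

The cost concern raised above is addressed by the caching only as long as plan nodes are immutable; a transformed plan is a new object and computes its own set.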
|
Test build #47531 has finished for PR 10209 at commit
|
|
Is this already fixed by #7768?