[SPARK-9372] [SQL] Filter nulls in join keys #7768
Test build #38952 has finished for PR 7768 at commit
Test build #38967 has finished for PR 7768 at commit
Test build #39013 has finished for PR 7768 at commit
test this please
Test build #39018 has finished for PR 7768 at commit
Jenkins, retest this please.
Test build #39081 has finished for PR 7768 at commit
Just to briefly clarify, I guess that the problem was that AtLeastNNulls also dropped NaNs but that we can't do that since it would lead to a violation of our NaN-equality semantics when joining on float/double columns?
Yeah. null means Unknown, so a predicate like null = null evaluates to Unknown, which a join condition treats as false. But for NaN, in our current semantics, two NaNs are equal.
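The distinction above can be sketched in plain Python. This is only a model of the semantics described in the comment, not Spark code: `sql_equals` and `join_keeps` are hypothetical helper names, `None` stands in for SQL null, and the NaN branch assumes the NaN-equality behaviour mentioned in this thread.

```python
import math


def sql_equals(a, b):
    """Three-valued equality: returns True, False, or None (Unknown)."""
    if a is None or b is None:
        # Any comparison involving null is Unknown.
        return None
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            # Assumed semantics from the discussion: NaN equals NaN.
            return True
    return a == b


def join_keeps(a, b):
    """A join condition keeps a pair only when equality is definitely True."""
    return sql_equals(a, b) is True


nan = float("nan")
null_pair_matches = join_keeps(None, None)  # Unknown is treated as no match
nan_pair_matches = join_keeps(nan, nan)     # NaN keys still match each other
```

Under this model, dropping null join keys cannot change the join result, while dropping NaN keys would remove rows that are supposed to match.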
Jenkins, retest this please.
Test build #39445 has finished for PR 7768 at commit
@JoshRosen If you think changes in this PR are good, how about we merge it?
Looking now... sorry for delay.
This comment looks out-of-date, probably a result of the splitting of the larger patch.
These arguments are slightly underindented.
LGTM overall, aside from a minor note about an out-of-date comment.
Technically I suppose that we could also add a filter if b is null, since null + 1 == null, leading to an empty join result for those rows? We can figure this out for a simple case like this, but I guess the logic is too complicated to apply to arbitrary expressions.
Yeah. We need to understand if an expression can generate null if the input is non-nullable.
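As a rough illustration of the analysis this would require, here is a minimal Python sketch. It is not Spark's Catalyst code: the expression classes (`Attr`, `Lit`, `Add`, `Coalesce`) and the function `null_forcing_attrs` are hypothetical, and the sketch only assumes that addition returns null when either operand is null while coalesce returns null only when every argument is null.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Coalesce:
    args: tuple


def null_forcing_attrs(expr):
    """Return the attribute names whose nullness guarantees expr is null."""
    if isinstance(expr, Attr):
        return {expr.name}
    if isinstance(expr, Lit):
        return set()
    if isinstance(expr, Add):
        # Addition is null as soon as either operand is null.
        return null_forcing_attrs(expr.left) | null_forcing_attrs(expr.right)
    if isinstance(expr, Coalesce):
        # Coalesce is null only if every argument is null, so an attribute
        # forces null only when it forces every argument to be null.
        sets = [null_forcing_attrs(arg) for arg in expr.args]
        return set.intersection(*sets) if sets else set()
    # Unknown expression type: stay conservative and claim nothing.
    return set()


# For a join key like b + 1, a null b always yields a null key,
# so a filter on "b is not null" would be safe to add.
simple_key = null_forcing_attrs(Add(Attr("b"), Lit(1)))

# For coalesce(b, 0), a null b does not yield a null key, so no filter applies.
guarded_key = null_forcing_attrs(Coalesce((Attr("b"), Lit(0))))
```

The conservative fallback for unknown expression types reflects the concern raised above: the filter is only safe to add when the analysis can prove the key becomes null.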
Test build #39506 has finished for PR 7768 at commit
Since this passed tests, I'm going to merge this into master to unblock the other null-related patch.
This PR adds an optimization rule, FilterNullsInJoinKey, to add Filter before join operators to filter out rows having null values for join keys.

This optimization is guarded by a new SQL conf, spark.sql.advancedOptimization.

The code in this PR was authored by @yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.
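To show why filtering null keys ahead of an inner equi-join is safe, here is a minimal plain-Python sketch. It is not the rule's implementation and does not use Spark: `nested_loop_join` and `filter_null_keys` are hypothetical helpers, rows are plain dicts, and `None` stands in for SQL null.

```python
def nested_loop_join(left, right, left_key, right_key):
    """Inner equi-join; returns (matched pairs, number of key comparisons)."""
    pairs = []
    comparisons = 0
    for l_row in left:
        for r_row in right:
            comparisons += 1
            a, b = l_row[left_key], r_row[right_key]
            # A null key never satisfies the equality condition.
            if a is not None and b is not None and a == b:
                pairs.append((l_row, r_row))
    return pairs, comparisons


def filter_null_keys(rows, key):
    """Drop rows whose join key is null, as a Filter under the join would."""
    return [row for row in rows if row[key] is not None]


left = [{"a": 1}, {"a": None}, {"a": 2}]
right = [{"b": 1}, {"b": None}, {"b": 1}]

# Join without the extra filter: 3 x 3 = 9 comparisons.
plain, n_plain = nested_loop_join(left, right, "a", "b")

# Join after filtering null keys on both sides: 2 x 2 = 4 comparisons.
filtered, n_filtered = nested_loop_join(
    filter_null_keys(left, "a"), filter_null_keys(right, "b"), "a", "b"
)
```

Both joins produce the same pairs, but the filtered inputs are smaller, which is the motivation for pushing the null check below the join. The sketch only covers inner joins; outer joins must preserve unmatched rows, so the same filter cannot be applied blindly to the preserved side.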