-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-9372] [SQL] Filter nulls in Inner joins (null-skew) #9451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9acab52 to
9a6d9dc
Compare
|
ok to test |
|
How about we update the title to include the jira? Is https://issues.apache.org/jira/browse/SPARK-9372 the right one? |
|
Regarding the format of the title, we can do |
|
Test build #44974 has finished for PR 9451 at commit
|
|
Test build #45010 has finished for PR 9451 at commit
|
|
so any comments, guys? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideas on better/simpler way to extract left/right join key columns ?
maybe:
joinConditionsOnBothRelations.map { case EqualTo(leftColumn, rightColumn) =>
// check columns on both sides of join condition,
// and take the one which refers to the required join side
Seq(leftColumn, rightColumn)
.filter(canEvaluate(_, leftOrRight))
.filter(_.nullable)
}is there a big difference between checking for nullability one side of EqualTo() predicate vs magically extracting the equivalent attribute from left/right LogicalPlans'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so is catalyst.planning.patterns.ExtractEquiJoinKeys is the right way to go (?)
|
Test build #45291 has finished for PR 9451 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in Inner | Semi join case, the null filter could be added to joinCondition (instead of left/right relations), assuming that I'll be pushed down by subsequent optimizer rules.
which do you prefer?
|
Test build #45301 has finished for PR 9451 at commit
|
|
Hey, thanks for working on this! I probably won't have time to look at this in depth until after the Spark 1.6 release (early december). |
0fa27c4 to
d05a63d
Compare
|
@marmbrus ping ;) |
|
Test build #48673 has finished for PR 9451 at commit
|
d05a63d to
1bcf9aa
Compare
|
Test build #48694 has finished for PR 9451 at commit
|
i.e. should not rewrite <=> or comparison, where null semantics are more subtle
it will be pushed down by other rules, such as PushPredicateThroughJoin
1bcf9aa to
cd8ca34
Compare
|
Test build #48754 has finished for PR 9451 at commit
|
|
@marmbrus pinging you to tag Catalyst plan rewriting guru. For current inner join PR, there's a flanky test in python, couldn't track it down yet. For more generic case (next PR), It doesn't seem to be easy not to loose table aliases, and add a randomized spraying column as extra left join key. P.S. I'm in SF bay until Fri 8 Apr (better before Thu 9), so I could come over to chat to you guys live. |
|
@vidma I think this is already fixed in master (having constraints for join and turn constraints into predicate, push down the predicates), do you mind to close this PR? |
Do not merge yet: Work in progress / waiting for comments
Draft of first step in optimizing skew in joins (it is quite common to have skew in data, and lots of
nullson either side of join is quite common (for us), especially with left join, say when joiningdimensionstofacttables)feel free to propose a better approach / add commits.
any ideas for an easy way to check if the rule was already applied? After adding a
isNotNullfiltersomeAttribute.nullablestill returnstrue. I couldn't come up with anything better than simply doing a separate batch of 1 iteration.@marmbrus (as discussed at Spark Summit EU)
the next more serious step will be to fight skew in left join, where most helpers of this PR will be reused.
here is a rather simple implementation with DataFrames, solves the null skew, and don't seem to add lots of overhead (though tried only on subset of all our joins which used another abstraction of ours).
however this, so far, seems harder to express in optimizer rules: