[SPARK-7871] [SQL] Improve the outputPartitioning for outer joins. #7886
Test build #39520 has finished for PR 7886 at commit
Test build #39553 has finished for PR 7886 at commit
I'd like to try to review this now since I think it's going to conflict with the SMJ outer join patch.
One high-level comment: unless I've overlooked it, there doesn't seem to be any documentation in the code to explain what the
Expression's use of |
Why did you use a while loop here instead of a for comprehension or pair of nested for loops?
Actually I'm going to drop review of this for now and focus on pulling in SMJ first. That will conflict with this patch but we can remember to update SMJ's OutputPartitioning as well.
Need to override PartitioningCollection.nullSafe.
I am closing it for now. Will reopen it when I get a chance to work on it.
https://issues.apache.org/jira/browse/SPARK-7871
This PR adds the concept of nullSafe to ClusteredDistribution and HashPartitioning. For a ClusteredDistribution, if its nullSafe field is false, it does not require that rows whose clustering expressions contain nulls be clustered. For a HashPartitioning, if its nullSafe field is false, it does not guarantee that rows whose clustering expressions contain nulls are clustered.

This concept can be used with equi-joins. A shuffled equi-join operator (ShuffledHashJoin, ShuffledHashOuterJoin, and SortMergeJoin) can use ClusteredDistributions with nullSafe = false. By adding this concept, we can avoid shuffling data when we have outer joins. For example, we only need three Exchange operators for a query like SELECT ... A LEFT OUTER JOIN B ON (A.key = B.key) LEFT OUTER JOIN C ON (B.key = C.key) instead of four Exchange operators.

BTW, this PR does not shuffle rows with null partition keys randomly (#7685 has that part; we can add it later).
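
To make the nullSafe rule concrete, here is a minimal toy model in Python. It is a sketch of the idea only, not Spark's Scala implementation: the class names mirror the ones in the description, but the fields, the satisfies function, and the subset check are assumptions made for illustration. The rule it encodes is that a partitioning which does not guarantee clustering of null keys can only satisfy a distribution that does not require it. Relaxing the requirement is sound for an equi-join because a null key never compares equal to anything under `=`, so rows with null keys do not need to meet in the same partition.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusteredDistribution:
    """Toy requirement: rows with equal clustering keys must be co-located."""
    clustering: Tuple[str, ...]
    null_safe: bool = True  # False: rows with null keys need not be co-located


@dataclass(frozen=True)
class HashPartitioning:
    """Toy guarantee offered by an operator's output."""
    expressions: Tuple[str, ...]
    null_safe: bool = True  # False: rows with null keys may sit in any partition


def satisfies(p: HashPartitioning, d: ClusteredDistribution) -> bool:
    """Hypothetical check: does guarantee `p` meet requirement `d`?"""
    # The partitioning keys must all appear among the required clustering keys.
    keys_ok = bool(p.expressions) and set(p.expressions) <= set(d.clustering)
    # A non-null-safe guarantee only meets a non-null-safe requirement.
    nulls_ok = p.null_safe or not d.null_safe
    return keys_ok and nulls_ok


# Output of A LEFT OUTER JOIN B ON (A.key = B.key), seen through B.key:
# unmatched rows of A carry a null B.key and stay wherever A.key hashed them,
# so the guarantee on B.key is not null-safe.
join_output = HashPartitioning(("B.key",), null_safe=False)

# The second join, on B.key = C.key, only needs a non-null-safe clustering.
relaxed = ClusteredDistribution(("B.key",), null_safe=False)
strict = ClusteredDistribution(("B.key",), null_safe=True)

print(satisfies(join_output, relaxed))  # True: no extra Exchange on this side
print(satisfies(join_output, strict))   # False: a re-shuffle would be needed
```

Under this reading, the first join's output can feed the second join directly, which would account for the three Exchange operators instead of four in the example query.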