[SPARK-17698] [SQL] Join predicates should not contain filter clauses #15600
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This is a backport of #15272 to 2.0 branch.
Jira : https://issues.apache.org/jira/browse/SPARK-17698
ExtractEquiJoinKeysis incorrectly using filter predicates as the join condition for joins.canEvaluate[0] tries to see if the anExpressioncan be evaluated using output of a givenPlan. In case of filter predicates (eg.a.id='1'), theExpressionpassed for the right hand side (ie. '1' ) is aLiteralwhich does not have any attribute references. Thusexpr.referencesis an empty set which theoretically is a subset of any set. This leads tocanEvaluatereturningtrueanda.id='1'is treated as a join predicate. While this does not lead to incorrect results but in case of bucketed + sorted tables, we might miss out on avoiding un-necessary shuffle + sort. See example below:[0] : https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L91
eg.
BEFORE: This is doing shuffle + sort over table scan outputs which is not needed as both tables are bucketed and sorted on the same columns and have same number of buckets. This should be a single stage job.
AFTER :
How was this patch tested?
SPARK-17698 Join predicates should not contain filter clausesBucketedReadSuite