[SPARK-9372] [SQL] Filter nulls in join keys #7768

JoshRosen · 2015-07-30T02:20:28Z

This PR adds an optimization rule, FilterNullsInJoinKey, that inserts a Filter before join operators to drop rows whose join keys are null.

This optimization is guarded by a new SQL conf, spark.sql.advancedOptimization.

The code in this PR was authored by @yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.
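To illustrate why dropping null-keyed rows ahead of time is safe for an inner equi-join, here is a minimal sketch in plain Scala collections. It is not the rule's implementation and does not use Spark; `NullKeyFilterSketch`, `Row`, `innerJoin`, and `filterNullKeys` are hypothetical names, and `None` stands in for SQL NULL.

```scala
// Plain-Scala model of an inner equi-join; illustrative only, not Spark code.
object NullKeyFilterSketch {
  // A row is a (join key, payload) pair; None models a SQL NULL key.
  type Row = (Option[Int], String)

  // Inner equi-join: a NULL key never matches anything, not even another NULL.
  def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[(Int, String, String)] =
    for {
      (lKey, lVal) <- left
      (rKey, rVal) <- right
      k <- lKey.toList          // skips left rows whose key is NULL
      if rKey.contains(k)       // false when the right key is NULL or different
    } yield (k, lVal, rVal)

  // What the pre-join filter amounts to: keep only rows with a non-NULL key.
  def filterNullKeys(rows: Seq[Row]): Seq[Row] = rows.filter(_._1.isDefined)

  val left: Seq[Row]  = Seq((Some(1), "a"), (None, "b"), (Some(2), "c"))
  val right: Seq[Row] = Seq((Some(1), "x"), (None, "y"), (Some(3), "z"))

  // Both joins produce the same rows; the filtered one compares fewer pairs.
  val unfiltered = innerJoin(left, right)
  val filtered   = innerJoin(filterNullKeys(left), filterNullKeys(right))
}
```

In this model the filtered inputs have two rows each instead of three, so the join does less work while returning the same result.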


SparkQA · 2015-07-30T02:41:31Z

Test build #38952 has finished for PR 7768 at commit 303236b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.


SparkQA · 2015-07-30T05:15:56Z

Test build #38967 has finished for PR 7768 at commit be88760.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class FilterNullsInJoinKey(

SparkQA · 2015-07-30T07:03:02Z

Test build #39013 has finished for PR 7768 at commit 0a8e096.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class FilterNullsInJoinKey(

yhuai · 2015-07-30T07:03:56Z

test this please

SparkQA · 2015-07-30T08:54:42Z

Test build #39018 has finished for PR 7768 at commit 0a8e096.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class FilterNullsInJoinKey(

JoshRosen · 2015-07-30T18:23:24Z

Jenkins, retest this please.

SparkQA · 2015-07-30T20:22:24Z

Test build #39081 has finished for PR 7768 at commit 0a8e096.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class FilterNullsInJoinKey(

JoshRosen · 2015-07-30T23:14:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala

Just to briefly clarify, I guess that the problem was that AtLeastNNulls also dropped NaNs but that we can't do that since it would lead to a violation of our NaN-equality semantics when joining on float/double columns?

yhuai · 2015-07-31

Yeah. Null means unknown, so the predicate null = null evaluates to unknown, which a join condition treats as false. NaN is different: under our current semantics, two NaNs are equal.
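The distinction can be sketched in plain Scala, assuming `Option` models a nullable value and `None` models SQL NULL. The names `NullVsNaNSketch`, `sqlEquals`, `nanAwareEquals`, and `keepsRow` are hypothetical and are not Spark APIs; the NaN-equals-NaN behavior is the semantics described in this thread, not IEEE 754 `==`.

```scala
// Illustrative model of the equality semantics discussed above; not Spark code.
object NullVsNaNSketch {
  // Equality that treats NaN as equal to NaN, unlike the primitive `==`.
  def nanAwareEquals(x: Double, y: Double): Boolean =
    (x.isNaN && y.isNaN) || x == y

  // SQL-style equality: if either side is NULL, the result is unknown (None).
  def sqlEquals(a: Option[Double], b: Option[Double]): Option[Boolean] =
    for (x <- a; y <- b) yield nanAwareEquals(x, y)

  // A join or filter keeps a row only when the condition is definitely true.
  def keepsRow(condition: Option[Boolean]): Boolean = condition.contains(true)
}
```

Under this model, rows with a null key can never satisfy an equality join condition, so filtering them early is safe, whereas rows with a NaN key can still match each other and must be kept.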

JoshRosen · 2015-08-02T21:24:08Z

Jenkins, retest this please.

SparkQA · 2015-08-02T23:21:41Z

Test build #39445 has finished for PR 7768 at commit 0a8e096.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class FilterNullsInJoinKey(

yhuai · 2015-08-03T00:46:11Z

@JoshRosen If you think changes in this PR are good, how about we merge it?

JoshRosen · 2015-08-03T03:59:55Z

Looking now... sorry for delay.

JoshRosen · 2015-08-03T04:14:45Z

sql/core/src/main/scala/org/apache/spark/sql/optimizer/extendedOperatorOptimizations.scala

This comment looks out-of-date, probably a result of the splitting of the larger patch.

JoshRosen · 2015-08-03T04:16:32Z

sql/core/src/main/scala/org/apache/spark/sql/optimizer/extendedOperatorOptimizations.scala

These arguments are slightly underindented.

JoshRosen · 2015-08-03T04:24:34Z

LGTM overall, aside from a minor note about an out-of-date comment.

JoshRosen · 2015-08-03T04:28:13Z

sql/core/src/test/scala/org/apache/spark/sql/optimizer/FilterNullsInJoinKeySuite.scala

Technically I suppose that we could also add a filter if b is null, since null + 1 == null, leading to an empty join result for those rows? We can figure this out for a simple case like this, but I guess the logic is too complicated to apply to arbitrary expressions.

yhuai · 2015-08-03

Yeah. To do that, we would need to determine whether an expression can produce null even when all of its inputs are non-null.
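One way to picture the analysis being discussed is a conservative walk over the expression tree that collects the columns whose null value is guaranteed to make the whole expression null. The sketch below uses a tiny hypothetical expression type in plain Scala; `Expr`, `Column`, `Literal`, `Add`, `Coalesce`, and `nullForcingColumns` are illustrative names, not Catalyst classes, and the rule in this PR does not necessarily work this way.

```scala
// Toy expression tree for reasoning about NULL propagation; not Catalyst code.
object NullPropagationSketch {
  sealed trait Expr
  final case class Column(name: String) extends Expr
  final case class Literal(value: Option[Int]) extends Expr
  // Arithmetic propagates NULL: null + 1 is null.
  final case class Add(left: Expr, right: Expr) extends Expr
  // COALESCE-like: falls back when the child is NULL, so NULL is not propagated.
  final case class Coalesce(child: Expr, fallback: Expr) extends Expr

  // Columns for which "column is NULL" guarantees that `e` evaluates to NULL.
  def nullForcingColumns(e: Expr): Set[String] = e match {
    case Column(n)      => Set(n)
    case Literal(_)     => Set.empty
    case Add(l, r)      => nullForcingColumns(l) ++ nullForcingColumns(r)
    // The result is NULL only if both branches are NULL, hence the intersection.
    case Coalesce(c, f) => nullForcingColumns(c) intersect nullForcingColumns(f)
  }
}
```

For a key such as `Add(Column("b"), Literal(Some(1)))` the function returns `Set("b")`, so a filter on `b` being non-null would be safe in this model; for `Coalesce(Column("b"), Literal(Some(0)))` it returns the empty set, so no filter could be added.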

SparkQA · 2015-08-03T06:27:55Z

Test build #39506 has finished for PR 7768 at commit c02fc3f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class FilterNullsInJoinKey(

JoshRosen · 2015-08-03T06:30:40Z

Since this passed tests, I'm going to merge this into master to unblock the other null-related patch.

yhuai and others added 11 commits July 27, 2015 14:15

Filter out rows that will not be joined in equal joins early.

2201129

Do not add unnecessary filters.

d5b84c3

Introduce NullSafeHashPartitioning and NullUnsafePartitioning.

69bb072

Bug fix and refactoring.

7c2d2d8

wip

e616d3b

Add PartitioningCollection.

c6667e7

Style

f9516b0

First round of cleanup.

d3d2e64

Bug fix.

c57a954

Merge remote-tracking branch 'origin/master' into filter-nulls-in-join-key

40eeece

Revert changes that are unrelated to null join key filtering

303236b

yhuai added 2 commits July 29, 2015 20:28

Fix non-deterministic tests.

8bb39ad

Make it clear that FilterNullsInJoinKeySuite.scala is used to test FilterNullsInJoinKey.

be88760

yhuai added 2 commits July 29, 2015 23:20

Make sure we do not keep adding filters.

ea7d5a6

Update comments.

0a8e096

JoshRosen changed the title ~~[SPARK-9372] [SQL] [WIP] Filter nulls in join keys~~ [SPARK-9372] [SQL] Filter nulls in join keys Jul 30, 2015

JoshRosen reviewed Jul 30, 2015

JoshRosen reviewed Aug 3, 2015

Address Josh's comments.

c02fc3f

asfgit closed this in 687c8c3 Aug 3, 2015

srowen mentioned this pull request May 6, 2016

[SPARK-9372] [SQL] For joins, insert IS NOT NULL filters to children. #10209

Closed


3 participants