Skip to content

Conversation

@JoshRosen
Copy link
Contributor

This PR adds an optimization rule, FilterNullsInJoinKey, to add Filter before join operators to filter out rows having null values for join keys.

This optimization is guarded by a new SQL conf, spark.sql.advancedOptimization.

The code in this PR was authored by @yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.

@SparkQA
Copy link

SparkQA commented Jul 30, 2015

Test build #38952 has finished for PR 7768 at commit 303236b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 30, 2015

Test build #38967 has finished for PR 7768 at commit be88760.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
    • case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
    • class DefaultOptimizer extends Optimizer
    • case class FilterNullsInJoinKey(

@JoshRosen JoshRosen changed the title [SPARK-9372] [SQL] [WIP] Filter nulls in join keys [SPARK-9372] [SQL] Filter nulls in join keys Jul 30, 2015
@SparkQA
Copy link

SparkQA commented Jul 30, 2015

Test build #39013 has finished for PR 7768 at commit 0a8e096.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
    • case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
    • class DefaultOptimizer extends Optimizer
    • case class FilterNullsInJoinKey(

@yhuai
Copy link
Contributor

yhuai commented Jul 30, 2015

test this please

@SparkQA
Copy link

SparkQA commented Jul 30, 2015

Test build #39018 has finished for PR 7768 at commit 0a8e096.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
    • case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
    • class DefaultOptimizer extends Optimizer
    • case class FilterNullsInJoinKey(

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jul 30, 2015

Test build #39081 has finished for PR 7768 at commit 0a8e096.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
    • case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
    • class DefaultOptimizer extends Optimizer
    • case class FilterNullsInJoinKey(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to briefly clarify, I guess that the problem was that AtLeastNNulls also dropped NaNs but that we can't do that since it would lead to a violation of our NaN-equality semantics when joining on float/double columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Because null means Unknown, so when you have a predicate null = null, the result is false (meaning Unknown). But for NaN, in our current semantic, two NaN are equal.

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 2, 2015

Test build #39445 has finished for PR 7768 at commit 0a8e096.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
    • case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
    • class DefaultOptimizer extends Optimizer
    • case class FilterNullsInJoinKey(

@yhuai
Copy link
Contributor

yhuai commented Aug 3, 2015

@JoshRosen If you think changes in this PR are good, how about we merge it?

@JoshRosen
Copy link
Contributor Author

Looking now... sorry for delay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment looks out-of-date, probably a result of the splitting of the larger patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These arguments are slightly underindented.

@JoshRosen
Copy link
Contributor Author

LGTM overall, aside from a minor comment about a minor out-of-date comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically I suppose that we could also add a filter if b is null, since null + 1 == null, leading to an empty join result for those rows? We can figure this out for a simple case like this, but I guess the logic is too complicated to apply to arbitrary expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We need to understand if an expression can generate null if the input is non-nullable.

@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39506 has finished for PR 7768 at commit c02fc3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
    • case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
    • class DefaultOptimizer extends Optimizer
    • case class FilterNullsInJoinKey(

@JoshRosen
Copy link
Contributor Author

Since this passed tests, I'm going to merge this into master to unblock the other null-related patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants