Conversation

@nongli (Contributor) commented Dec 8, 2015

Some join types and conditions imply that the join keys cannot be NULL, so
NULL rows can be filtered out by the children. This patch does this for inner
joins and introduces a mechanism for generating predicates. The complex part
is making sure the transformation is stable: we want to avoid generating a
filter in the join, having that filter pushed down, and then having the join
regenerate it.

This patch solves this by having the join remember the predicates it has
generated. The mechanism should be general enough to infer other predicates
as well: for example, "a join b where a.id = b.id AND a.id = 10" could also
use it to generate the predicate "b.id = 10".
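
To make the idea concrete, here is a minimal, self-contained sketch. Everything below (Expr, Plan, NullFiltering, addNullFilters) is a simplified stand-in for illustration, not Catalyst's actual classes or the code in this patch:

```scala
// Simplified stand-in AST; not Catalyst's real classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class IsNotNull(child: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr

sealed trait Plan { def output: Set[Attr] }
case class Relation(name: String, output: Set[Attr]) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan {
  def output: Set[Attr] = child.output
}
case class Join(left: Plan, right: Plan, cond: Expr) extends Plan {
  def output: Set[Attr] = left.output ++ right.output
}

object NullFiltering {
  // Attributes compared with "=" in an inner-join condition can never
  // match a NULL row, so NULLs can be filtered out below the join.
  private def joinKeys(e: Expr): Set[Attr] = e match {
    case EqualTo(a: Attr, b: Attr) => Set(a, b)
    case And(l, r)                 => joinKeys(l) ++ joinKeys(r)
    case _                         => Set.empty
  }

  // Wrap each child in an IsNotNull filter on the keys it produces.
  def addNullFilters(j: Join): Join = {
    def guard(p: Plan, keys: Set[Attr]): Plan = {
      val mine = keys.intersect(p.output)
      if (mine.isEmpty) p
      else Filter(mine.map(IsNotNull(_)).reduce[Expr](And(_, _)), p)
    }
    val keys = joinKeys(j.cond)
    j.copy(left = guard(j.left, keys), right = guard(j.right, keys))
  }
}
```

The stability problem described above is visible here: once the optimizer pushes the generated filters further down, a naive rule would not recognize them on a second pass and would generate them again. Remembering the generated predicates in the join is what breaks that loop.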

SparkQA commented Dec 9, 2015

Test build #47375 has finished for PR 10209 at commit 565d15e.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 10, 2015

Test build #47488 has finished for PR 10209 at commit 00e957c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor commented:

nit: we can use if !j.selfJoinResolved

@cloud-fan (Contributor) commented:

This looks very cool!
But I'm a little worried about making the Join stateful. How about adding a new batch to the Optimizer and executing it only Once? That would avoid having the join regenerate the filter, since the rule would run after all other optimization rules.
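
For reference, a hedged sketch of what that might look like. Batch, Once, and FixedPoint are real members of Catalyst's RuleExecutor; the rule and optimizer names here are hypothetical stubs:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// Hypothetical rule; the actual inference logic is elided.
object InferJoinKeyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan
}

object MyOptimizer extends RuleExecutor[LogicalPlan] {
  // Running this batch with Once (rather than FixedPoint), placed
  // after the existing optimization batches, means the filters are
  // generated a single time; nothing runs afterwards that could push
  // them down and trigger regeneration.
  val batches =
    Batch("Infer join key filters", Once, InferJoinKeyFilters) :: Nil
}
```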

Contributor commented:

This is a semi-public API, because I think some advanced projects do dig into Catalyst, and we've never changed the signature of something as basic as Join before. Could we do this instead by fixing nullability propagation and only inserting the filter if the attribute is nullable?

Contributor (PR author) commented:

I started down that path, but can you think of a way to make that handle the more general case of predicate propagation?

t1.key join t2.key where t1.key = t2.key and t1.key = 5

How do we generate the predicate t2.key = 5? How do we make this more general?

Contributor commented:

In an earlier version of Catalyst we also had equivalence classes propagating up the logical plans. Would that give you enough information?

Contributor (PR author) commented:

Equivalence classes are one thing; I think we can compute those without a problem. The issue is how to remember that t2.key = 5 was generated so that we don't generate it again. The trick of setting nullable doesn't work here. We could maintain value constraints (of which nullability is a subset).

Contributor commented:

Couldn't Literal(5) be in the equivalence class, and we could check for that?

That said, I also like the idea of more general value constraints.
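
A possible shape for this, as a hypothetical sketch (AttrRef, Lit, and EquivClass are illustrative names, not Catalyst classes):

```scala
sealed trait Term
case class AttrRef(name: String) extends Term
case class Lit(value: Any) extends Term

// An equivalence class mixing attributes and literals. From
// "t1.key = t2.key AND t1.key = 5" we would build one class:
//   EquivClass(Set(AttrRef("t1.key"), AttrRef("t2.key"), Lit(5)))
case class EquivClass(members: Set[Term]) {
  // Every attribute that shares a class with a literal yields an
  // "attr = literal" predicate, e.g. t2.key = 5.
  def derivedEqualities: Set[(AttrRef, Lit)] =
    for {
      a <- members.collect { case a: AttrRef => a }
      l <- members.collect { case l: Lit => l }
    } yield (a, l)
}
```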

Contributor (PR author) commented:

Literal(5) would work for equality, but we want to track more than equality. If it were t1.key join t2.key where t1.key = t2.key and t1.key > 5, we'd similarly want to add t2.key > 5.

Are you suggesting we don't change the operator, and instead walk the tree bottom-up to collect these constraints? That seems extremely expensive to do.

Contributor commented:

I was thinking operators would propagate the set of constraints up from their children (augmenting or clearing them as appropriate), and we'd save it in a lazy val, as in the sketch below.
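
A rough sketch of that idea (the names below are illustrative; TreeNode stands in for Catalyst's plan nodes):

```scala
sealed trait Constraint
case class NotNull(attr: String) extends Constraint
case class EqualsLit(attr: String, value: Any) extends Constraint

trait TreeNode {
  def children: Seq[TreeNode]

  // Constraints this operator itself introduces, e.g. a Filter on
  // "key = 5" would contribute EqualsLit("key", 5).
  protected def ownConstraints: Set[Constraint] = Set.empty

  // Constraints that hold for this operator's output, computed once
  // per node and cached. A rule that wants to generate "t2.key = 5"
  // first checks whether that constraint is already present, which
  // keeps the transformation stable without making Join stateful.
  lazy val constraints: Set[Constraint] =
    children.flatMap(_.constraints).toSet ++ ownConstraints
}
```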

SparkQA commented Dec 10, 2015

Test build #47531 has finished for PR 10209 at commit fb562fb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented May 6, 2016

Is this already fixed by #7768?
