Conversation

@nongli (Contributor) commented Dec 8, 2015

Some join types and conditions imply that the join keys cannot be NULL, so
NULL rows can be filtered out by the children. This patch does this for inner
joins and introduces a mechanism for generating predicates. The complex part
is making sure the transformation is stable: we want to avoid generating a
filter in the join, having that filter pushed down, and then having the join
regenerate it.

This patch solves this by having the join remember the predicates it has
generated. The mechanism should be general enough to infer other predicates
as well: for example, "a join b where a.id = b.id AND a.id = 10" could also
use it to generate the predicate "b.id = 10".
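
To make the idea concrete, here is a minimal, self-contained sketch. Everything below (Expr, Plan, NullFiltering, addNullFilters) is a simplified stand-in for illustration, not Catalyst's actual classes or the code in this patch:

```scala
// Simplified stand-in AST; not Catalyst's real classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class IsNotNull(child: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr

sealed trait Plan { def output: Set[Attr] }
case class Relation(name: String, output: Set[Attr]) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan {
  def output: Set[Attr] = child.output
}
case class Join(left: Plan, right: Plan, cond: Expr) extends Plan {
  def output: Set[Attr] = left.output ++ right.output
}

object NullFiltering {
  // Attributes compared with "=" in an inner-join condition can never
  // match a NULL row, so NULLs can be filtered out below the join.
  private def joinKeys(e: Expr): Set[Attr] = e match {
    case EqualTo(a: Attr, b: Attr) => Set(a, b)
    case And(l, r)                 => joinKeys(l) ++ joinKeys(r)
    case _                         => Set.empty
  }

  // Wrap each child in an IsNotNull filter on the keys it produces.
  def addNullFilters(j: Join): Join = {
    def guard(p: Plan, keys: Set[Attr]): Plan = {
      val mine = keys.intersect(p.output)
      if (mine.isEmpty) p
      else Filter(mine.map(IsNotNull(_)).reduce[Expr](And(_, _)), p)
    }
    val keys = joinKeys(j.cond)
    j.copy(left = guard(j.left, keys), right = guard(j.right, keys))
  }
}
```

The stability problem described above is visible here: once the optimizer pushes the generated filters further down, a naive rule would not recognize them on a second pass and would generate them again. Remembering the generated predicates in the join is what breaks that loop.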

SparkQA commented Dec 9, 2015

Test build #47375 has finished for PR 10209 at commit 565d15e.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 10, 2015

Test build #47488 has finished for PR 10209 at commit 00e957c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor commented:

nit: we can use if !j.selfJoinResolved

@cloud-fan (Contributor) commented:

This looks very cool!
But I'm a little worried about making the Join stateful. How about adding a new batch to the Optimizer and executing it only Once? That would avoid having the join regenerate the filter, since the rule would run after all other optimization rules.
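
For reference, a hedged sketch of what that might look like. Batch, Once, and FixedPoint are real members of Catalyst's RuleExecutor; the rule and optimizer names here are hypothetical stubs:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// Hypothetical rule; the actual inference logic is elided.
object InferJoinKeyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan
}

object MyOptimizer extends RuleExecutor[LogicalPlan] {
  // Running this batch with Once (rather than FixedPoint), placed
  // after the existing optimization batches, means the filters are
  // generated a single time; nothing runs afterwards that could push
  // them down and trigger regeneration.
  val batches =
    Batch("Infer join key filters", Once, InferJoinKeyFilters) :: Nil
}
```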

Contributor commented:

This is a semi-public API, because I think some advanced projects do dig into Catalyst, and we've never changed the signature of something as basic as Join before. Could we do this instead by fixing nullability propagation and only inserting the filter if the attribute is nullable?

Contributor (PR author) commented:

I started down that path, but can you think of a way to make that handle the more general case of predicate propagation?

t1.key join t2.key where t1.key = t2.key and t1.key = 5

How do we generate the predicate t2.key = 5? How do we make this more general?

Contributor commented:

In an earlier version of Catalyst we also had equivalence classes propagating up the logical plans. Would that give you enough information?

Contributor (PR author) commented:

Equivalence classes are one thing; I think we can compute those without a problem. The issue is how to remember that t2.key = 5 was generated so that we don't generate it again. The trick of setting nullable doesn't work here. We could maintain value constraints (of which nullability is a subset).

Contributor commented:

Couldn't Literal(5) be in the equivalence class, and we could check for that?

That said, I also like the idea of more general value constraints.
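
A possible shape for this, as a hypothetical sketch (AttrRef, Lit, and EquivClass are illustrative names, not Catalyst classes):

```scala
sealed trait Term
case class AttrRef(name: String) extends Term
case class Lit(value: Any) extends Term

// An equivalence class mixing attributes and literals. From
// "t1.key = t2.key AND t1.key = 5" we would build one class:
//   EquivClass(Set(AttrRef("t1.key"), AttrRef("t2.key"), Lit(5)))
case class EquivClass(members: Set[Term]) {
  // Every attribute that shares a class with a literal yields an
  // "attr = literal" predicate, e.g. t2.key = 5.
  def derivedEqualities: Set[(AttrRef, Lit)] =
    for {
      a <- members.collect { case a: AttrRef => a }
      l <- members.collect { case l: Lit => l }
    } yield (a, l)
}
```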

Contributor (PR author) commented:

Literal(5) would work for equality, but we want to track more than equality. If it were t1.key join t2.key where t1.key = t2.key and t1.key > 5, we'd similarly want to add t2.key > 5.

Are you suggesting we don't change the operator, and instead walk the tree bottom-up to collect these constraints? That seems extremely expensive to do.

Contributor commented:

I was thinking operators would propagate the set of constraints up from their children (augmenting or clearing them as appropriate), and we'd save it in a lazy val, as in the sketch below.
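
A rough sketch of that idea (the names below are illustrative; TreeNode stands in for Catalyst's plan nodes):

```scala
sealed trait Constraint
case class NotNull(attr: String) extends Constraint
case class EqualsLit(attr: String, value: Any) extends Constraint

trait TreeNode {
  def children: Seq[TreeNode]

  // Constraints this operator itself introduces, e.g. a Filter on
  // "key = 5" would contribute EqualsLit("key", 5).
  protected def ownConstraints: Set[Constraint] = Set.empty

  // Constraints that hold for this operator's output, computed once
  // per node and cached. A rule that wants to generate "t2.key = 5"
  // first checks whether that constraint is already present, which
  // keeps the transformation stable without making Join stateful.
  lazy val constraints: Set[Constraint] =
    children.flatMap(_.constraints).toSet ++ ownConstraints
}
```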

SparkQA commented Dec 10, 2015

Test build #47531 has finished for PR 10209 at commit fb562fb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented May 6, 2016

Is this already fixed by #7768?
