Skip to content

Conversation

@gatorsmile
Copy link
Member

This PR is another enhancement to Optimizer. It does not conflict with the other PRs (#10567 and #10551).

Given an outer join (OJ) is involved in another join (called parent join PJ), when the join type of PJ is inner, left-semi, left-outer and right-outer, checking if the join condition of the PJ satisfies the following two conditions:

  1. there exist null filtering predicates against the columns in the null-supplying side of PJ.
  2. these columns are from OJ.

If having such join predicates, execute the elimination rules:

  • full outer -> inner if both sides of OJ have such predicates
  • left outer -> inner if the right side of OJ has such predicates
  • right outer -> inner if the left side of OJ has such predicates
  • full outer -> left outer if only the left side of OJ has such predicates
  • full outer -> right outer if only the right side of OJ has such predicates

If applicable, this can greatly improve the performance, since outer join is much slower than inner join, full outer join is much slower than left/right outer join.

BTW, since the rule is different from the rule in #10567, I did not merge them in the same one for simplifying the code review.

@gatorsmile gatorsmile changed the title Outer join elimination by parent join predicate Outer Join Elimination by Parent Join Condition Jan 4, 2016
@gatorsmile gatorsmile changed the title Outer Join Elimination by Parent Join Condition [SPARK-12613] [SQL] Outer Join Elimination by Parent Join Condition Jan 4, 2016
@SparkQA
Copy link

SparkQA commented Jan 4, 2016

Test build #48631 has finished for PR 10566 at commit d6a6e9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Jan 4, 2016

btw i created this: https://issues.apache.org/jira/browse/SPARK-12616

seems like something you can do?

@gatorsmile
Copy link
Member Author

Sure, I can make a try! Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented on the other PR, I think we should have a more general way to infer null propagation / filtering. Maybe you can discuss with @sameeragarwal and then update these PRs after his machinery is available.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do. Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile similar to #10566, I think we should now be just able to apply this optimization rule more generally along the lines of:

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _)) =>
      Filter(condition, buildNewJoin(f, j))

    // Case 1: when parent join is Inner|LeftSemi|LeftOuter and the child join is on the right side
    case pj @ Join(pLeft, j @ Join(left, right, RightOuter|LeftOuter|FullOuter, condition), Inner|LeftSemi|LeftOuter, Some(pJoinCond)) =>
      Join(pLeft, buildNewJoin(pj, j), pj.joinType, Some(pJoinCond))

    // Case 2: when parent join is Inner|LeftSemi|RightOuter and the child join is on the left side
    case pj @ Join(j @ Join(left, right, RightOuter|LeftOuter|FullOuter, condition), pRight, Inner|LeftSemi|RightOuter, Some(pJoinCond)) =>
      Join(buildNewJoin(pj, j), pRight, pj.joinType, Some(pJoinCond))
  }

Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do the changes. Thank you!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sameeragarwal Unfortunately, they are unable to share the same buildNewJoin function.

For example, if the parent join is full outer, the parent join will not have any IsNotNull constraint. In the current constraint propagation, its constraints is Set.empty[Expression]. However, the join condition of this parent join still can be used for outer join elimination of the child join.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me do the outer join elimination by Filter at first. That one can directly use the existing infrastructure of constraint propagation. #10567 Thanks!

asfgit pushed a commit that referenced this pull request Jan 29, 2016
Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).

After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: #10566

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #10630 from gatorsmile/IntersectBySemiJoin.
@marmbrus
Copy link
Contributor

To you want to update this now?

@gatorsmile
Copy link
Member Author

Will do it tonight. Thanks!

@gatorsmile
Copy link
Member Author

First, will add test cases to OuterJoinEliminationSuite tomorrow.

Second, the current fix does not cover all the possible cases. I need to get the inputs from you about the issues this PR is facing:

    val df = Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str").as("a")
    val df2 = Seq((1, 2, "1"), (5, 6, "5")).toDF("int", "int2", "str").as("b")
    val df3 = Seq((1, 3, "1"), (4, 6, "5")).toDF("int", "int2", "str").as("c")

    // Full -> Left
    val full2Left = df.join(df2, $"a.int" === $"b.int", "full")
      .join(df3, $"c.int" === $"a.int", "right").select($"a.*", $"b.*", $"c.*")

In the above case, the parent join condition $"c.int" === $"a.int" is not eligible for the two ways we are currently using to decide if the predicates are null filtering.

  1. The first way is based on the constraints. If the parent join is full outer, the parent join will not have any IsNotNull constraint. In the current constraint propagation, its constraints is Set.empty[Expression]. Thus, $"c.int" === $"a.int" is not eligible for using the first way.
  2. The second way is based on the run-time evaluation, canFilterOutNull. This requires that all the attributes are from the same side. In the predicate $"c.int" === $"a.int", $"a.int" is from the left side, but $"c.int" is not from the left side. (Actually, $"c.int" is from the other side in the parent join.) Thus, it is ineligible for the second way too.

However, the parent join condition $"c.int" === $"a.int" is very common in the join condition. We definitely can use such predicates as null-filtering predicates. Maybe we can keep the original way as the third way, as shown in the following link:
https://github.com/gatorsmile/spark/blob/d6a6e9cc31b0f7547b35cf25884135ea65b03676/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L798-L799

Does that look good to you? Thanks! : )

@SparkQA
Copy link

SparkQA commented Feb 25, 2016

Test build #51943 has finished for PR 10566 at commit 1a9ebdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

The first way is based on the constraints. If the parent join is full outer, the parent join will not have any IsNotNull constraint. In the current constraint propagation, its constraints is Set.empty[Expression]. Thus, $"c.int" === $"a.int" is not eligible for using the first way.

Why isn't the constraint present? We should fix that instead of inventing another unrelated way to reason about nullability.

@gatorsmile
Copy link
Member Author

The existing constraint propagation is bottom up. The join conditions of full-outer joins will not filter out NULL in the outputs of this Join.

Here, it is top down. The join conditions of full-outer joins can filter out the NULL of the child outer joins.

Will open a separate PR for top-down constraint propagation. Thanks for your suggestions!

@gatorsmile
Copy link
Member Author

: ) Basically, top-down constraint propagation has been done in the optimizer rules:

  •   PushPredicateThroughJoin,
    
  •   PushPredicateThroughProject,
    
  •   PushPredicateThroughGenerate,
    
  •   PushPredicateThroughAggregate,
    

Plan to add a new rule in optimizer for NULL constraints pushdown.

@gatorsmile
Copy link
Member Author

After more thinking, in my opinion, the best way is to add extra Filter between two Join and let the existing Filter-condition-based rule to do outer join elimination, but we need another rule to remove unnecessary Filter which only contains Null constraints. This new rule can be more general.

Let me first create a PR to do Filter removal/cleaning.

Update: #11406 is created.

@rxin
Copy link
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. We can also continue the discussion on the JIRA ticket.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants