[SPARK-34514][SQL] Push down limit for LEFT SEMI and LEFT ANTI join #31630
Conversation
cc @wangyum, @maropu, @viirya, @HyukjinKwon and @cloud-fan for review if you have time, thanks.
```scala
        left = maybePushLocalLimit(exp, left),
        right = maybePushLocalLimit(exp, right))
    case LeftSemi | LeftAnti if conditionOpt.isEmpty =>
      join.copy(left = maybePushLocalLimit(exp, left))
```
hm, in this case, do we need the join itself?
```scala
scala> sql("select * from l1").show()
+----+
|  id|
+----+
|   1|
|   2|
|null|
+----+

scala> sql("select * from r1").show()
+----+
|  id|
+----+
|   2|
|null|
+----+

scala> sql("select * from l1 left semi join r1").show()
+----+
|  id|
+----+
|   1|
|   2|
|null|
+----+

scala> sql("select * from l1 left anti join r1").show()
+---+
| id|
+---+
+---+
```
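The all-or-nothing behavior shown in the shell session above can be simulated with plain Python lists. This is only an illustrative sketch of the semantics; `left_semi_no_cond` and `left_anti_no_cond` are hypothetical names, not Spark APIs:

```python
# Illustrative simulation of LEFT SEMI / LEFT ANTI join semantics when
# there is NO join condition (function names are hypothetical).

def left_semi_no_cond(left, right):
    # With no condition, every left row "matches" iff the right side has
    # at least one row: output all of the left side, or nothing.
    return list(left) if len(right) > 0 else []

def left_anti_no_cond(left, right):
    # LEFT ANTI is the complement: keep left rows only when no right row
    # matches, i.e. only when the right side is empty.
    return list(left) if len(right) == 0 else []

l1 = [1, 2, None]
r1 = [2, None]

print(left_semi_no_cond(l1, r1))  # [1, 2, None] -- like the semi join above
print(left_anti_no_cond(l1, r1))  # []           -- like the anti join above
```

Because the output is either all left rows or none, whether anything is emitted depends only on the right side's emptiness.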
I think we still need it. Whether to output all rows or nothing depends on whether the right side is empty, and this can only be known at runtime.
@maropu - this actually reminds me that we can further optimize during runtime, and I found I already did it for LEFT SEMI with AQE - #29484. Similarly, for LEFT ANTI join without condition, we can convert the join logical plan node to an empty relation if the right build side is not empty. Will submit a follow-up PR tomorrow.
In addition, after taking a deep look at BroadcastNestedLoopJoinExec (I had never looked closely at it because it's not popular in our environment), I found many places we can optimize:
- populate `outputOrdering` and `outputPartitioning` when possible, to avoid shuffle/sort in a later stage.
- shortcut for LEFT SEMI / LEFT ANTI in `defaultJoin()`, as we don't need to look through all rows when there's no join condition.
- code-gen the operator.
I will file an umbrella JIRA with minor priority and do it gradually.
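The `defaultJoin()` shortcut mentioned above could look roughly like the following. This is a hedged Python sketch of the control flow over plain lists, not Spark's implementation; the function name and parameters are hypothetical:

```python
def nested_loop_semi_anti(left_rows, right_rows, join_type, condition=None):
    """Sketch of a nested-loop LEFT SEMI / LEFT ANTI join that shortcuts
    when there is no join condition (hypothetical, not Spark code)."""
    if condition is None:
        # Shortcut: no need to scan the right side per left row.
        # The result is all left rows or nothing, depending only on
        # whether the right side is empty.
        right_nonempty = len(right_rows) > 0
        if join_type == "LEFT SEMI":
            return list(left_rows) if right_nonempty else []
        if join_type == "LEFT ANTI":
            return [] if right_nonempty else list(left_rows)
    # General path: scan the right side for each left row.
    out = []
    for l in left_rows:
        matched = any(condition(l, r) for r in right_rows)
        # SEMI keeps matched rows; ANTI keeps unmatched rows.
        if (join_type == "LEFT SEMI") == matched:
            out.append(l)
    return out
```

The shortcut turns an O(|left| * |right|) scan into a single emptiness check on the right side.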
Similarly for LEFT ANTI join without condition, we can convert join logical plan node to an empty relation if right build side is not empty. Will submit a followup PR tomorrow.
Ah, I see. That sounds reasonable. Nice idea, @c21 .
```scala
spark.range(5).toDF().repartition(1).write.saveAsTable("left_table")
spark.range(3).write.saveAsTable("nonempty_right_table")
spark.range(0).write.saveAsTable("empty_right_table")
Seq("LEFT SEMI").foreach { joinType =>
```
seems LEFT ANTI is missing
@cloud-fan - good catch, I accidentally removed it during debugging; fixed.
Kubernetes integration test starting

Kubernetes integration test status success
thanks, merging to master! |
|
Thank you all for review! |
|
Test build #135409 has finished for PR 31630 at commit
|
What changes were proposed in this pull request?
I discovered from review discussion - #31630 (comment) - that we can eliminate LEFT ANTI join (with no join condition) to an empty relation if the right side is known to be non-empty. So with AQE, this is doable, similar to #29484.

Why are the changes needed?
This can help eliminate the join operator during logical plan optimization. Before this PR, [left side physical plan `execute()` will be called](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L192), so if the left side is complicated (e.g. contains a broadcast exchange operator), then some computation would happen. However, after this PR, the join operator will be removed during logical planning, and nothing will be computed from the left side. Potentially it can save resources for these kinds of queries.

Does this PR introduce _any_ user-facing change?
No.

How was this patch tested?
Added unit tests for positive and negative queries in `AdaptiveQueryExecSuite.scala`.

Closes #31641 from c21/left-anti-aqe.
Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
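The runtime rewrite described in that follow-up (done properly via AQE in #31641) amounts to a small plan-rewrite rule. Below is a hedged Python sketch over toy plan nodes; the class and function names are hypothetical, not Spark's actual AQE code:

```python
from dataclasses import dataclass

@dataclass
class Relation:
    rows: list

@dataclass
class Join:
    left: object
    right: object
    join_type: str
    condition: object = None

def eliminate_left_anti(plan, right_row_count):
    """If a condition-less LEFT ANTI join's right side is known at runtime
    to be non-empty, the join produces nothing: replace it with an empty
    relation so the left side is never computed (hypothetical sketch)."""
    if (isinstance(plan, Join) and plan.join_type == "LEFT ANTI"
            and plan.condition is None and right_row_count > 0):
        return Relation(rows=[])
    return plan

j = Join(left=Relation([1, 2, 3]), right=Relation([7]), join_type="LEFT ANTI")
print(eliminate_left_anti(j, right_row_count=1))  # Relation(rows=[])
```

AQE makes this possible because the right side's row count is observed from completed query stages at runtime, not estimated at planning time.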
What changes were proposed in this pull request?
I found out during code review of #31567 (comment) that we can push down the limit to the left side of LEFT SEMI and LEFT ANTI joins if the join condition is empty.
Why it's safe to push down limit:
The semantics of LEFT SEMI join without condition:
(1). if right side is non-empty, output all rows from left side.
(2). if right side is empty, output nothing.
The semantics of LEFT ANTI join without condition:
(1). if right side is non-empty, output nothing.
(2). if right side is empty, output all rows from left side.
Given these all-or-nothing semantics (output all rows from the left side, or nothing), it's safe to push down the limit to the left side.
NOTE: LEFT SEMI / LEFT ANTI join with a non-empty condition is not safe for limit pushdown, because the output can be a subset of the left-side rows.
Reference: physical operator implementation for LEFT SEMI / LEFT ANTI join without condition - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204 .
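A minimal model of the optimization itself, as a hedged Python sketch over toy plan nodes (the node classes are hypothetical; Spark's actual rule lives in the `LimitPushDown` optimizer rule):

```python
from dataclasses import dataclass

@dataclass
class LocalLimit:
    n: int
    child: object

@dataclass
class Join:
    left: object
    right: object
    join_type: str
    condition: object = None

def push_limit_through_join(plan):
    """Push LocalLimit(n) into the left child of a condition-less
    LEFT SEMI / LEFT ANTI join. Safe because such a join emits either
    all left rows or none, so limiting the left input cannot change
    the first n output rows."""
    if (isinstance(plan, LocalLimit) and isinstance(plan.child, Join)
            and plan.child.join_type in ("LEFT SEMI", "LEFT ANTI")
            and plan.child.condition is None):
        join = plan.child
        limited = Join(LocalLimit(plan.n, join.left), join.right,
                       join.join_type, None)
        return LocalLimit(plan.n, limited)  # keep the outer limit as well
    return plan
```

With a non-empty condition the join may keep only some left rows, so limiting the left input first could drop rows that belong in the output; that is why the sketch requires `condition is None`.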
Why are the changes needed?
Better performance. Saves CPU and IO for these joins, as the limit is pushed down before the join.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests in `LimitPushdownSuite.scala` and `SQLQuerySuite.scala`.