Conversation

@wangyum (Member) commented Nov 1, 2023

What changes were proposed in this pull request?

This PR makes Dataset.isEmpty() execute a global limit 1 first. LimitPushDown may then push the limit down to lower nodes, improving query performance.

Note that we use a global limit 1 here because a local limit cannot be pushed down in the group-only case:

```scala
// Push down limit 1 through Aggregate and turn Aggregate into Project if it is group only.
case Limit(le @ IntegerLiteral(1), a: Aggregate) if a.groupOnly =>
  Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child)))
case Limit(le @ IntegerLiteral(1), p @ Project(_, a: Aggregate)) if a.groupOnly =>
  Limit(le, p.copy(child = Project(a.aggregateExpressions, LocalLimit(le, a.child))))
```
(from [Optimizer.scala](https://github.com/apache/spark/blob/89ca8b6065e9f690a492c778262080741d50d94d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L766-L770))
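
To see why a global limit is needed, consider a group-only aggregate such as `SELECT DISTINCT`. A minimal sketch, assuming a local `SparkSession` (the session setup and the plan text in the comments are illustrative, not part of this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// SELECT DISTINCT plans a group-only Aggregate: its output is exactly its
// grouping keys, so the rule above can rewrite the Aggregate into a Project.
val distinctDf = spark.range(0L, 100L).selectExpr("id % 10 AS k").distinct()

// isEmpty() now evaluates the equivalent of limit(1) before taking one row.
// With the *global* limit 1 on top, the rule fires and a LocalLimit(1) is
// pushed below the former Aggregate; a bare local limit would not match.
distinctDf.limit(1).explain(true)
// Optimized plan, roughly:
//   GlobalLimit 1
//   +- LocalLimit 1
//      +- Project [(id % 10) AS k]
//         +- LocalLimit 1
//            +- Range (0, 100, ...)
```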

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual testing:

```scala
spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
```
Before this PR | After this PR
-- | --
<img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/417adc05-4160-4470-b63c-125faac08c9c"> | <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/bdeff231-e725-4c55-9da2-1b4cd59ec8c8">
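
The speedup comes from the optimized plan shape: `UNION` (unlike `UNION ALL`) deduplicates through a group-only `Aggregate` on top of the union, so the rewrite quoted above fires and `LimitPushDown` can then place a `LocalLimit 1` on each side of the union. A sketch of how to inspect this, assuming the tables `t1` and `t2` from the snippet above exist (the `explain` call is illustrative, not part of this PR):

```scala
// Same query shape as the benchmark, inspected instead of executed.
val q = spark.sql(
  """SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e
    |UNION
    |SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e""".stripMargin)

// isEmpty() now runs the equivalent of q.limit(1) and takes a single row; the
// optimized plan for that should show LocalLimit 1 above each union child
// instead of fully exploding and deduplicating both tables.
q.limit(1).explain(true)
```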

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Nov 1, 2023
@beliefer (Contributor) left a comment

LGTM if tests passed.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @wangyum and all.

@dongjoon-hyun changed the title from "[SPARK-45755][SQL] Push down limit through Dataset.isEmpty()" to "[SPARK-45755][SQL] Improve Dataset.isEmpty() by applying global limit 1" Nov 1, 2023
@dongjoon-hyun (Member) commented

I revised the PR title a little, @wangyum . You can change it back if you want.

@wangyum (Member, Author) commented Nov 1, 2023

Thank you, @dongjoon-hyun. The new PR title looks better than the previous one.

@beliefer closed this in c7bba9b Nov 1, 2023
@beliefer (Contributor) commented Nov 1, 2023

Merged to master. Thank you, @wangyum.
Thank you @dongjoon-hyun, @HyukjinKwon, and @yaooqinn too.

@wangyum deleted the SPARK-45755 branch November 1, 2023 12:55
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…it `1` (apache#251)

Closes apache#43617 from wangyum/SPARK-45755.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
(cherry picked from commit c7bba9b)