[SPARK-45755][SQL] Improve Dataset.isEmpty() by applying global limit 1
#43617
Conversation
LGTM if tests passed.
+1, LGTM. Thank you, @wangyum and all.
I revised the PR title a little, @wangyum. You can change it back if you want.
Thank you, @dongjoon-hyun. The new PR title looks better than the previous one.
Merged to master. Thank you, @wangyum.
What changes were proposed in this pull request?
This PR makes `Dataset.isEmpty()` execute a global limit 1 first. The `LimitPushDown` optimizer rule may push the global limit 1 down to lower plan nodes, improving query performance. Note that a global limit 1 is used here because a local limit cannot be pushed down in the group-only case:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Lines 766 to 770 in 89ca8b6

Why are the changes needed?
Improve query performance.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Manual testing:

```scala
spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
```

[Screenshots in the original PR compare execution before and after this change.]

Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#43617 from wangyum/SPARK-45755.
Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
(cherry picked from commit c7bba9b)
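For context, the idea behind the change can be sketched as follows. This is a minimal illustration under the assumption that emptiness checking can be expressed through a global `limit(1)` before collecting; it is not the actual Spark patch, and the object and method names below are hypothetical:

```scala
// Hypothetical sketch (not the actual Spark patch): express an emptiness check
// through a global limit-1 plan, so the optimizer's LimitPushDown rule can push
// the limit below unions, projections, etc. and short-circuit expensive operators.
import org.apache.spark.sql.{Dataset, SparkSession}

object IsEmptySketch {
  // Equivalent in spirit to Dataset.isEmpty() after this PR: apply a global
  // limit 1 first, then check whether any row survives.
  def isEmptyViaGlobalLimit[T](ds: Dataset[T]): Boolean =
    ds.limit(1).count() == 0L

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("isEmpty-sketch")
      .getOrCreate()
    try {
      val nonEmpty = spark.range(1000000L)
      val empty = nonEmpty.filter("id < 0")
      println(isEmptyViaGlobalLimit(nonEmpty)) // false
      println(isEmptyViaGlobalLimit(empty))    // true
    } finally {
      spark.stop()
    }
  }
}
```

The key design point the PR exploits is that `limit(1)` produces a `GlobalLimit` node in the logical plan, which the optimizer can relocate; a plain scan-everything emptiness check gives the optimizer nothing to push down.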