[SPARK-45755][SQL] Improve Dataset.isEmpty() by applying global limit 1
#43617
Conversation
LGTM if tests passed.
+1, LGTM. Thank you, @wangyum and all.
I revised the PR title a little, @wangyum. You can change it back if you want.
Thank you, @dongjoon-hyun. The new PR title looks better than the previous one.
Merged to master. Thank you, @wangyum.
What changes were proposed in this pull request?
This PR makes `Dataset.isEmpty()` execute a global limit 1 first. The `LimitPushDown` optimizer rule may push the global limit 1 down to lower plan nodes, improving query performance. Note that a global limit 1 is used here because a local limit cannot be pushed down in the group-only case:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Lines 766 to 770 in 89ca8b6

Why are the changes needed?
Improve query performance.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Manual testing:

```scala
spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
```

[Screenshots in the original PR compare execution before and after this change.]

Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#43617 from wangyum/SPARK-45755.
Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
(cherry picked from commit c7bba9b)
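For context, the idea behind the change can be sketched as follows. This is a minimal illustration under the assumption that emptiness checking can be expressed through a global `limit(1)` before collecting; it is not the actual Spark patch, and the object and method names below are hypothetical:

```scala
// Hypothetical sketch (not the actual Spark patch): express an emptiness check
// through a global limit-1 plan, so the optimizer's LimitPushDown rule can push
// the limit below unions, projections, etc. and short-circuit expensive operators.
import org.apache.spark.sql.{Dataset, SparkSession}

object IsEmptySketch {
  // Equivalent in spirit to Dataset.isEmpty() after this PR: apply a global
  // limit 1 first, then check whether any row survives.
  def isEmptyViaGlobalLimit[T](ds: Dataset[T]): Boolean =
    ds.limit(1).count() == 0L

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("isEmpty-sketch")
      .getOrCreate()
    try {
      val nonEmpty = spark.range(1000000L)
      val empty = nonEmpty.filter("id < 0")
      println(isEmptyViaGlobalLimit(nonEmpty)) // false
      println(isEmptyViaGlobalLimit(empty))    // true
    } finally {
      spark.stop()
    }
  }
}
```

The key design point the PR exploits is that `limit(1)` produces a `GlobalLimit` node in the logical plan, which the optimizer can relocate; a plain scan-everything emptiness check gives the optimizer nothing to push down.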