[SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row #30368
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Kubernetes integration test starting
Test build #131054 has finished for PR 30368 at commit
Kubernetes integration test status failure
@@ -1,5 +1,5 @@
 == Physical Plan ==
-TakeOrderedAndProject (34)
+* Sort (34)
The change from TakeOrderedAndProject to Sort seems to be because the Limit after the Sort is removed? It might add an additional shuffle for the global Sort.
This is the q92 SQL:
SELECT sum(ws_ext_discount_amt) AS `Excess Discount Amount `
FROM web_sales, item, date_dim
WHERE i_manufact_id = 350
AND i_item_sk = ws_item_sk
AND d_date BETWEEN '2000-01-27' AND (cast('2000-01-27' AS DATE) + INTERVAL 90 days)
AND d_date_sk = ws_sold_date_sk
AND ws_ext_discount_amt >
(
SELECT 1.3 * avg(ws_ext_discount_amt)
FROM web_sales, date_dim
WHERE ws_item_sk = i_item_sk
AND d_date BETWEEN '2000-01-27' AND (cast('2000-01-27' AS DATE) + INTERVAL 90 days)
AND d_date_sk = ws_sold_date_sk
)
ORDER BY sum(ws_ext_discount_amt)
LIMIT 100
Yes, Limit after Sort is a special case: we convert it to TakeOrderedAndProject. But it seems unnecessary to do both the sort and the limit if the child's maxRows <= 1. Maybe we can add another check, something like: if sort.child.maxRows <= 1, then remove the Sort?
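That suggested check could look roughly like the toy sketch below. These `Plan`/`Sort`/`OneRowAgg`/`Table` classes are hypothetical stand-ins, not Spark's real API (Spark's `LogicalPlan` exposes `maxRows: Option[Long]`):

```scala
// Toy model (hypothetical classes, not Spark's API) of the idea above:
// a Sort whose child yields at most one row is a no-op and can be dropped.
sealed trait Plan { def maxRows: Option[Long] }
case class OneRowAgg() extends Plan { def maxRows = Some(1L) } // e.g. sum() with no GROUP BY
case class Table(rows: Long) extends Plan { def maxRows = Some(rows) }
case class Sort(child: Plan) extends Plan { def maxRows = child.maxRows }

def eliminateSorts(plan: Plan): Plan = plan match {
  case Sort(child) if child.maxRows.exists(_ <= 1) => child // sorting <= 1 row changes nothing
  case other => other
}
```

Sorting a single aggregated row (as in q92's `ORDER BY sum(...)` over a global aggregate) is a no-op, so on this reading both the Sort and the Limit could go.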
Another thought: we could add this pattern

case sort @ Sort(order, true, child)
    if sort.maxRows.exists(_ < conf.topKSortFallbackThreshold) =>
  TakeOrderedAndProjectExec(...)

This way, we can infer a TakeOrderedAndProjectExec from a Sort that has no Limit after it.
What do you think about this? @maropu @viirya @cloud-fan
Ah, @viirya, nice catch! Yeah, how about simply excluding the case in the EliminateLimits rule? @ulysses-you
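One hedged sketch of that exclusion, again with toy classes rather than Spark's real `LogicalPlan` API: the rule skips a Limit sitting directly over a Sort, so the planner can still turn that pair into TakeOrderedAndProject.

```scala
// Toy model (hypothetical classes, not Spark's API) of excluding Limit-after-Sort
// from elimination, so the planner can still choose TakeOrderedAndProject.
sealed trait Plan { def maxRows: Option[Long] }
case class Relation(rows: Long) extends Plan { def maxRows = Some(rows) }
case class Sort(child: Plan) extends Plan { def maxRows = child.maxRows }
case class Limit(limit: Long, child: Plan) extends Plan {
  def maxRows = Some(child.maxRows.fold(limit)(math.min(_, limit)))
}

def eliminateLimits(plan: Plan): Plan = plan match {
  case keep @ Limit(_, _: Sort) => keep // excluded: leave it for TakeOrderedAndProject
  case Limit(n, child) if child.maxRows.exists(_ <= n) => child // redundant limit
  case other => other
}
```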
Updated.
cloud-fan left a comment:
LGTM
Hi @ulysses-you, can you put the conclusion of #30368 (comment) in the PR description, to mention that we may end up replacing TakeOrderedAndProject with Sort + Project?
@cloud-fan updated the description.
thanks, merging to master!
thanks for merging! |
What changes were proposed in this pull request?
Change the CombineLimits name to EliminateLimits and add a check if the Limit child's max row <= limit.

Why are the changes needed?
In ad-hoc scenarios, we always add a limit to the query when the user gives no explicit limit value, but not every limit is necessary.
A general negative example is
It would be great if we could eliminate the limit on the Spark side.
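The core idea can be sketched with toy classes (hypothetical stand-ins, not Spark's real `LogicalPlan.maxRows` API): a global aggregate returns exactly one row, so any LIMIT above it is redundant and can be dropped.

```scala
// Toy model (hypothetical classes, not Spark's API): drop a Limit when the
// child can never produce more rows than the limit allows.
sealed trait Plan { def maxRows: Option[Long] }
case class Aggregate(groupingExprs: Int) extends Plan {
  // A global aggregate (no grouping expressions) returns exactly one row.
  def maxRows = if (groupingExprs == 0) Some(1L) else None
}
case class Limit(limit: Long, child: Plan) extends Plan {
  def maxRows = Some(child.maxRows.fold(limit)(math.min(_, limit)))
}

def eliminateLimits(plan: Plan): Plan = plan match {
  case Limit(n, child) if child.maxRows.exists(_ <= n) => child // limit is redundant
  case other => other
}
```

For example, wrapping a global aggregate in a large limit (the ad-hoc "always add a limit" pattern) would be stripped, while a limit over a grouped aggregate with unknown row count is kept.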
Also, we made a benchmark for this case
and the result is
It shows that it makes sense to replace TakeOrderedAndProjectExec with Sort + Project.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add test.