[SPARK-36183][SQL] Push down limit 1 through Aggregate if it is group only #33397

wangyum · 2021-07-16T16:00:17Z

What changes were proposed in this pull request?

Push down limit 1 through Aggregate, and turn the Aggregate into a Project, if the Aggregate is group only (its output contains only grouping expressions and no aggregate functions). For example:

create table t1 using parquet as select id from range(100000000L);
create table t2 using parquet as select id from range(100000000L);
create view v1 as select * from t1 union select * from t2;
select * from v1 limit 1;
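
The rewrite can be illustrated with a toy plan model. This is a minimal sketch in Python, not Spark's Scala implementation: the classes `Relation`, `Project`, `Aggregate`, and `Limit`, and the function `push_limit_one`, are hypothetical stand-ins for Catalyst's logical operators and the optimizer rule. The idea it shows is that a group-only Aggregate only deduplicates rows, so when a single row is requested, any one input row projected onto the grouping columns is a valid answer. In the SQL above, `union` deduplicates, which is presumably why the view is planned with a group-only Aggregate.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Aggregate:
    grouping: Tuple[str, ...]
    outputs: Tuple[str, ...]  # output expressions, modelled as plain strings
    child: "Plan"

    def group_only(self) -> bool:
        # Toy definition: every output is one of the grouping columns,
        # so the node computes no aggregate functions.
        return all(o in self.grouping for o in self.outputs)


@dataclass(frozen=True)
class Limit:
    n: int
    child: "Plan"


Plan = Union[Relation, Project, Aggregate, Limit]


def push_limit_one(plan: Plan) -> Plan:
    """Rewrite Limit(1, group-only Aggregate) into a Project over a limited child."""
    if isinstance(plan, Limit) and plan.n == 1:
        agg = plan.child
        if isinstance(agg, Aggregate) and agg.group_only():
            # One input row is enough to produce one distinct output row.
            return Limit(1, Project(agg.outputs, Limit(1, agg.child)))
    return plan


# A DISTINCT-style aggregate is rewritten.
distinct_ids = Aggregate(("id",), ("id",), Relation("t"))
rewritten = push_limit_one(Limit(1, distinct_ids))

# An aggregate with a real aggregate function is left alone.
count_plan = Limit(1, Aggregate(("id",), ("id", "count(1)"), Relation("t")))
unchanged = push_limit_one(count_plan)
```

In this sketch the outer `Limit` is kept, so the result is still capped at one row even if the limited child is evaluated per partition. An aggregate that computes `count(1)` is not rewritten, because its value depends on every row in the group.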

[Comparison table: Before this PR | After this PR (table contents not captured)]

Why are the changes needed?

Improve query performance. This is a real case from the cluster:

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

SparkQA · 2021-07-16T17:40:24Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45676/

SparkQA · 2021-07-16T21:41:01Z

Test build #141165 has finished for PR 33397 at commit 11c266b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

SparkQA · 2021-07-19T14:44:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45772/

SparkQA · 2021-07-19T15:17:26Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45772/

SparkQA · 2021-07-19T16:39:51Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45780/

SparkQA · 2021-07-19T17:17:44Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45780/

SparkQA · 2021-07-19T18:42:09Z

Test build #141258 has finished for PR 33397 at commit c630a63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-19T20:32:13Z

Test build #141266 has finished for PR 33397 at commit ec3df29.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

…mizer/Optimizer.scala Co-authored-by: Wenchen Fan <[email protected]>

SparkQA · 2021-07-20T10:03:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45825/

SparkQA · 2021-07-20T10:48:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45825/

wangyum · 2021-07-20T12:25:33Z

Merged to master.

SparkQA · 2021-07-20T13:06:07Z

Test build #141311 has finished for PR 33397 at commit 1d60d64.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2021-07-26T03:57:21Z

The benchmark result:

[Comparison table: Before this PR | After this PR (table contents not captured)]

…oject

### What changes were proposed in this pull request?

This is a follow-up of #33397 to avoid sub-optimal plans. After converting `Aggregate` to `Project`, there is information lost: `Aggregate` doesn't care about the data order of inputs, but `Project` cares. `EliminateSorts` can remove `Sort` below `Aggregate`, but it doesn't work anymore if we convert `Aggregate` to `Project`.

This PR fixes this issue by tagging the `Project` to be order-irrelevant if it's converted from `Aggregate`. Then `EliminateSorts` optimizes the tagged `Project`.

### Why are the changes needed?

avoid sub-optimal plans

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44310 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
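
The tagging idea described in this follow-up can be sketched with a toy plan model. This is a minimal illustration in Python, not Spark's Scala code: the classes, the `order_irrelevant` flag, and the functions `aggregate_to_project` and `eliminate_sorts` are hypothetical stand-ins, and the toy `Aggregate` has grouping columns only. It shows why the tag is needed: a plain Project has to keep a Sort beneath it, while a Project that replaced an Aggregate can drop it.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Sort:
    order: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Aggregate:
    grouping: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"
    order_irrelevant: bool = False  # hypothetical stand-in for the tag


Plan = Union[Relation, Sort, Aggregate, Project]


def aggregate_to_project(agg: Aggregate) -> Project:
    # Record that the node being replaced ignored the order of its input.
    return Project(agg.grouping, agg.child, order_irrelevant=True)


def strip_sorts(plan: Plan) -> Plan:
    # Drop any Sort nodes sitting directly on top of the plan.
    while isinstance(plan, Sort):
        plan = plan.child
    return plan


def eliminate_sorts(plan: Plan) -> Plan:
    if isinstance(plan, Aggregate):
        return replace(plan, child=eliminate_sorts(strip_sorts(plan.child)))
    if isinstance(plan, Project):
        # Only a tagged Project may discard the ordering of its input.
        child = strip_sorts(plan.child) if plan.order_irrelevant else plan.child
        return replace(plan, child=eliminate_sorts(child))
    if isinstance(plan, Sort):
        return replace(plan, child=eliminate_sorts(plan.child))
    return plan


sorted_input = Sort(("id",), Relation("t"))
tagged = aggregate_to_project(Aggregate(("id",), sorted_input))
untagged = Project(("id",), sorted_input)
```

Running `eliminate_sorts` on `tagged` removes the Sort, while `untagged` comes back unchanged, which mirrors the information loss the follow-up describes.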

Push down limit 1 through Aggregate if it is group only.

11c266b

github-actions bot added the SQL label Jul 16, 2021

wangyum requested a review from cloud-fan July 19, 2021 01:12

cloud-fan approved these changes Jul 19, 2021

cloud-fan reviewed Jul 19, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (outdated)

fix

c630a63

Update comment

ec3df29

cloud-fan reviewed Jul 20, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (outdated)

cloud-fan approved these changes Jul 20, 2021

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/opti…

1d60d64

…mizer/Optimizer.scala Co-authored-by: Wenchen Fan <[email protected]>

wangyum closed this in af978c8 Jul 20, 2021

wangyum deleted the SPARK-36183 branch July 20, 2021 12:25

cloud-fan mentioned this pull request Dec 12, 2023

[SPARK-46378][SQL] Still remove Sort after converting Aggregate to Project #44310

Closed
