@wangyum wangyum commented Jul 16, 2021

What changes were proposed in this pull request?

Push LIMIT 1 down through an Aggregate that is group-only (its output consists only of grouping expressions, with no aggregate functions), and turn that Aggregate into a Project. For example:

create table t1 using parquet as select id from range(100000000L);
create table t2 using parquet as select id from range(100000000L);
create view v1 as select * from t1 union select * from t2;
select * from v1 limit 1;
[Figures: before this PR and after this PR]
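
The rewrite described in this section can be sketched on a toy plan tree. This is a minimal illustration in plain Python, not Spark's Catalyst code: the node classes, the `group_only` property, and the function `push_limit_one_through_aggregate` are all hypothetical names, and the shape of the rewritten plan (a `Limit` over a `Project` over a `LocalLimit`) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Aggregate:
    grouping: Tuple[str, ...]   # GROUP BY expressions
    output: Tuple[str, ...]     # expressions the aggregate emits
    child: object

    @property
    def group_only(self) -> bool:
        # Assumption: "group only" means every output expression
        # is itself one of the grouping expressions.
        return all(expr in self.grouping for expr in self.output)


@dataclass(frozen=True)
class Project:
    output: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class LocalLimit:
    n: int
    child: object


@dataclass(frozen=True)
class Limit:
    n: int
    child: object


def push_limit_one_through_aggregate(plan):
    """Rewrite Limit(1, group-only Aggregate) into Limit(1, Project(LocalLimit(1, child)))."""
    if (
        isinstance(plan, Limit)
        and plan.n == 1
        and isinstance(plan.child, Aggregate)
        and plan.child.group_only
    ):
        agg = plan.child
        # Any single input row is a valid answer, so no grouping work is needed.
        return Limit(1, Project(agg.output, LocalLimit(1, agg.child)))
    return plan  # every other plan shape is left untouched


# `select * from v1 limit 1`, where the UNION in v1 deduplicates by grouping on id.
before = Limit(1, Aggregate(("id",), ("id",), Relation("union_of_t1_t2")))
after = push_limit_one_through_aggregate(before)
print(after)
```

The sketch leaves the plan unchanged when the limit is greater than 1 or when the aggregate computes anything besides its grouping columns, because in those cases the full deduplication or aggregation is still required.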

Why are the changes needed?

Improve query performance. This is a real case from the cluster:
[Figure: the real case from the cluster]

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Jul 16, 2021

SparkQA commented Jul 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45676/

SparkQA commented Jul 16, 2021

Test build #141165 has finished for PR 33397 at commit 11c266b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum requested a review from cloud-fan July 19, 2021 01:12

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45772/

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45772/

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45780/

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45780/

SparkQA commented Jul 19, 2021

Test build #141258 has finished for PR 33397 at commit c630a63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2021

Test build #141266 has finished for PR 33397 at commit ec3df29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45825/

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45825/

@wangyum wangyum closed this in af978c8 Jul 20, 2021

wangyum commented Jul 20, 2021

Merged to master.

@wangyum wangyum deleted the SPARK-36183 branch July 20, 2021 12:25

SparkQA commented Jul 20, 2021

Test build #141311 has finished for PR 33397 at commit 1d60d64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum commented Jul 26, 2021

The benchmark result:

[Figures: benchmark results before this PR and after this PR]

cloud-fan added a commit that referenced this pull request Dec 12, 2023
…oject

### What changes were proposed in this pull request?

This is a follow-up of #33397 to avoid sub-optimal plans. Converting `Aggregate` to `Project` loses information: `Aggregate` doesn't care about the order of its input data, but `Project` does. `EliminateSorts` can remove a `Sort` below an `Aggregate`, but it no longer does so once the `Aggregate` has been converted to a `Project`.

This PR fixes the issue by tagging the `Project` as order-irrelevant if it was converted from an `Aggregate`. `EliminateSorts` then optimizes the tagged `Project`.
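
The tagging idea can be illustrated with a small toy model. This is a sketch in plain Python under stated assumptions, not the Catalyst implementation: the `order_irrelevant` field and the `eliminate_sorts` function are hypothetical names, and this version only strips `Sort` nodes sitting directly below a tagged `Project`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Sort:
    order: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Project:
    output: Tuple[str, ...]
    child: object
    # Hypothetical tag: True when this Project replaced an Aggregate,
    # so the order of its input rows does not matter.
    order_irrelevant: bool = False


def eliminate_sorts(plan):
    """Drop Sort nodes directly below a Project tagged as order-irrelevant."""
    if isinstance(plan, Project):
        child = plan.child
        if plan.order_irrelevant:
            # The sort cannot affect the result, so skip over it.
            while isinstance(child, Sort):
                child = child.child
        return Project(plan.output, eliminate_sorts(child), plan.order_irrelevant)
    if isinstance(plan, Sort):
        return Sort(plan.order, eliminate_sorts(plan.child))
    return plan


scan = Relation("t")
tagged = Project(("id",), Sort(("id",), scan), order_irrelevant=True)
untagged = Project(("id",), Sort(("id",), scan))

print(eliminate_sorts(tagged))    # the Sort is removed
print(eliminate_sorts(untagged))  # the Sort is kept
```

Without the tag, the optimizer has to assume that a `Project` preserves a meaningful row order, which is why the untagged plan keeps its `Sort`.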

### Why are the changes needed?

To avoid sub-optimal plans.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44310 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…oject
