[SPARK-33122][SQL] Remove redundant aggregates in the Optimizer #30018

tanelk · 2020-10-12T20:31:16Z

What changes were proposed in this pull request?

Added optimizer rule RemoveRedundantAggregates. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
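
The idea can be sketched on a toy plan model. This is only an illustration under stated assumptions, not the rule's actual implementation: `Plan`, `Relation`, `Aggregate`, `isLowerRedundant` and `removeRedundant` below are hypothetical stand-ins rather than Catalyst classes, and the set of duplicate-agnostic functions is assumed to be just `max` and `min`.

```scala
// Toy model only: these case classes are hypothetical stand-ins, not Spark's Catalyst classes.
object RedundantAggregateSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  // `functions` holds aggregate function names such as "count" or "max";
  // an empty list means the node only deduplicates its grouping keys.
  final case class Aggregate(keys: List[String], functions: List[String], child: Plan) extends Plan

  // Assumption for this sketch: only these functions ignore duplicate input rows.
  private val duplicateAgnostic = Set("max", "min")

  // The lower aggregate is redundant when it only deduplicates, the upper aggregate
  // groups by a subset of its keys, and the upper result does not depend on duplicates.
  def isLowerRedundant(upper: Aggregate, lower: Aggregate): Boolean =
    lower.functions.isEmpty &&
      upper.keys.forall(lower.keys.contains) &&
      upper.functions.forall(duplicateAgnostic.contains)

  // A single bottom-up pass that drops each redundant lower aggregate it meets.
  def removeRedundant(plan: Plan): Plan = plan match {
    case Aggregate(keys, fns, child) =>
      removeRedundant(child) match {
        case lower: Aggregate if isLowerRedundant(Aggregate(keys, fns, lower), lower) =>
          Aggregate(keys, fns, lower.child)
        case other => Aggregate(keys, fns, other)
      }
    case relation: Relation => relation
  }
}
```

In this model a stack of identical key-only aggregates collapses to one, while a `count` on top of a deduplicating aggregate is left alone because its result depends on duplicates.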

The affected part of the query plan for TPCDS q87:

Before:

== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#785]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
            +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                  +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                     +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#724]
                        +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                           +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                              :- ...

After:

== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#751]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#694]
            +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                  :- ...

Why are the changes needed?

Performance improvement: a few TPCDS queries contain this kind of duplicate aggregate.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

Benchmarks (sf=5):

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Linux 5.8.13-arch1-1
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz

Query	Before	After	Speedup
q14a	44s	44s	1x
q14b	41s	41s	1x
q38	6.5s	5.9s	1.1x
q87	7.2s	6.8s	1.1x
q14a-v2.7	55s	53s	1x

tanelk · 2020-10-12T20:32:12Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

 * @param groupingExpressions expressions for grouping keys
 * @param aggregateExpressions expressions for a project list, which could contain
- *                             [[AggregateFunction]]s.
+ *                             [[AggregateExpression]]s.


This caused some confusion while making this PR

tanelk · 2020-10-12T20:33:32Z

I'll try to get the actual performance change for the TPCDS queries soon.

SparkQA · 2020-10-12T21:17:55Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34311/

SparkQA · 2020-10-12T21:42:37Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34311/

maropu · 2020-10-12T23:06:23Z

Could you describe more about the "redundant" case in the PR description? e.g., plan changes before/after this PR

SparkQA · 2020-10-13T01:28:13Z

Test build #129705 has finished for PR 30018 at commit 14f3033.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tanelk · 2020-10-14T18:24:33Z

Added a sample of the changed query plan and some TPCDS results. The improvement is modest, but for bigger datasets it can add up.

SparkQA · 2020-10-14T21:21:58Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34359/

SparkQA · 2020-10-14T21:46:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34359/

SparkQA · 2020-10-14T23:47:07Z

Test build #129753 has finished for PR 30018 at commit 29701dc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-15T00:39:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34375/

SparkQA · 2020-10-15T01:04:24Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34375/

SparkQA · 2020-10-15T04:30:28Z

Test build #129769 has finished for PR 30018 at commit 4ce0644.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-16T12:49:03Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34500/

SparkQA · 2020-10-16T13:12:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34500/

SparkQA · 2020-10-16T13:16:49Z

Test build #129895 has finished for PR 30018 at commit ef64abf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait PredicateHelper extends Logging with AliasHelper
trait AliasHelper

tanelk · 2020-10-16T13:41:27Z

Test build #129895 has finished for PR 30018 at commit ef64abf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait PredicateHelper extends Logging with AliasHelper
trait AliasHelper

The org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.subquery/scalar-subquery/scalar-subquery-select.sql test seems to be failing on other PRs as well. For example https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129890/

SparkQA · 2021-01-26T18:18:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39100/

SparkQA · 2021-01-26T18:51:37Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39100/

maropu · 2021-03-04T12:59:06Z

retest this please

SparkQA · 2021-03-04T13:55:31Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40340/

SparkQA · 2021-03-04T14:05:43Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40340/

SparkQA · 2021-03-04T17:40:33Z

Test build #135757 has finished for PR 30018 at commit 37dc4b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

LGTM again (I've checked the latest changes)

tanelk · 2021-03-17T07:33:59Z

@maropu, this has been approved for a while now. Any chance that we can merge it?

maropu · 2021-03-17T07:53:40Z

Could anyone check this? @cloud-fan @viirya @dongjoon-hyun @HyukjinKwon If no one has more comments, I'll merge this into master in a few days.

maropu · 2021-03-17T12:37:00Z

retest this please

SparkQA · 2021-03-17T13:37:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40740/

SparkQA · 2021-03-17T13:45:57Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40740/

maropu · 2021-03-19T01:43:17Z

hm, Jenkins still looks unstable. Could you add an empty commit to invoke GA, @tanelk?

tanelk · 2021-03-19T10:09:46Z

@maropu , the checks did pass

maropu · 2021-03-20T02:16:08Z

okay, we have plenty of time until the next release, so I'll merge this for now. If there are more comments, please feel free to leave them.

maropu · 2021-03-20T02:17:33Z

Thanks! Merged to master. cc: @cloud-fan @viirya @dongjoon-hyun @HyukjinKwon

dongjoon-hyun · 2021-03-20T21:10:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {


nit. Usually, isXXX is better and consistent with Apache Spark convention.

dongjoon-hyun · 2021-03-20T21:13:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression


introduces -> introduced

expression -> expressions

dongjoon-hyun · 2021-03-20T21:19:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {


Could you move this optimizer into a new file please, @tanelk ?

dongjoon-hyun · 2021-03-20T21:37:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      lower
+        .aggregateExpressions
+        .filter(_.deterministic)
+        .filter(!isAggregate(_))


-        .filter(!isAggregate(_))
+        .filterNot(isAggregate)
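
The suggested change is behavior-preserving, which a small standalone sketch can confirm. The `isAggregate` predicate and the string "expressions" here are hypothetical placeholders for the rule's real helper and Catalyst expressions; only `filter` and `filterNot` are the actual Scala collection methods.

```scala
object FilterNotSketch {
  // Hypothetical stand-in for the rule's `isAggregate` helper:
  // here it simply flags names starting with "agg_".
  def isAggregate(name: String): Boolean = name.startsWith("agg_")

  val expressions: Seq[String] = Seq("a", "agg_max_b", "c")

  // Both forms keep exactly the elements for which the predicate is false.
  val viaFilter: Seq[String] = expressions.filter(!isAggregate(_))
  val viaFilterNot: Seq[String] = expressions.filterNot(isAggregate)
}
```

`filterNot` avoids the negated lambda and lets the method be passed by name.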

dongjoon-hyun

Looks reasonable. I left a few minor comments. Thank you, @tanelk , @maropu , @peter-toth .

maropu · 2021-03-21T03:01:12Z

Thanks for the reviews, @dongjoon-hyun ! Please open a new follow-up PR to address them, @tanelk .

cloud-fan · 2021-03-22T07:39:05Z

The affected part of the query plan for TPCDS q87:

Why is the golden file of TPCDS q87 not updated in this PR?

tanelk · 2021-03-22T07:52:24Z

The affected part of the query plan for TPCDS q87:

Why is the golden file of TPCDS q87 not updated in this PR?

The LeftSemi/LeftAnti pushdown rule was changed while this PR was in review, and the situation where this rule applied no longer occurs.

…er rule to apply to more cases

### What changes were proposed in this pull request?

Addressed dongjoon-hyun's comments on the previous PR #30018. Extended the `RemoveRedundantAggregates` rule to remove redundant aggregations in even more queries. For example, in

```
dataset
  .dropDuplicates()
  .groupBy('a)
  .agg(max('b))
```

the `dropDuplicates` is not needed, because the result of `max` does not depend on duplicate values.

### Why are the changes needed?

Improve performance.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #31914 from tanelk/SPARK-33122_redundant_aggs_followup.

Lead-authored-by: [email protected] <[email protected]>
Co-authored-by: Tanel Kiis <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
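
The claim that `max` does not depend on duplicate rows can be checked on plain Scala collections. This is a sketch, not Spark code: `rows`, `maxByKey` and `sumByKey` are made-up names, `distinct` stands in for `dropDuplicates`, and `sum` is included only as a contrasting aggregate whose result does change.

```scala
object DropDuplicatesSketch {
  // Rows of (a, b); the repeated rows mimic input that `dropDuplicates` would shrink.
  val rows: Seq[(Int, Int)] = Seq((1, 10), (1, 10), (1, 7), (2, 3), (2, 3))

  // Plain-collections stand-in for `groupBy('a).agg(max('b))`.
  def maxByKey(data: Seq[(Int, Int)]): Map[Int, Int] =
    data.groupBy(_._1).map { case (a, group) => a -> group.map(_._2).max }

  // Stand-in for a sum per key, which does depend on duplicate rows.
  def sumByKey(data: Seq[(Int, Int)]): Map[Int, Int] =
    data.groupBy(_._1).map { case (a, group) => a -> group.map(_._2).sum }

  // Deduplicating first leaves the max per key unchanged ...
  val maxIsUnaffected: Boolean = maxByKey(rows) == maxByKey(rows.distinct)
  // ... but changes the sum per key, so the deduplication is not redundant there.
  val sumIsAffected: Boolean = sumByKey(rows) != sumByKey(rows.distinct)
}
```

This is why removing the deduplication step is only safe when every aggregate function above it is insensitive to duplicates.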

tanelk added 2 commits October 12, 2020 23:19

RemoveRedundantAggregates

84ba723

Merge branch 'master' into SPARK-33122

14f3033

Clearer naming

29701dc

Clearer naming

4ce0644

maropu reviewed Oct 15, 2020

dongjoon-hyun added the SQL label Oct 15, 2020

Handle aliases

ef64abf

tanelk added 2 commits October 16, 2020 16:44

Merge branch 'master' into SPARK-33122

ca974c7

UTs for non-deterministic cases

4bf08bb

maropu approved these changes Mar 8, 2021

Trigger build

07e758d

maropu closed this in 620cae0 Mar 20, 2021

dongjoon-hyun reviewed Mar 20, 2021

tanelk mentioned this pull request Mar 21, 2021

[SPARK-33122][SQL][FOLLOWUP] Extend RemoveRedundantAggregates optimizer rule to apply to more cases #31914

Closed

dongjoon-hyun mentioned this pull request Aug 6, 2025

[SPARK-53155][SQL] Global lower agggregation should not be replaced with a project #51884

Closed
