-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-33122][SQL] Remove redundant aggregates in the Optimzier #30018
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| * @param groupingExpressions expressions for grouping keys | ||
| * @param aggregateExpressions expressions for a project list, which could contain | ||
| * [[AggregateFunction]]s. | ||
| * [[AggregateExpression]]s. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This caused some confusion while making this PR
|
I'll try do get the actual performance change for the TPCDS queries soon. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Could you describe more about the "redundant" case in the PR description? e.g., plan changes before/after this PR |
|
Test build #129705 has finished for PR 30018 at commit
|
|
Added a changed query plan sample and some TPCDS results. The change is not remarkable, but for bigger datasets it can add up. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129753 has finished for PR 30018 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129769 has finished for PR 30018 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Show resolved
Hide resolved
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129895 has finished for PR 30018 at commit
|
The |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
retest this please |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #135757 has finished for PR 30018 at commit
|
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM again (I've checked the latest changes)
|
@maropu , this has been aproved for a while now, any change, that we can merge this? |
|
Anyone could check this? @cloud-fan @viirya @dongjoon-hyun @HyukjinKwon If no one has more comments, I'll merge this into master in a few days. |
|
retest this please |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
hm, Jenkins still looks unstable. Could you add an empty commit to invoke GA, @tanelk? |
|
@maropu , the checks did pass |
|
okay, we have much time until the next release, so I'll merge this for now. If there are more comments, please feel free to leave them. |
|
Thanks! Merged to master. cc: @cloud-fan @viirya @dongjoon-hyun @HyukjinKwon |
| } | ||
| } | ||
|
|
||
| private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. Usually, isXXX is better and consistent with Apache Spark convention.
| replaceAliasButKeepName(_, aliasMap)) | ||
| ) | ||
|
|
||
| // We might have introduces non-deterministic grouping expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
introduces->introducedexpression->expressions
| * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose | ||
| * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values. | ||
| */ | ||
| object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this optimizer into a new file please, @tanelk ?
| lower | ||
| .aggregateExpressions | ||
| .filter(_.deterministic) | ||
| .filter(!isAggregate(_)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- .filter(!isAggregate(_))
+ .filterNot(isAggregate)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable. I left a few minor comments. Thank you, @tanelk , @maropu , @peter-toth .
|
Thanks for the reviews, @dongjoon-hyun ! Please open a new follow-up PR to address them, @tanelk . |
Why is the golden file of TPCDS q87 not updated in this PR? |
The |
…er rule to apply to more cases ### What changes were proposed in this pull request? Addressed the dongjoon-hyun comments on the previous PR #30018. Extended the `RemoveRedundantAggregates` rule to remove redundant aggregations in even more queries. For example in ``` dataset .dropDuplicates() .groupBy('a) .agg(max('b)) ``` the `dropDuplicates` is not needed, because the result on `max` does not depend on duplicate values. ### Why are the changes needed? Improve performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #31914 from tanelk/SPARK-33122_redundant_aggs_followup. Lead-authored-by: [email protected] <[email protected]> Co-authored-by: Tanel Kiis <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
What changes were proposed in this pull request?
Added optimizer rule
RemoveRedundantAggregates. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.The affected part of the query plan for TPCDS q87:
Before:
After:
Why are the changes needed?
Performance improvements - few TPCDS queries have these kinds of duplicate aggregates.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
Benchmarks (sf=5):
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Linux 5.8.13-arch1-1
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz