-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-46378][SQL] Still remove Sort after converting Aggregate to Project #44310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| case Limit(le @ IntegerLiteral(1), a: Aggregate) if a.groupOnly => | ||
| Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child))) | ||
| val project = Project(a.aggregateExpressions, LocalLimit(le, a.child)) | ||
| project.setTagValue(Project.dataOrderIrrelevantTag, ()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the EliminateSorts, it's data order relevant if only group expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's irrelevant, see
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Lines 1629 to 1647 in 8d64cb4
| private def isOrderIrrelevantAggs(aggs: Seq[NamedExpression]): Boolean = { | |
| def isOrderIrrelevantAggFunction(func: AggregateFunction): Boolean = func match { | |
| case _: Min | _: Max | _: Count | _: BitAggregate => true | |
| // Arithmetic operations for floating-point values are order-sensitive | |
| // (they are not associative). | |
| case _: Sum | _: Average | _: CentralMomentAgg => | |
| !Seq(FloatType, DoubleType) | |
| .exists(e => DataTypeUtils.sameType(e, func.children.head.dataType)) | |
| case _ => false | |
| } | |
| def checkValidAggregateExpression(expr: Expression): Boolean = expr match { | |
| case _: AttributeReference => true | |
| case ae: AggregateExpression => isOrderIrrelevantAggFunction(ae.aggregateFunction) | |
| case _: UserDefinedExpression => false | |
| case e => e.children.forall(checkValidAggregateExpression) | |
| } | |
| aggs.forall(checkValidAggregateExpression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missing the checkValidAggregateExpression.
beliefer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
|
thanks for the review, merging to master! |
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
### What changes were proposed in this pull request? This is a followup of #44310 . It turns out that `TreeNodeTag` in `Project` is way too fragile. `Project` is a very basic node and very easy to get removed/transformed during plan optimization. This PR switches to a different approach: since we can't retain the information (input data order doesn't matter) from `Aggregate`, let's leverage this information immediately. We pull out the expensive part of `EliminateSorts` to a new rule, so that we can safely call `EliminateSorts` right before we turn `Aggregate` into `Project`. ### Why are the changes needed? to make the optimizer more robust. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #44429 from cloud-fan/sort. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…oject ### What changes were proposed in this pull request? This is a follow-up of apache#33397 to avoid sub-optimal plans. After converting `Aggregate` to `Project`, there is information lost: `Aggregate` doesn't care about the data order of inputs, but `Project` cares. `EliminateSorts` can remove `Sort` below `Aggregate`, but it doesn't work anymore if we convert `Aggregate` to `Project`. This PR fixes this issue by tagging the `Project` to be order-irrelevant if it's converted from `Aggregate`. Then `EliminateSorts` optimizes the tagged `Project`. ### Why are the changes needed? avoid sub-optimal plans ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#44310 from cloud-fan/sort. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This is a follow-up of #33397 to avoid sub-optimal plans. After converting
AggregatetoProject, there is information lost:Aggregatedoesn't care about the data order of inputs, butProjectcares.EliminateSortscan removeSortbelowAggregate, but it doesn't work anymore if we convertAggregatetoProject.This PR fixes this issue by tagging the
Projectto be order-irrelevant if it's converted fromAggregate. ThenEliminateSortsoptimizes the taggedProject.Why are the changes needed?
avoid sub-optimal plans
Does this PR introduce any user-facing change?
No
How was this patch tested?
new test
Was this patch authored or co-authored using generative AI tooling?
No