Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a follow-up of #33397 to avoid sub-optimal plans. After converting Aggregate to Project, there is information lost: Aggregate doesn't care about the data order of inputs, but Project cares. EliminateSorts can remove Sort below Aggregate, but it doesn't work anymore if we convert Aggregate to Project.

This PR fixes this issue by tagging the Project to be order-irrelevant if it's converted from Aggregate. Then EliminateSorts optimizes the tagged Project.

Why are the changes needed?

avoid sub-optimal plans

Does this PR introduce any user-facing change?

No

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Dec 12, 2023
@cloud-fan
Copy link
Contributor Author

cloud-fan commented Dec 12, 2023

cc @wangyum @ulysses-you @viirya

case Limit(le @ IntegerLiteral(1), a: Aggregate) if a.groupOnly =>
Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child)))
val project = Project(a.aggregateExpressions, LocalLimit(le, a.child))
project.setTagValue(Project.dataOrderIrrelevantTag, ())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the EliminateSorts, it's data order relevant if only group expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's irrelevant, see

private def isOrderIrrelevantAggs(aggs: Seq[NamedExpression]): Boolean = {
def isOrderIrrelevantAggFunction(func: AggregateFunction): Boolean = func match {
case _: Min | _: Max | _: Count | _: BitAggregate => true
// Arithmetic operations for floating-point values are order-sensitive
// (they are not associative).
case _: Sum | _: Average | _: CentralMomentAgg =>
!Seq(FloatType, DoubleType)
.exists(e => DataTypeUtils.sameType(e, func.children.head.dataType))
case _ => false
}
def checkValidAggregateExpression(expr: Expression): Boolean = expr match {
case _: AttributeReference => true
case ae: AggregateExpression => isOrderIrrelevantAggFunction(ae.aggregateFunction)
case _: UserDefinedExpression => false
case e => e.children.forall(checkValidAggregateExpression)
}
aggs.forall(checkValidAggregateExpression)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missing the checkValidAggregateExpression.

Copy link
Contributor

@beliefer beliefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in c1ba963 Dec 12, 2023
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

cloud-fan added a commit that referenced this pull request Dec 21, 2023
### What changes were proposed in this pull request?

This is a followup of #44310 . It turns out that `TreeNodeTag` in `Project` is way too fragile. `Project` is a very basic node and very easy to get removed/transformed during plan optimization.

This PR switches to a different approach: since we can't retain the information (input data order doesn't matter) from `Aggregate`, let's leverage this information immediately. We pull out the expensive part of `EliminateSorts` to a new rule, so that we can safely call `EliminateSorts` right before we turn `Aggregate` into `Project`.

### Why are the changes needed?

to make the optimizer more robust.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44429 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…oject

### What changes were proposed in this pull request?

This is a follow-up of apache#33397 to avoid sub-optimal plans. After converting `Aggregate` to `Project`, there is information lost: `Aggregate` doesn't care about the data order of inputs, but `Project` cares. `EliminateSorts` can remove `Sort` below `Aggregate`, but it doesn't work anymore if we convert `Aggregate` to `Project`.

This PR fixes this issue by tagging the `Project` to be order-irrelevant if it's converted from `Aggregate`. Then `EliminateSorts` optimizes the tagged `Project`.

### Why are the changes needed?

avoid sub-optimal plans

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#44310 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants