[SPARK-46378][SQL] Still remove Sort after converting Aggregate to Project #44310

cloud-fan · 2023-12-12T07:50:18Z

What changes were proposed in this pull request?

This is a follow-up of #33397 to avoid sub-optimal plans. Converting Aggregate to Project loses information: Aggregate doesn't care about the order of its input data, but Project does. EliminateSorts can remove a Sort below an Aggregate, but it can no longer do so once the Aggregate has been converted to a Project.

This PR fixes the issue by tagging a Project as order-irrelevant when it is converted from an Aggregate. EliminateSorts then optimizes the tagged Project.
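
To make the idea concrete, here is a minimal sketch using toy plan nodes. Everything in it is an assumption for illustration: the classes are not Spark's Catalyst classes, the `orderIrrelevant` field is a hypothetical stand-in for the tag, and the toy omits the LocalLimit that the real rewrite inserts. It only shows why a Sort survives under a plain Project but can be dropped under a tagged one.

```scala
object TagSketch {
  // Toy plan nodes for illustration only; these are NOT Spark's Catalyst classes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Sort(child: Plan) extends Plan
  case class Limit(n: Int, child: Plan) extends Plan
  case class Aggregate(groupOnly: Boolean, child: Plan) extends Plan
  // `orderIrrelevant` is a hypothetical stand-in for the tag set on Project.
  case class Project(child: Plan, orderIrrelevant: Boolean = false) extends Plan

  // Rewrite `Limit 1` over a group-only Aggregate into a tagged Project.
  def convert(plan: Plan): Plan = plan match {
    case Limit(1, Aggregate(true, child)) =>
      Limit(1, Project(child, orderIrrelevant = true))
    case other => other
  }

  // Strip any Sorts stacked directly on top of `plan`.
  private def stripSorts(plan: Plan): Plan = plan match {
    case Sort(child) => stripSorts(child)
    case other       => other
  }

  // Remove Sorts whose parent does not care about input order.
  def eliminateSorts(plan: Plan): Plan = plan match {
    case Aggregate(groupOnly, child) =>
      val c = if (groupOnly) stripSorts(child) else child
      Aggregate(groupOnly, eliminateSorts(c))
    case Project(child, tagged) =>
      // Only a tagged Project is known to ignore the order of its input.
      val c = if (tagged) stripSorts(child) else child
      Project(eliminateSorts(c), tagged)
    case Limit(n, child) => Limit(n, eliminateSorts(child))
    case Sort(child)     => Sort(eliminateSorts(child))
    case r: Relation     => r
  }

  def main(args: Array[String]): Unit = {
    val query = Limit(1, Aggregate(groupOnly = true, Sort(Relation("t"))))
    // The Sort is removed because the converted Project carries the flag.
    println(eliminateSorts(convert(query)))
    // An ordinary Project keeps the Sort below it.
    println(eliminateSorts(Limit(1, Project(Sort(Relation("t"))))))
  }
}
```

In this toy the flag is a constructor field, so it survives copies of the node; a tag attached to a node after construction is easier to lose when the node is rebuilt.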

Why are the changes needed?

To avoid sub-optimal plans that keep a Sort the optimizer could otherwise remove.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new test was added.

Was this patch authored or co-authored using generative AI tooling?

No

cloud-fan · 2023-12-12T07:50:49Z

cc @wangyum @ulysses-you @viirya

beliefer · 2023-12-12T08:10:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

    case Limit(le @ IntegerLiteral(1), a: Aggregate) if a.groupOnly =>
-      Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child)))
+      val project = Project(a.aggregateExpressions, LocalLimit(le, a.child))
+      project.setTagValue(Project.dataOrderIrrelevantTag, ())


According to EliminateSorts, isn't an Aggregate with only grouping expressions data-order relevant?

It's order-irrelevant; see:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Lines 1629 to 1647 in 8d64cb4

    private def isOrderIrrelevantAggs(aggs: Seq[NamedExpression]): Boolean = {
      def isOrderIrrelevantAggFunction(func: AggregateFunction): Boolean = func match {
        case _: Min | _: Max | _: Count | _: BitAggregate => true
        // Arithmetic operations for floating-point values are order-sensitive
        // (they are not associative).
        case _: Sum | _: Average | _: CentralMomentAgg =>
          !Seq(FloatType, DoubleType)
            .exists(e => DataTypeUtils.sameType(e, func.children.head.dataType))
        case _ => false
      }

      def checkValidAggregateExpression(expr: Expression): Boolean = expr match {
        case _: AttributeReference => true
        case ae: AggregateExpression => isOrderIrrelevantAggFunction(ae.aggregateFunction)
        case _: UserDefinedExpression => false
        case e => e.children.forall(checkValidAggregateExpression)
      }

      aggs.forall(checkValidAggregateExpression)

I missed checkValidAggregateExpression.
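
As background for the comment in the quoted snippet about floating-point arithmetic being order-sensitive, here is a small standalone demonstration. It uses plain Scala collections, not Spark, and the helper names are made up for this example. It shows that summing the same doubles in a different order can give a different result, while integer sums are unaffected, which is why a Sort below such an aggregate cannot simply be dropped.

```scala
object FloatOrderDemo {
  // Sum left to right, the way rows would be consumed in a given order.
  def sumDoubles(xs: Seq[Double]): Double = xs.foldLeft(0.0)(_ + _)
  def sumLongs(xs: Seq[Long]): Long = xs.foldLeft(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    val a = Seq(1e16, 1.0, -1e16)
    val b = Seq(1e16, -1e16, 1.0) // same values, different order

    // In `a`, the 1.0 is absorbed when added to 1e16; in `b` it is added last and kept.
    println(sumDoubles(a) == sumDoubles(b)) // false

    // Integer addition is associative, so the order does not matter.
    println(sumLongs(Seq(3L, 1L, 2L)) == sumLongs(Seq(1L, 2L, 3L))) // true
  }
}
```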

beliefer

LGTM.

cloud-fan · 2023-12-12T18:04:18Z

thanks for the review, merging to master!

viirya

Looks good to me.

dongjoon-hyun

+1, LGTM.

### What changes were proposed in this pull request?

This is a followup of #44310. It turns out that `TreeNodeTag` in `Project` is way too fragile. `Project` is a very basic node and very easy to get removed/transformed during plan optimization.

This PR switches to a different approach: since we can't retain the information (input data order doesn't matter) from `Aggregate`, let's leverage this information immediately. We pull out the expensive part of `EliminateSorts` to a new rule, so that we can safely call `EliminateSorts` right before we turn `Aggregate` into `Project`.

### Why are the changes needed?

to make the optimizer more robust.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44429 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
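
The follow-up approach described above can be sketched with toy plan nodes. This is an illustration under stated assumptions, not the actual change: the classes are not Catalyst classes, `eliminateSortsBelow` is a hypothetical helper standing in for calling `EliminateSorts` on the `Aggregate`, and the LocalLimit from the real rewrite is omitted. The point is only the ordering: the Sort is removed while the node is still an `Aggregate`, so the resulting `Project` needs no tag.

```scala
object FollowupSketch {
  // Toy plan nodes for illustration only; these are NOT Spark's Catalyst classes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Sort(child: Plan) extends Plan
  case class Limit(n: Int, child: Plan) extends Plan
  case class Aggregate(groupOnly: Boolean, child: Plan) extends Plan
  case class Project(child: Plan) extends Plan // plain node, no tag

  // Hypothetical helper: drop Sorts directly below a group-only Aggregate.
  def eliminateSortsBelow(agg: Aggregate): Aggregate = {
    def strip(p: Plan): Plan = p match {
      case Sort(c) => strip(c)
      case other   => other
    }
    if (agg.groupOnly) agg.copy(child = strip(agg.child)) else agg
  }

  // Use the "order does not matter" knowledge first, then convert.
  def convert(plan: Plan): Plan = plan match {
    case Limit(1, agg @ Aggregate(true, _)) =>
      val cleaned = eliminateSortsBelow(agg)
      Limit(1, Project(cleaned.child))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val query = Limit(1, Aggregate(groupOnly = true, Sort(Relation("t"))))
    // The Sort is already gone by the time the Project is created.
    println(convert(query))
  }
}
```

Because nothing has to be remembered on the `Project`, later rules that rebuild or remove that node cannot lose the information.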

…oject

### What changes were proposed in this pull request?

This is a follow-up of apache#33397 to avoid sub-optimal plans. After converting `Aggregate` to `Project`, there is information lost: `Aggregate` doesn't care about the data order of inputs, but `Project` cares. `EliminateSorts` can remove `Sort` below `Aggregate`, but it doesn't work anymore if we convert `Aggregate` to `Project`.

This PR fixes this issue by tagging the `Project` to be order-irrelevant if it's converted from `Aggregate`. Then `EliminateSorts` optimizes the tagged `Project`.

### Why are the changes needed?

avoid sub-optimal plans

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#44310 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

Still remove Sort after converting Aggregate to Project

af5f6cf

github-actions bot added the SQL label Dec 12, 2023

beliefer reviewed Dec 12, 2023

ulysses-you approved these changes Dec 12, 2023

beliefer approved these changes Dec 12, 2023

cloud-fan closed this in c1ba963 Dec 12, 2023

viirya reviewed Dec 12, 2023

dongjoon-hyun reviewed Dec 12, 2023

cloud-fan mentioned this pull request Dec 20, 2023

[SPARK-46378][SQL][FOLLOWUP] Do not rely on TreeNodeTag in Project #44429

Closed
