[SPARK-42049][SQL] Improve AliasAwareOutputExpression #39556

ulysses-you · 2023-01-13T14:40:35Z

What changes were proposed in this pull request?

This PR moves AliasAwareOutputExpression from core to catalyst so that both logical plans and physical plans can use it.

It improves the alias-replacement code to support multiple aliases, so that ordering is preserved for every alias of an attribute. For example:

SELECT c, c as x, c as y FROM (SELECT * FROM t ORDER BY c)
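To make the multi-alias case concrete, here is a minimal plain-Scala sketch. It does not use Spark: `Named`, `Attr`, `Alias`, `aliasMap` and `orderingCandidates` are hypothetical stand-ins invented for illustration, not the classes or methods touched by this PR. It only shows the idea that one attribute maps to all of its aliases, so an ordering on `c` can also be reported on `x` and `y`.

```scala
// Hypothetical model, not Spark's Attribute/Alias classes.
object MultiAliasSketch {
  sealed trait Named { def name: String }
  final case class Attr(name: String) extends Named
  final case class Alias(child: Attr, name: String) extends Named

  // Group aliases by the attribute they rename, keeping every alias
  // instead of only one per attribute.
  def aliasMap(projectList: Seq[Named]): Map[Attr, Seq[Attr]] =
    projectList
      .collect { case Alias(child, name) => child -> Attr(name) }
      .groupBy(_._1)
      .map { case (child, pairs) => child -> pairs.map(_._2) }

  // The child's ordering key plus all of its aliases, restricted to
  // names that the projection actually outputs.
  def orderingCandidates(childKey: Attr, projectList: Seq[Named]): Seq[Attr] = {
    val output = projectList.map(n => Attr(n.name)).toSet
    (childKey +: aliasMap(projectList).getOrElse(childKey, Seq.empty))
      .filter(output.contains)
  }
}
```

For the query above, the projection `c, c as x, c as y` over a child ordered by `c` yields the candidates `c`, `x` and `y` in this model.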

It also improves AliasAwareQueryOutputOrdering to support stripping expressions that do not affect the result, for example Empty2Null.
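The stripping step can be pictured with the following plain-Scala sketch. The `Expr` hierarchy, `strip` and `preservesOrdering` here are assumptions made for illustration; the real `strip` in the PR operates on Catalyst expressions and its exact set of stripped wrappers is not shown in this conversation.

```scala
// Hypothetical model of stripping a wrapper before comparing expressions.
object StripSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Empty2Null(child: Expr) extends Expr

  // Peel off wrappers that, per the description above, do not affect the result.
  @scala.annotation.tailrec
  def strip(e: Expr): Expr = e match {
    case Empty2Null(child) => strip(child)
    case other => other
  }

  // A projected expression keeps the child's ordering on `key`
  // if it is `key` once the wrappers are removed.
  def preservesOrdering(projected: Expr, key: Attr): Boolean =
    strip(projected) == key
}
```

In this model `empty2null(p) AS p` is treated like `p` when matching against an ordering on `p`.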

Why are the changes needed?

AliasAwareOutputExpression currently does not support an attribute that has more than one alias, and AliasAwareOutputExpression should also work for LogicalPlan.

Does this PR introduce any user-facing change?

It improves performance and also fixes the issue in PR #39475.

How was this patch tested?

Added tests.

cloud-fan · 2023-01-13T14:43:04Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

do we need anything from the QueryPlan? Can it simply be trait AliasAwareOutputExpression extends SQLConfHelper?

we need to override outputOrdering and outputPartitioning

AliasAwareOutputExpression itself can be simplified

cloud-fan · 2023-01-13T14:44:59Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

Suggested change

-      val aliasArray = attrWithAliasMap.getOrElseUpdate(strip(key).canonicalized,
-        new ArrayBuffer[Attribute]())
+      val aliasArray = attrWithAliasMap.getOrElseUpdate(
+        strip(key).canonicalized, new ArrayBuffer[Attribute]())

cloud-fan · 2023-01-13T15:01:10Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

shall we return an empty map immediately if aliasCandidateLimit < 1? I think it's better than checking it at https://github.com/apache/spark/pull/39556/files#diff-2d06454bd3d4226cab8749376af5298599e0d5a1de175d9ba462608390d7d593R64
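A sketch of what the early return could look like, in plain Scala without Spark. `buildAliasMap`, `Attr` and `Alias` are hypothetical, and applying the limit as a per-attribute cap is an assumption of this sketch; the conversation does not show how the PR applies `aliasCandidateLimit`.

```scala
import scala.collection.mutable

// Hypothetical model of building the alias map with an early exit.
object AliasMapLimitSketch {
  final case class Attr(name: String)
  final case class Alias(child: Attr, name: String)

  def buildAliasMap(aliases: Seq[Alias], aliasCandidateLimit: Int): Map[Attr, List[Attr]] =
    if (aliasCandidateLimit < 1) {
      // Optimization disabled: skip all work and return an empty map.
      Map.empty
    } else {
      val result = mutable.LinkedHashMap.empty[Attr, mutable.ArrayBuffer[Attr]]
      aliases.foreach { a =>
        val buf = result.getOrElseUpdate(a.child, mutable.ArrayBuffer.empty[Attr])
        // Assumption: the limit caps the number of aliases kept per attribute.
        if (buf.size < aliasCandidateLimit) buf += Attr(a.name)
      }
      result.map { case (k, v) => k -> v.toList }.toMap
    }
}
```

With the check at the top, callers no longer need to test the limit before consulting the map; an empty map simply means no replacement happens.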

cloud-fan · 2023-01-13T15:04:41Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

We should only append the new expr if all its references are contained by outputExpressions.map(_.toAttribute)

newExpr can contain other references. For example, with df.orderBy($"a" + $"b").selectExpr("a as x"), we only replace a with x, but the Add expression has another attribute, b.

in this case, we should not report the ordering as x + b, as b is not even outputted by the plan.

it's a little hard to do this in AliasAwareOutputExpression. For example, take PartitioningCollection(a, b) where a is aliased to x. If we want to return PartitioningCollection(a) only, then we need to prune b. Should it be handled in AliasAwareOutputPartitioning?

We can't return PartitioningCollection(a) only. If a relation's child is partitioned by a and b, but b is not outputted by this relation (no alias either), then it's wrong to say this relation is partitioned by a. It can only be UnknownPartitioning.

PartitioningCollection(a, b) means t1 join t2 on a = b, not group by a, b.

Ah, then it's a flatMap semantic
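The outcome of this thread can be sketched in plain Scala. Everything below is a hypothetical model (the `Partitioning` types are simplified stand-ins that only borrow Spark's names, and `keepIfOutput`, `prune` and `outputPartitioning` are invented here). It shows two things: an expression is kept only if all of its references are in the output, and a collection of alternative partitionings is pruned with flatMap, while a single partitioning on several keys is dropped entirely if any key is missing.

```scala
// Hypothetical model; names mirror the discussion but are not Spark's classes.
object PruneSketch {
  sealed trait Expr { def references: Set[String] }
  final case class Attr(name: String) extends Expr {
    def references: Set[String] = Set(name)
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Keep an expression only if every attribute it references is output.
  def keepIfOutput(e: Expr, output: Set[String]): Option[Expr] =
    if (e.references.subsetOf(output)) Some(e) else None

  sealed trait Partitioning
  case object UnknownPartitioning extends Partitioning
  final case class HashPartitioning(keys: Seq[Expr]) extends Partitioning
  final case class PartitioningCollection(parts: Seq[Partitioning]) extends Partitioning

  // flatMap semantics: each alternative either survives whole or disappears.
  def prune(p: Partitioning, output: Set[String]): Seq[Partitioning] = p match {
    case h @ HashPartitioning(keys) =>
      if (keys.forall(_.references.subsetOf(output))) Seq(h) else Nil
    case PartitioningCollection(parts) => parts.flatMap(prune(_, output))
    case UnknownPartitioning => Nil
  }

  def outputPartitioning(p: Partitioning, output: Set[String]): Partitioning =
    prune(p, output) match {
      case Seq() => UnknownPartitioning
      case Seq(single) => single
      case many => PartitioningCollection(many)
    }
}
```

In this model, hash partitioning on the composite key (a, b) becomes unknown when b is not output, while a collection of the alternatives "by a" and "by b" (as produced by a join on a = b) keeps "by a".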

cloud-fan · 2023-01-13T15:07:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

this means we don't do any alias replacement?

no, it just specifies its outputExpressions to AliasAwareOutputExpression so that AliasAwareOutputExpression can build the alias map

but child.output has no alias at all, right?

Project has overridden it:

override protected def outputExpressions: Seq[NamedExpression] = projectList

which subclass uses the default implementation?

Filter, Limit, etc.
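The default-versus-override relationship discussed in this thread can be sketched as follows. This is a plain-Scala model with hypothetical `Plan`, `Relation`, `Filter` and `Project` classes; in particular, using the node's own `output` as the default for `outputExpressions` is an assumption of the sketch, chosen because it contains no aliases, which is the point the thread makes.

```scala
// Hypothetical model of a default outputExpressions and a Project override.
object OutputExpressionsSketch {
  sealed trait Named { def name: String }
  final case class Attr(name: String) extends Named
  final case class Alias(child: Attr, name: String) extends Named

  trait Plan {
    def output: Seq[Attr]
    // Default: plain attributes only, so no alias is ever found.
    protected def outputExpressions: Seq[Named] = output
    // The aliases available for building an alias map.
    def aliases: Seq[Alias] = outputExpressions.collect { case a: Alias => a }
  }

  final case class Relation(output: Seq[Attr]) extends Plan

  // Uses the default: passes the child's attributes through unchanged.
  final case class Filter(child: Plan) extends Plan {
    def output: Seq[Attr] = child.output
  }

  // Overrides the hook so the aliases in its project list become visible.
  final case class Project(projectList: Seq[Named], child: Plan) extends Plan {
    def output: Seq[Attr] = projectList.map(n => Attr(n.name))
    override protected def outputExpressions: Seq[Named] = projectList
  }
}
```

Nodes like Filter inherit the default and therefore contribute no aliases, while Project exposes the aliases in its project list.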

cloud-fan · 2023-01-13T15:08:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Suggested change

-    buildConf("spark.sql.optimizer.outputPartitioningAndOrderingCandidateLimit")
+    buildConf("spark.sql.optimizer.expressionProjectionCandidateLimit")

We may apply it to more places like constraint, let's be general.

cloud-fan · 2023-01-13T15:11:55Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

I'm not sure this is correct. It only replaces one alias from the input expressions. What happens if the output ordering is a + b and the aliases are a as x, b as y? Will we return x + y?

I think the algorithm should be

candidates = e
for ((expr, aliases) <- aliasMap) {
  val newCandidates = candidates.transform ...
  candidates ++= newCandidates
}

We can also add some early pruning

good point. I'd say no for now. Ideally, we should return a + y, x + b, x + y. The current code cannot return x + y. But it seems a corner case.

It assumes the input is: a, a as x, a as y, which is more likely to happen.
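The loop proposed earlier in this thread can be sketched in plain Scala. The `Expr` model, `replace` and `candidates` are hypothetical, and the `limit` parameter is an assumed cap in the spirit of the candidate limit discussed in this PR; this is not the PR's implementation and not `TreeNode.multiTransform()`.

```scala
// Hypothetical model of enumerating alias-substituted candidates.
object CandidateSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Replace every occurrence of `from` with `to`.
  def replace(e: Expr, from: Attr, to: Attr): Expr = e match {
    case a: Attr if a == from => to
    case Add(l, r) => Add(replace(l, from, to), replace(r, from, to))
    case other => other
  }

  // Process one aliased attribute at a time, applying each alias to all
  // candidates found so far, so substitutions combine (a + b => x + y).
  def candidates(e: Expr, aliasMap: Seq[(Attr, Seq[Attr])], limit: Int): Seq[Expr] =
    aliasMap.foldLeft(Seq(e)) { case (acc, (attr, aliases)) =>
      val added = acc
        .flatMap(c => aliases.map(alias => replace(c, attr, alias)))
        .filterNot(acc.contains)
      // Assumed cap to bound the combinatorial growth.
      (acc ++ added).distinct.take(limit)
    }
}
```

For the ordering a + b with aliases a as x and b as y, this model produces a + b, x + b, a + y and x + y; a later step would still have to drop the candidates whose references are not in the plan's output.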

@ulysses-you, @cloud-fan please take a look at my #37525 that is based on a new helper TreeNode.multiTransform() that I would like to add in #38034.
IMO TreeNode.multiTransform() would be a useful helper function to solve issues like this one and some others: #38034 (comment)

cloud-fan · 2023-01-13T15:56:06Z

One principle we should hold: a plan's output partitioning/ordering must only contain attributes from its output, otherwise the semantics are hard to define. What would it mean for data to be partitioned by a non-existent column?

peter-toth · 2023-01-13T16:57:25Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

I think that by looping over the aliasMap elements one by one, always adding new elements to normalizedCandidates, and filtering only after the aliasMap loop, you might hit the same issue as the third one described in #38034 (comment) (constraint generation).

ulysses-you · 2023-01-14T04:30:13Z

@cloud-fan addressed all comments, thank you @peter-toth

ulysses-you · 2023-01-14T04:31:35Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

this test is for the comment #39556 (comment)

ulysses-you · 2023-01-14T04:32:20Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

pass a prune function to handle PartitioningCollection and sameOrderExpressions.

ulysses-you · 2023-01-14T04:36:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala

this also handles partitionings such as RangePartitioning

EnricoMi · 2023-01-16T07:15:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Suggested change

-      .doc("The maximum number of the candidate of out put expressions whose alias are replaced." +
-        " It can preserve the output partitioning and ordering." +
-        " Negative value means disable this optimization.")
+      .doc("The maximum number of candidates for output expressions whose aliases are replaced." +
+        " This can preserve the output partitioning and ordering." +
+        " A negative value means to disable this optimization.")

EnricoMi

This fixes #38356 / SPARK-40885.

The example now writes this plan, as expected:

WriteFiles
+- *(1) Project [id#10, sort_col#11, empty2null(p#12) AS p#19]
   +- *(1) Sort [p#12 ASC NULLS FIRST, sort_col#11 ASC NULLS FIRST], false, 0
      +- ShuffleQueryStage 0
         +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=18]
            +- LocalTableScan [id#10, sort_col#11, p#12]

github-actions bot added the SQL label Jan 13, 2023

ulysses-you mentioned this pull request Jan 13, 2023: [SPARK-41959][SQL] Improve v1 writes with empty2null #39475 (Closed)

cloud-fan reviewed Jan 13, 2023

cloud-fan mentioned this pull request Jan 13, 2023: [SPARK-40599][SQL] Add multiTransform methods to TreeNode to generate alternatives #38034 (Closed)

peter-toth reviewed Jan 13, 2023

ulysses-you force-pushed the SPARK-42049 branch from 9d437fb to 6773896 January 14, 2023 04:27

ulysses-you commented Jan 14, 2023

ulysses-you changed the title from "[WIP][SPARK-42049][SQL] Improve AliasAwareOutputExpression" to "[SPARK-42049][SQL] Improve AliasAwareOutputExpression" Jan 14, 2023

ulysses-you commented Jan 14, 2023

peter-toth mentioned this pull request Jan 15, 2023: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account #37525 (Closed)

EnricoMi reviewed Jan 16, 2023

EnricoMi approved these changes Jan 16, 2023

Commit 94f2588: Improve AliasAwareOutputExpression

ulysses-you force-pushed the SPARK-42049 branch from 8f241c3 to 94f2588 January 18, 2023 02:30

ulysses-you closed this Feb 1, 2023

ulysses-you deleted the SPARK-42049 branch February 1, 2023 05:51

