-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-42049][SQL] Improve AliasAwareOutputExpression #39556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need anything from the QueryPlan? Can it simply be trait AliasAwareOutputExpression extends SQLConfHelper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need override outputOrdering and outputPartitioning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AliasAwareOutputExpression itself can be simplified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| val aliasArray = attrWithAliasMap.getOrElseUpdate(strip(key).canonicalized, | |
| new ArrayBuffer[Attribute]()) | |
| val aliasArray = attrWithAliasMap.getOrElseUpdate( | |
| strip(key).canonicalized, new ArrayBuffer[Attribute]()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we return a empty map immediately if aliasCandidateLimit < 1? I think it's better than checking it at https://github.com/apache/spark/pull/39556/files#diff-2d06454bd3d4226cab8749376af5298599e0d5a1de175d9ba462608390d7d593R64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only append the new expr if all its references are contained by outputExpressions.map(_.toAttribute)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newExpr can contain other reference. for example df.orderby($"a" + $"b").selectExpr("a as x"), we only replace a to x but the expression Add has an another attribute b.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, we should not report the ordering as x + b, as b is not even outputted by the plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a little hard do this in AliasAwareOutputExpression. for example PartitioningCollection(a, b) and a alias to x. If we want to return PartitioningCollection(a) only, then we need to prune b. It should be handled at AliasAwareOutputPartitioning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't return PartitioningCollection(a) only. If a relation's child is partitioned by a and b, but b is not outputted by this relation (no alias either), then it's wrong to say this relation is partitioned by a. It can only be UnknownPartitioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PartitioningCollection(a, b) means t1 join t2 on a = b, not group by a, b..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, then it's a flatMap semantic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means we don't do any alias replacement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it just specifies it's outputExpressions to AliasAwareOutputExpression so that AliasAwareOutputExpression can build alias map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but child.output has no alias at all, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Project has overridden it..
override protected def outputExpressions: Seq[NamedExpression] = projectList
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which subclass uses the default implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter/Limit etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| buildConf("spark.sql.optimizer.outputPartitioningAndOrderingCandidateLimit") | |
| buildConf("spark.sql.optimizer.expressionProjectionCandidateLimit") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may apply it to more places like constraint, let's be general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is correct. It only replace one alias from the input expressions. What happens if the output ordering is a + b and the alias is a as x, b as y? will we return x + y?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the algorithm should be
candidates = e
for ((expr, aliases) <- aliasMap) {
val newCandicates = candidates.transform ...
candidates ++= newCandicates
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also add some early pruning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. I'd say no for now. Idealy, we should return a + y, x + b, x + y. The current code can not return x + y. But it seems a corner case.
It assumes the input is: a, a as x, a as y which is more likely happen..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ulysses-you, @cloud-fan please take a look at my #37525 that is based on a new helper TreeNode.multiTransform() that I would like to add in #38034.
IMO TreeNode.multiTransform() would be a useful helper function to solve issues like this one and some others: #38034 (comment)
|
One principle we should hold: a plan's output partitioning/ordering must only contain the attributes from its output, otherwise the semantic is hard to define. What do you mean your data is partitioned by a non-existing column? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with this loop on aliasMap elements one by one and always adding new elements to normalizedCandidates and then do some filtering after the aliasMap loop you might do the same issue as described 3rd in #38034 (comment) (constraint generation)
9d437fb to
6773896
Compare
|
@cloud-fan addreesed all comments, thank you @peter-toth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is for the comment #39556 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass a prune function to handle PartitionCollection and sameOrderExpression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to also handle such as RangePartitioning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .doc("The maximum number of the candidate of out put expressions whose alias are replaced." + | |
| " It can preserve the output partitioning and ordering." + | |
| " Negative value means disable this optimization.") | |
| .doc("The maximum number of candidates for output expressions whose aliases are replaced." + | |
| " This can preserve the output partitioning and ordering." + | |
| " A negative value means to disable this optimization.") |
EnricoMi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixes #38356 / SPARK-40885.
The example now writes this plan, as expected:
WriteFiles
+- *(1) Project [id#10, sort_col#11, empty2null(p#12) AS p#19]
+- *(1) Sort [p#12 ASC NULLS FIRST, sort_col#11 ASC NULLS FIRST], false, 0
+- ShuffleQueryStage 0
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=18]
+- LocalTableScan [id#10, sort_col#11, p#12]
8f241c3 to
94f2588
Compare
What changes were proposed in this pull request?
This pr moves
AliasAwareOutputExpressionfrom core to catalyst so both logical plan and physical plan can use it.Improve the code of replace alias to support multi-alias so we can preverse ordering with all of aliased, for example:
Improve the
AliasAwareQueryOutputOrderingto support strip expression which does not affect result. For exampleEmpty2Null.Why are the changes needed?
AliasAwareOutputExpressionnow does not support if an attribute has more than one alias, andAliasAwareOutputExpressionshould also work for LogicalPlan.Does this PR introduce any user-facing change?
improve performance and this also fix the issue in pr #39475
How was this patch tested?
add test