-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-40599][SQL] Add multiTransform methods to TreeNode to generate alternatives #38034
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I've opened 3 WIP PRs to demonstrate the usage of
Please note that all 3 are based on this PR so currently they contain the changes of this PR too. |
|
@cloud-fan, @gengliangwang, @sigmod could you please take a look at this PR? |
6bd10b6 to
cc73d00
Compare
|
@cloud-fan, @sigmod I still think |
|
|
Hi @sigmod,
I gave 3 concrete examples where
That's correct but that's why
No, the main pont in the above examples is that we need a set of results. But, this new method also gives the oportunitly to limit the number of alternatives returned by time. E.g. we could spend a certain max number of seconds generating and evaulating alternative trees to find the best one... |
|
@sigmod let me know if you need more examples or I should ellaboreate on the aboves. |
|
I'm a bit hesitant to touch the core
Are you sure the |
I think by early pruning you meant that some of the mapped expressions should comply some constraint. E.g. the mapped expressions should be unique in terms of their canonicalized form.
but that PR is WIP. It's main purpose would be to show
I'm not sure why don't we do that change in a simple |
|
@cloud-fan I've updated my previous comment with the right example for 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good example, and explains why we don't have multiTransformUp. Can we highlight this and remove multiTransform as callers should be aware that it's tramsform down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is possible to add a multiTransformUp() version, although this PR doesn't contain it and I'm not sure when it would make sense to use it instead of the current top-down version.
I've removed the general multiTransform and kept the expilicit multiTransformDown versions.
I've also added a few more details to the examples. Let me know it we need more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not very clear to me that why we need this bool flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag is an internal flag and multiTransform and multiTransformWithPruning doesn't even return it. Probably multiTransformDownWithPruning should hide it as well using a private helper.
It is used internally after a non-leaf node is transformed to some alternatives but those alternatives doesn't contain the original node itself. E.g. please see a bit above that Add(a, b) -> Seq(11, 12, 21, 22), but the Seq doesn't contain Add(a, b) itself.
Now in this case we still need to consider the original Add(a, b) and traverse down and consider alternatives that might transform a or b.
This is done by adding Add(a, b) with a childrenTransformRequired=true flag to the afterRulesStream stream.
The returned flag you are asking about (we can call it childrenTransformed), is kind of the pair ot the previous flag.
Once the transformation of children are done that flag is "propagated up" and based on both flags we can decide if we need to add the Add(a, b) (with its children transformed) to the valid alternatives (!childrenTransformRequired || childrenTransformed). We do this in this collect:
https://github.com/apache/spark/blob/cc73d00f3e608076feebc65737e53e6454845bd4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L762-L768
Please note that there is another flag transformed in the afterRulesStream. If is pretty similar to negated childrenTransformRequired but they are not always negate each other to handle nodes where the rule didn't apply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, I've removed the autoContinue feature so these flags have been removed from the latest version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan, I realized that this semantics, that the rule returns Seq.emty for a node, could be used for early pruning.
Currently it means no transformation as if the rule didn't apply or returned a one element Seq with the original node.
But returned Seq.empty could mean early pruning to to acomplish something like pruneFunc does it @ulysses-you's latest version: #39556
Let me add that functionality here next week.
cc73d00 to
b99f2f9
Compare
… explicit flag to enable the `autoContinue` feature, add more examples, remove general versions to highlight this is a top-down rule
690e0c5 to
8de8f88
Compare
| /** | ||
| * Returns alternative copies of this node where `rule` has been recursively applied to the tree. | ||
| * | ||
| * Users should not expect a specific directionality. If a specific directionality is needed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment needs an update. We should also explain why only the down direction is provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've modified the scala docs, let me know if we need more defails or fixes.
| * Returns alternative copies of this node where `rule` has been recursively applied to the tree. | ||
| * | ||
| * Users should not expect a specific directionality. If a specific directionality is needed, | ||
| * multiTransformDownWithPruning or multiTransformUpWithPruning should be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * Users should not expect a specific directionality. If a specific directionality is needed, | ||
| * multiTransformDownWithPruning or multiTransformUpWithPruning should be used. | ||
| * | ||
| * @param rule a function used to generate transformed alternatives for a node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * @param rule a function used to generate transformed alternatives for a node | |
| * @param rule a function used to generate alternatives for a node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * of its children (pre-order). | ||
| * | ||
| * As it is very easy to generate enormous number of alternatives when the input tree is huge or | ||
| * when the rule returns large number of alternatives, this function returns the alternatives as a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * when the rule returns large number of alternatives, this function returns the alternatives as a | |
| * when the rule returns many alternatives for many nodes, this function returns the alternatives as a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * valid alternative. | ||
| * | ||
| * The rule can return `Stream.empty` to indicate that the original node should be pruned. In this | ||
| * case `multiTransform` returns an empty `Stream`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also mention, if the rule applies but you want the not-apply behavior, you can just return Seq(originalNode)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * | ||
| * 3. | ||
| * It is not always easy to determine if we will do any child expression mapping but we can enable | ||
| * the `autoContinue` flag to get the same result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about making multiTransform too complicated. Given the only use case for now is projecting expressions such as output partitions/orderings, can we simplify the rule a little bit? My preference is to remove this autoContinue flag and fully rely on the callers.
It is not always easy to determine if we will do any child expression mapping
If we have a real SQL use case, I'm happy to change my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed. And I don't have any other use case either, so I will remove that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, removed.
|
I think this is more efficient than keeping a candidate plans list and transforming all the plans in the list and repeating, as I mentioned in #39556 (comment) . I'll take a closer look after the |
| }.map { rewritten_plan => | ||
| if (this eq rewritten_plan) { | ||
| markRuleAsIneffective(ruleId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if my understanding is correct, the only way to mark this rule as ineffective is that it returns stream with one alternative which is eq to this ? then why we use map here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks, this looks wrong. I will fix it with other requests today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest commit fixes this, although there is relevant issue that I'm not sure we can mark the rule ineffective if the one element original stream is returned:
https://github.com/apache/spark/pull/38034/files#diff-94575875fbf007fdaf43e4946c69c18649294fed974a46816ab1986f6350541bR691-R699
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK. We can mark the rule as ineffective only if the partial function does not apply. Once it applies, no matter what it returns, it's effective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems fine
…ark a rule ineffective, move generateChildrenSeq as a top level helper as it will help with early pruning
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
Outdated
Show resolved
Hide resolved
…s/TreeNode.scala Co-authored-by: Wenchen Fan <[email protected]>
| } | ||
|
|
||
| private def generateChildrenSeq[T](childrenStreams: Seq[Stream[T]]): Stream[Seq[T]] = { | ||
| childrenStreams.foldRight(Stream(Seq.empty[T]))((childrenStream, childrenSeqStream) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we fold from right to left?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to generate alternatives for the first children of an expression first.
E.g. if a + b is the input and a => Stream(a1, a2) and b => Stream(b1, b2) is the rule then I wanted to get the Stream(a1 + b1, a2 + b1, a1 + b2, a2 + b2) output in this order.
|
thanks, merging to master! |
|
Thanks for the review @cloud-fan, @ulysses-you! |
…es to be any kinds of Seq ### What changes were proposed in this pull request? This is a follow-up PR to #38034. It relaxes `multiTransformDown()`'s `rule` parameter type to accept any kinds of `Seq` and make `MultiTransform.generateCartesianProduct()` helper public. ### Why are the changes needed? API mprovement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #39652 from peter-toth/SPARK-40599-multitransform-follow-up. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR introduce
TreeNode.multiTransform()methods to be able to recursively transform aTreeNode(and so a tree) into multiple alternatives. These functions are particularly useful if we want to transform an expression with a projection in which subexpressions can be aliased with multiple different attributes.E.g. if we have a partitioning expression
HashPartitioning(a + b)and we have aProjectnode that aliasesaasa1anda2andbasb1andb2we can easily generate a stream of alternative transformations of the original partitioning:The result of
multiTransformis a lazy stream to be able to limit the number of alternatives generated at the caller side as needed.Why are the changes needed?
TreeNode.multiTransform()is a useful helper method.Does this PR introduce any user-facing change?
No.
How was this patch tested?
New UTs are added.