-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer #16757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #72202 has finished for PR 16757 at commit
|
|
Test build #72203 has finished for PR 16757 at commit
|
|
Test build #72248 has finished for PR 16757 at commit
|
Where do you implement "traversing the parent tree"? |
|
I do not explicitly implement 'traversing the parent tree'. I have opened up a few methods in |
|
The new |
| /** | ||
| * Returns true if the project list is semantically same as child output, after strip alias on | ||
| * attribute. | ||
| * Replace the attributes in an expression using the given mapping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this doc is wrong?
| case (a: Attribute, o) if a semanticEquals o => true | ||
| case _ => false | ||
| } | ||
| private def createAttributeMapping(current: LogicalPlan, next: LogicalPlan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain what current and next means here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current is plan before we remove redundant aliases, and next is the plan after we have remove the redundant aliases. I'll update the doc.
| /** | ||
| * Get an appropriate alias cleaning method for the given node. | ||
| * | ||
| * We currently clean Project, Aggregate & Window nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is an improvement right? previously we only clean Project. However I think this method is over engineered, we can just create a def needClean(plan: LogicalPlan): Boolean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that is an improvement. I added all LogicalPlan nodes that are producing new attributes using named expressions.
I will inline this method.
| case _ => | ||
| // Drop blacklisted attributes that are masked in the current project. This allows us to | ||
| // remove redundant aliases in the subtree. | ||
| val childBlacklist = blacklist -- (plan.inputSet -- plan.outputSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this branch needed because Union reuse the output of left side? can we remove it if we fix Union?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the case _ => right? That is needed for everything which is not a Join. We are doing manual tree traversal here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I mean childBlacklist. We can just use blacklist if Union is fixed right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The child blacklist is an optimization. I can remove an attribute from the child's blacklist if I know that it is being created in the current node. This way I give the rule more freedom in removing attributes. The thing is that situation should only happen when there are multiple self joins, and this might be an over optimization.
|
Test build #72323 has finished for PR 16757 at commit
|
|
retest this please |
|
Test build #72372 has finished for PR 16757 at commit
|
|
retest this please. |
|
Test build #72374 has finished for PR 16757 at commit
|
| // Create a an expression cleaning function for nodes that can actually produce redundant | ||
| // aliases, use identity otherwise. | ||
| val clean: Expression => Expression = plan match { | ||
| case _: Project => removeRedundantAlias(_, blacklist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we clean expressions for all nodes? Or like rule CleanupAliases that we only skip for ObjectConsumer, ObjectProducer and AppendColumns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually can we merge this rule into CleanupAliases?
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
| val query = relation.select('b as 'b, 'a as 'a).analyze | ||
| val optimized = Optimize.execute(query) | ||
| val query = relation.select('b, 'a).analyze | ||
| val optimized = Optimize.execute(relation.select('b as 'b, 'a as 'a).analyze) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to most of the optimizer unit tests, the preferred code style should be
val query = relation.select('b as 'b, 'a as 'a).analyze
val optimized = Optimize.execute(query)
val expected = relation.select('b, 'a).analyze
comparePlans(optimized, expected)
|
LGTM except one minor comment |
|
Test build #72520 has finished for PR 16757 at commit
|
|
Retest this please |
|
Test build #72521 has finished for PR 16757 at commit
|
|
Merging this to master. |
|
Test build #72526 has finished for PR 16757 at commit
|
|
This was merged to master only right? |
|
It had a merge conflict, so I opened: #16843 |
…timizer ## What changes were proposed in this pull request? The optimizer tries to remove redundant alias only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias only project, in this case the rewrite will break the tree. This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan. The current tree transformation infrastructure works very well if the transformation at hand requires little or a global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involves transversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan`to make this much more straightforward; this basically allows you to manually traverse the tree. This PR subsumes the following PRs by windpiger: Closes apache#16267 Closes apache#16255 ## How was this patch tested? I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases. Author: Herman van Hovell <[email protected]> Closes apache#16757 from hvanhovell/SPARK-18609.
What changes were proposed in this pull request?
The optimizer tries to remove redundant alias only projections from the query plan using the
RemoveAliasOnlyProjectrule. The current rule identifies removes such a project and rewrites the project's attributes in the entire tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias only project, in this case the rewrite will break the tree.This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.
The current tree transformation infrastructure works very well if the transformation at hand requires little or a global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involves transversing the query plan multiple times (which is super slow). I have moved around some code in
TreeNode,QueryPlanandLogicalPlanto make this much more straightforward; this basically allows you to manually traverse the tree.This PR subsumes the following PRs by @windpiger:
Closes #16267
Closes #16255
How was this patch tested?
I have added unit tests to
RemoveRedundantAliasAndProjectSuiteand I have added integration tests to theSQLQueryTestSuite.unionandSQLQueryTestSuite.ctetest cases.