[SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer #16757

hvanhovell · 2017-01-31T16:47:15Z

What changes were proposed in this pull request?

The optimizer tries to remove redundant alias-only projections from the query plan using the RemoveAliasOnlyProject rule. The current rule identifies and removes such a project and rewrites the project's attributes in the entire tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias-only project; in this case the rewrite will break the tree.
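As a rough illustration of this failure mode, here is a toy model in plain Scala. `Attr`, `Relation`, `Project`, `Join` and `naive` are hypothetical stand-ins invented for this sketch; they are not Catalyst classes and this is not the old rule's code. It only shows how a tree-wide attribute rewrite can leave both sides of a self join producing the same attribute.

```scala
// Toy model only: these are NOT Catalyst's classes.
object GlobalRewriteSketch {
  // An attribute is identified by its id; the name is only a label.
  final case class Attr(name: String, id: Int)

  sealed trait Plan { def output: Seq[Attr] }
  final case class Relation(output: Seq[Attr]) extends Plan
  // Each entry exposes an input attribute (first) as an output attribute (second).
  final case class Project(list: Seq[(Attr, Attr)], child: Plan) extends Plan {
    def output: Seq[Attr] = list.map(_._2)
  }
  final case class Join(left: Plan, right: Plan, on: (Attr, Attr)) extends Plan {
    def output: Seq[Attr] = left.output ++ right.output
  }

  // A project is alias-only if it re-exposes the child's output under the same names.
  def isAliasOnly(p: Project): Boolean =
    p.list.map(_._1) == p.child.output &&
      p.list.forall { case (in, out) => in.name == out.name }

  // Collect (alias output -> original attribute) pairs from the WHOLE tree.
  def collect(p: Plan): Map[Attr, Attr] = p match {
    case pr @ Project(list, child) if isAliasOnly(pr) =>
      collect(child) ++ list.map { case (in, out) => out -> in }
    case Project(_, child) => collect(child)
    case Join(l, r, _) => collect(l) ++ collect(r)
    case _: Relation => Map.empty
  }

  // Drop alias-only projects and rewrite references everywhere, with no notion of scope.
  def rewrite(p: Plan, m: Map[Attr, Attr]): Plan = p match {
    case pr @ Project(_, child) if isAliasOnly(pr) => rewrite(child, m)
    case Project(list, child) =>
      Project(list.map { case (in, out) => (m.getOrElse(in, in), out) }, rewrite(child, m))
    case Join(l, r, (a, b)) =>
      Join(rewrite(l, m), rewrite(r, m), (m.getOrElse(a, a), m.getOrElse(b, b)))
    case r: Relation => r
  }

  def naive(p: Plan): Plan = rewrite(p, collect(p))
}
```

In this toy, a self join whose right side is distinguished only by an alias-only project collapses into a join of the relation with itself: both sides output the same attribute and the join condition compares that attribute with itself.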

This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.

The current tree transformation infrastructure works very well if the transformation at hand requires little or only global contextual information. In this case we need to know both the attributes that are not to be moved and which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involve traversing the query plan multiple times (which is very slow). I have moved around some code in TreeNode, QueryPlan and LogicalPlan to make this much more straightforward; this basically allows you to manually traverse the tree.

This PR subsumes the following PRs by @windpiger:
Closes #16267
Closes #16255

How was this patch tested?

I have added unit tests to RemoveRedundantAliasAndProjectSuite and I have added integration tests to the SQLQueryTestSuite.union and SQLQueryTestSuite.cte test cases.

SparkQA · 2017-01-31T18:17:15Z

Test build #72202 has finished for PR 16757 at commit dac7ec9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-01-31T18:35:53Z

cc @cloud-fan @windpiger @sameeragarwal

SparkQA · 2017-01-31T20:48:22Z

Test build #72203 has finished for PR 16757 at commit 6aad5d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-01T17:19:54Z

Test build #72248 has finished for PR 16757 at commit 81f2fa5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-03T10:52:26Z

This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.

Where do you implement "traversing the parent tree"?

hvanhovell · 2017-02-03T11:11:42Z

I do not explicitly implement 'traversing the parent tree'. I have opened up a few methods in TreeNode and QueryPlan so you can write your own (recursive) tree traversal. In this case this allows me to set up the blacklist before transforming the child nodes, to determine which child attributes have changed, and to apply these changes to the parent node.
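The three steps named in this reply (set up the blacklist, find out which child attributes changed, apply the changes to the parent only) can be sketched with a hand-written recursion over a toy plan. Everything here (`Attr`, `Relation`, `Project`, `Join`, `changed`, `clean`) is a hypothetical stand-in for illustration, assuming attributes are identified by an id; it is not the PR's implementation and does not use Catalyst's API.

```scala
// Toy model only: these are NOT Catalyst's classes.
object RedundantAliasSketch {
  final case class Attr(name: String, id: Int)

  sealed trait Plan { def output: Seq[Attr] }
  final case class Relation(output: Seq[Attr]) extends Plan
  // Each entry exposes an input attribute (first) as an output attribute (second).
  final case class Project(list: Seq[(Attr, Attr)], child: Plan) extends Plan {
    def output: Seq[Attr] = list.map(_._2)
  }
  final case class Join(left: Plan, right: Plan, on: (Attr, Attr)) extends Plan {
    def output: Seq[Attr] = left.output ++ right.output
  }

  // Output attributes that differ between a node (current) and its rewrite (next).
  def changed(current: Plan, next: Plan): Map[Attr, Attr] =
    current.output.zip(next.output).filter { case (a, b) => a != b }.toMap

  def clean(plan: Plan, blacklist: Set[Attr]): Plan = plan match {
    case r: Relation => r
    case Join(l, r, (lk, rk)) =>
      // Step 1: neither side may start producing what the other side already outputs.
      val newLeft = clean(l, blacklist ++ r.output)
      val newRight = clean(r, blacklist ++ newLeft.output)
      // Step 2: find the child attributes that changed.
      val m = changed(l, newLeft) ++ changed(r, newRight)
      // Step 3: remap only this parent's own references.
      Join(newLeft, newRight, (m.getOrElse(lk, lk), m.getOrElse(rk, rk)))
    case Project(list, child) =>
      val newChild = clean(child, blacklist)
      val m = changed(child, newChild)
      val newList = list.map { case (in, out) =>
        val in2 = m.getOrElse(in, in)
        // Drop the alias when the name is unchanged and the input is not blacklisted.
        if (in2.name == out.name && !blacklist.contains(in2)) (in2, in2) else (in2, out)
      }
      Project(newList, newChild)
  }
}
```

With this shape, the alias on the right side of a toy self join survives because its input attribute is produced by the left side, while an alias that conflicts with nothing is removed and the join condition above it is remapped.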

cloud-fan · 2017-02-03T12:24:18Z

The new RemoveRedundantAliases rule looks convoluted. Is it possible to implement an isParent method with O(1) complexity on TreeNode? That could make the logic much simpler.

cloud-fan · 2017-02-03T15:23:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

  /**
-   * Returns true if the project list is semantically same as child output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.


looks like this doc is wrong?

cloud-fan · 2017-02-03T15:24:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-        case (a: Attribute, o) if a semanticEquals o => true
-        case _ => false
-      }
+  private def createAttributeMapping(current: LogicalPlan, next: LogicalPlan)


Can you explain what current and next mean here?

current is the plan before we remove redundant aliases, and next is the plan after we have removed the redundant aliases. I'll update the doc.

cloud-fan · 2017-02-03T15:26:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+  /**
+   * Get an appropriate alias cleaning method for the given node.
+   *
+   * We currently clean Project, Aggregate & Window nodes.


So this is an improvement, right? Previously we only cleaned Project. However, I think this method is over-engineered; we can just create a def needClean(plan: LogicalPlan): Boolean

Yeah that is an improvement. I added all LogicalPlan nodes that are producing new attributes using named expressions.

I will inline this method.

cloud-fan · 2017-02-03T15:30:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      case _ =>
+        // Drop blacklisted attributes that are masked in the current project. This allows us to
+        // remove redundant aliases in the subtree.
+        val childBlacklist = blacklist -- (plan.inputSet -- plan.outputSet)


Is this branch needed because Union reuses the output of the left side? Can we remove it if we fix Union?

You mean the case _ =>, right? That is needed for everything which is not a Join. We are doing manual tree traversal here.

Sorry, I mean childBlacklist. We can just use blacklist if Union is fixed, right?

The child blacklist is an optimization. I can remove an attribute from the child's blacklist if I know that it is being created in the current node. This way I give the rule more freedom in removing attributes. The thing is that this situation should only happen when there are multiple self joins, and this might be an over-optimization.
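The set arithmetic in the childBlacklist line from the diff can be tried in isolation. The sketch below uses plain `Set[String]` in place of Catalyst's attribute sets, and the attribute labels are made up for the example; it only mirrors the expression `blacklist -- (plan.inputSet -- plan.outputSet)`.

```scala
// Toy model only: plain string sets stand in for Catalyst's attribute sets.
object ChildBlacklistSketch {
  // Attributes that a node consumes but does not expose (inputSet -- outputSet)
  // are masked by that node, so they can be dropped from the child's blacklist.
  def childBlacklist(
      blacklist: Set[String],
      inputSet: Set[String],
      outputSet: Set[String]): Set[String] =
    blacklist -- (inputSet -- outputSet)
}
```

For a node that reads `a#1` and exposes only `a#2`, a blacklist of `{a#1}` becomes empty for the child; for a node that passes `a#1` through unchanged, the blacklist stays `{a#1}`.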

SparkQA · 2017-02-03T23:20:44Z

Test build #72323 has finished for PR 16757 at commit acbb9e0.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-02-04T11:08:07Z

retest this please

SparkQA · 2017-02-04T12:43:45Z

Test build #72372 has finished for PR 16757 at commit acbb9e0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-02-04T13:31:28Z

retest this please.

SparkQA · 2017-02-04T15:59:16Z

Test build #72374 has finished for PR 16757 at commit acbb9e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-06T07:27:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+        // Create a an expression cleaning function for nodes that can actually produce redundant
+        // aliases, use identity otherwise.
+        val clean: Expression => Expression = plan match {
+          case _: Project => removeRedundantAlias(_, blacklist)


What if we clean expressions for all nodes? Or, like the CleanupAliases rule, skip only ObjectConsumer, ObjectProducer and AppendColumns.

Actually can we merge this rule into CleanupAliases?

cloud-fan · 2017-02-07T16:48:25Z

...test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAliasAndProjectSuite.scala

-    val query = relation.select('b as 'b, 'a as 'a).analyze
-    val optimized = Optimize.execute(query)
+    val query = relation.select('b, 'a).analyze
+    val optimized = Optimize.execute(relation.select('b as 'b, 'a as 'a).analyze)


according to most of the optimizer unit tests, the preferred code style should be

    val query = relation.select('b as 'b, 'a as 'a).analyze
    val optimized = Optimize.execute(query)
    val expected = relation.select('b, 'a).analyze
    comparePlans(optimized, expected)

cloud-fan · 2017-02-07T16:51:01Z

LGTM except one minor comment

SparkQA · 2017-02-07T18:08:30Z

Test build #72520 has finished for PR 16757 at commit 23743e1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-02-07T18:12:48Z

Retest this please

SparkQA · 2017-02-07T19:34:14Z

Test build #72521 has finished for PR 16757 at commit 29c4696.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-02-07T20:28:54Z

Merging this to master.

SparkQA · 2017-02-07T20:42:34Z

Test build #72526 has finished for PR 16757 at commit 29c4696.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-08T03:58:29Z

This was merged to master only right?

hvanhovell · 2017-02-08T07:33:21Z

It had a merge conflict, so I opened: #16843

Author: Herman van Hovell

Closes apache#16757 from hvanhovell/SPARK-18609.

hvanhovell added 2 commits January 30, 2017 13:11

Open-up TreeNode's transform logic.

6c89a15

Split RemoveAliasOnlyProject into RemoveRedundantAliases and RemoveRedundantProject.

dac7ec9

Fix union.

6aad5d8

Improve test coverage

81f2fa5

hvanhovell changed the title ~~[SPARK-18609][SQL] Fix redundant Alias removal in the optimizer~~ [SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer Feb 1, 2017

cloud-fan reviewed Feb 3, 2017

Code review

acbb9e0

cloud-fan reviewed Feb 6, 2017

hvanhovell added 2 commits February 7, 2017 17:03

Merge remote-tracking branch 'apache-github/master' into SPARK-18609

3103d1b

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Update doc after CR

23743e1

cloud-fan reviewed Feb 7, 2017

Unit test should have similar styles.

29c4696

asfgit closed this in 73ee739 Feb 7, 2017
