Skip to content

Conversation

@hvanhovell
Copy link
Contributor

@hvanhovell hvanhovell commented Jan 31, 2017

What changes were proposed in this pull request?

The optimizer tries to remove redundant alias only projections from the query plan using the RemoveAliasOnlyProject rule. The current rule identifies removes such a project and rewrites the project's attributes in the entire tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias only project, in this case the rewrite will break the tree.

This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.

The current tree transformation infrastructure works very well if the transformation at hand requires little or a global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involves transversing the query plan multiple times (which is super slow). I have moved around some code in TreeNode, QueryPlan and LogicalPlanto make this much more straightforward; this basically allows you to manually traverse the tree.

This PR subsumes the following PRs by @windpiger:
Closes #16267
Closes #16255

How was this patch tested?

I have added unit tests to RemoveRedundantAliasAndProjectSuite and I have added integration tests to the SQLQueryTestSuite.union and SQLQueryTestSuite.cte test cases.

@SparkQA
Copy link

SparkQA commented Jan 31, 2017

Test build #72202 has finished for PR 16757 at commit dac7ec9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Jan 31, 2017

Test build #72203 has finished for PR 16757 at commit 6aad5d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell hvanhovell changed the title [SPARK-18609][SQL] Fix redundant Alias removal in the optimizer [SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer Feb 1, 2017
@SparkQA
Copy link

SparkQA commented Feb 1, 2017

Test build #72248 has finished for PR 16757 at commit 81f2fa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.

Where do you implement "traversing the parent tree"?

@hvanhovell
Copy link
Contributor Author

I do not explicitly implement 'traversing the parent tree'. I have opened up a few methods in TreeNode and QueryPlan so you can write your own (recursive) tree traversal. In this case this allows me to setup the blacklist before transforming the child nodes, to get determine which child attributes have changed, and to apply these changes to the parent node.

@cloud-fan
Copy link
Contributor

The new RemoveRedundantAliases rule looks convoluted, is it possible to implement an O(1) complex isParent method on TreeNode? that could make the logic much simpler.

/**
* Returns true if the project list is semantically same as child output, after strip alias on
* attribute.
* Replace the attributes in an expression using the given mapping.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this doc is wrong?

case (a: Attribute, o) if a semanticEquals o => true
case _ => false
}
private def createAttributeMapping(current: LogicalPlan, next: LogicalPlan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain what current and next means here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current is plan before we remove redundant aliases, and next is the plan after we have remove the redundant aliases. I'll update the doc.

/**
* Get an appropriate alias cleaning method for the given node.
*
* We currently clean Project, Aggregate & Window nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is an improvement right? previously we only clean Project. However I think this method is over engineered, we can just create a def needClean(plan: LogicalPlan): Boolean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is an improvement. I added all LogicalPlan nodes that are producing new attributes using named expressions.

I will inline this method.

case _ =>
// Drop blacklisted attributes that are masked in the current project. This allows us to
// remove redundant aliases in the subtree.
val childBlacklist = blacklist -- (plan.inputSet -- plan.outputSet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this branch needed because Union reuse the output of left side? can we remove it if we fix Union?

Copy link
Contributor Author

@hvanhovell hvanhovell Feb 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the case _ => right? That is needed for everything which is not a Join. We are doing manual tree traversal here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I mean childBlacklist. We can just use blacklist if Union is fixed right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The child blacklist is an optimization. I can remove an attribute from the child's blacklist if I know that it is being created in the current node. This way I give the rule more freedom in removing attributes. The thing is that situation should only happen when there are multiple self joins, and this might be an over optimization.

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72323 has finished for PR 16757 at commit acbb9e0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 4, 2017

Test build #72372 has finished for PR 16757 at commit acbb9e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Feb 4, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Feb 4, 2017

Test build #72374 has finished for PR 16757 at commit acbb9e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Create a an expression cleaning function for nodes that can actually produce redundant
// aliases, use identity otherwise.
val clean: Expression => Expression = plan match {
case _: Project => removeRedundantAlias(_, blacklist)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we clean expressions for all nodes? Or like rule CleanupAliases that we only skip for ObjectConsumer, ObjectProducer and AppendColumns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually can we merge this rule into CleanupAliases?

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
val query = relation.select('b as 'b, 'a as 'a).analyze
val optimized = Optimize.execute(query)
val query = relation.select('b, 'a).analyze
val optimized = Optimize.execute(relation.select('b as 'b, 'a as 'a).analyze)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to most of the optimizer unit tests, the preferred code style should be

val query = relation.select('b as 'b, 'a as 'a).analyze
val optimized = Optimize.execute(query)
val expected = relation.select('b, 'a).analyze
comparePlans(optimized, expected)

@cloud-fan
Copy link
Contributor

LGTM except one minor comment

@SparkQA
Copy link

SparkQA commented Feb 7, 2017

Test build #72520 has finished for PR 16757 at commit 23743e1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

Retest this please

@SparkQA
Copy link

SparkQA commented Feb 7, 2017

Test build #72521 has finished for PR 16757 at commit 29c4696.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

hvanhovell commented Feb 7, 2017

Merging this to master.

@SparkQA
Copy link

SparkQA commented Feb 7, 2017

Test build #72526 has finished for PR 16757 at commit 29c4696.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 73ee739 Feb 7, 2017
@cloud-fan
Copy link
Contributor

This was merged to master only right?

@hvanhovell
Copy link
Contributor Author

It had a merge conflict, so I opened: #16843

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…timizer

## What changes were proposed in this pull request?
The optimizer tries to remove redundant alias only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE)  and the duplicated part contains the alias only project, in this case the rewrite will break the tree.

This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.

The current tree transformation infrastructure works very well if the transformation at hand requires little or a global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involves transversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan`to make this much more straightforward; this basically allows you to manually traverse the tree.

This PR subsumes the following PRs by windpiger:
Closes apache#16267
Closes apache#16255

## How was this patch tested?
I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases.

Author: Herman van Hovell <[email protected]>

Closes apache#16757 from hvanhovell/SPARK-18609.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants