Skip to content

Conversation

@windpiger
Copy link
Contributor

@windpiger windpiger commented Dec 13, 2016

What changes were proposed in this pull request?

a union SQL with the same column name, after apply the rule RemoveAliasOnlyProject&PushProjectionThroughUnion many times, it will throw a exception

The reason is that RemoveAliasOnlyProject rule remove the right project child(alias only project) of Union, and replace the attribute of Union & the left project child of Union, then apply PushProjectionThroughUnion rule ,as the output attributes of a union are always equal to the left child's output, so it will throw a exception that the left child do not contain an attribute of the Union(RemoveAliasOnlyProject has replace it with right column).

for example:

    >spark.sql("DROP TABLE IF EXISTS p1")
    >spark.sql("DROP TABLE IF EXISTS p2")
    >spark.sql("DROP TABLE IF EXISTS p3")
    >spark.sql("CREATE TABLE p1 (col int)" )
    >spark.sql("CREATE TABLE p2 (col int)")
    >spark.sql("CREATE TABLE p3 (col int)")
    >spark.sql("set spark.sql.crossJoin.enabled = true")
    >spark.sql("SELECT 1 as cste,col FROM (SELECT col as col FROM (SELECT p1.col as col FROM p1 LEFT JOIN p2 UNION ALL SELECT col FROM p3 ) T1) T2").show

exception:

key not found: col#16
java.util.NoSuchElementException: key not found: col#16
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)

How was this patch tested?

unit test added

@SparkQA
Copy link

SparkQA commented Dec 13, 2016

Test build #70086 has finished for PR 16267 at commit b7b27af.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 13, 2016

Test build #70087 has finished for PR 16267 at commit ab6d8d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Copy link
Contributor Author

windpiger commented Dec 14, 2016

cc @hvanhovell @cloud-fan

@asfgit asfgit closed this in 73ee739 Feb 7, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…timizer

## What changes were proposed in this pull request?
The optimizer tries to remove redundant alias only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE)  and the duplicated part contains the alias only project, in this case the rewrite will break the tree.

This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.

The current tree transformation infrastructure works very well if the transformation at hand requires little or a global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involves transversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan`to make this much more straightforward; this basically allows you to manually traverse the tree.

This PR subsumes the following PRs by windpiger:
Closes apache#16267
Closes apache#16255

## How was this patch tested?
I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases.

Author: Herman van Hovell <[email protected]>

Closes apache#16757 from hvanhovell/SPARK-18609.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants