[SPARK-18841][SQL]fix PushProjectionThroughUnion throw exception when there are same column name #16267

windpiger · 2016-12-13T14:53:29Z

What changes were proposed in this pull request?

a union SQL with the same column name, after apply the rule RemoveAliasOnlyProject&PushProjectionThroughUnion many times, it will throw a exception

The reason is that RemoveAliasOnlyProject rule remove the right project child(alias only project) of Union, and replace the attribute of Union & the left project child of Union, then apply PushProjectionThroughUnion rule ,as the output attributes of a union are always equal to the left child's output, so it will throw a exception that the left child do not contain an attribute of the Union(RemoveAliasOnlyProject has replace it with right column).

for example:

    >spark.sql("DROP TABLE IF EXISTS p1")
    >spark.sql("DROP TABLE IF EXISTS p2")
    >spark.sql("DROP TABLE IF EXISTS p3")
    >spark.sql("CREATE TABLE p1 (col int)" )
    >spark.sql("CREATE TABLE p2 (col int)")
    >spark.sql("CREATE TABLE p3 (col int)")
    >spark.sql("set spark.sql.crossJoin.enabled = true")
    >spark.sql("SELECT 1 as cste,col FROM (SELECT col as col FROM (SELECT p1.col as col FROM p1 LEFT JOIN p2 UNION ALL SELECT col FROM p3 ) T1) T2").show

exception:

key not found: col#16
java.util.NoSuchElementException: key not found: col#16
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)

How was this patch tested?

unit test added

… there are same column name

SparkQA · 2016-12-13T15:08:51Z

Test build #70086 has finished for PR 16267 at commit b7b27af.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-13T17:24:20Z

Test build #70087 has finished for PR 16267 at commit ab6d8d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

windpiger · 2016-12-14T02:24:09Z

cc @hvanhovell @cloud-fan

…timizer ## What changes were proposed in this pull request? The optimizer tries to remove redundant alias only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias only project, in this case the rewrite will break the tree. This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan. The current tree transformation infrastructure works very well if the transformation at hand requires little or a global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involves transversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan`to make this much more straightforward; this basically allows you to manually traverse the tree. This PR subsumes the following PRs by windpiger: Closes apache#16267 Closes apache#16255 ## How was this patch tested? I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases. Author: Herman van Hovell <[email protected]> Closes apache#16757 from hvanhovell/SPARK-18609.

windpiger added 3 commits December 13, 2016 22:34

[SPARK-18841][SQL]fix PushProjectionThroughUnion throw exception when…

b7b27af

… there are same column name

fix a comment

7443eaa

fix a style

ab6d8d4

hvanhovell mentioned this pull request Feb 1, 2017

[SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer #16757

Closed

asfgit closed this in 73ee739 Feb 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-18841][SQL]fix PushProjectionThroughUnion throw exception when there are same column name #16267

[SPARK-18841][SQL]fix PushProjectionThroughUnion throw exception when there are same column name #16267

Uh oh!

windpiger commented Dec 13, 2016 •

edited

Loading

Uh oh!

SparkQA commented Dec 13, 2016

Uh oh!

SparkQA commented Dec 13, 2016

Uh oh!

windpiger commented Dec 14, 2016 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[SPARK-18841][SQL]fix PushProjectionThroughUnion throw exception when there are same column name #16267

[SPARK-18841][SQL]fix PushProjectionThroughUnion throw exception when there are same column name #16267

Uh oh!

Conversation

windpiger commented Dec 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Dec 13, 2016

Uh oh!

SparkQA commented Dec 13, 2016

Uh oh!

windpiger commented Dec 14, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

windpiger commented Dec 13, 2016 •

edited

Loading

windpiger commented Dec 14, 2016 •

edited

Loading