[SPARK-18609][SQL]Fix when CTE with Join between two table with same column name #16255

windpiger · 2016-12-12T14:51:03Z

What changes were proposed in this pull request?

After optimize the SQL with CTE and Join between two tables with the same column name, there will throw a exception when the Physical Plan executed.
SQL:

>spark.sql("CREATE TABLE p1 (col STRING)"
>spark.sql("CREATE TABLE p2 (col STRING) "
>spark.sql("set spark.sql.crossJoin.enabled = true")
>spark.sql("WITH CTE AS (SELECT s2.col as col FROM p1 CROSS JOIN (SELECT e.col as col FROM p2 E) s2) SELECT T1.col as c1, T2.col as c2 FROM CTE T1 CROSS JOIN CTE T2")

before fixed the SQL explained:

== Physical Plan ==
*Project [col#123 AS c1#104, col#126 AS c2#105]
+- CartesianProduct
   :- CartesianProduct
   :  :- HiveTableScan MetastoreRelation default, p1
   :  +- HiveTableScan [col#123], MetastoreRelation default, p2, E
   +- *!Project [col#123 AS col#126]
      +- CartesianProduct
         :- HiveTableScan MetastoreRelation default, p1
         +- HiveTableScan MetastoreRelation default, p2, E

when execute the Physical Plan above, it will throw exception because of there is no attribute for col#123 to bind( the rules ColumnPruning/RemoveAliasOnlyProject are the root cause):
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: col#123 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:285) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:275) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:61) at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:60) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

after fixed the SQL explained:

== Physical Plan ==
*Project [col#12 AS c1#0, col#12 AS c2#1]
+- CartesianProduct
   :- CartesianProduct
   :  :- HiveTableScan MetastoreRelation default, p1
   :  +- HiveTableScan [col#12], MetastoreRelation default, p2
   +- CartesianProduct
      :- HiveTableScan MetastoreRelation default, p1
      +- HiveTableScan [col#12], MetastoreRelation default, p2

and the result is

+---+---+
| c1| c2|
+---+---+
+---+---+

How was this patch tested?

unit test added

…same column name

SparkQA · 2016-12-12T16:36:12Z

Test build #70021 has finished for PR 16255 at commit a85db5e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-12-12T23:47:47Z

@windpiger I am not sure that this is an analysis problem. I suspect that optimizer is messing up the query. Could you pinpoint the rule that causes the issue? Tip: in the spark shell (./build/sbt sparkShell) use sc.setLogLevel("TRACE"), and that will print all rules being applied.

windpiger · 2016-12-13T01:16:41Z

ok, the optimizer rule apply as follows:

before optimizer:

Project [col#7 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Project [col#6 AS col#7]
: +- Join Cross
: :- MetastoreRelation default, p1
: +- Project [col#9 AS col#6]
: +- MetastoreRelation default, p2
+- Project [col#6 AS col#10]
+- Join Cross
:- MetastoreRelation default, p1
+- Project [col#9 AS col#6]
+- MetastoreRelation default, p2

after first ColumnPruning:

Project [col#7 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Project [col#6 AS col#7]
: +- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- Project [col#9 AS col#6]
: +- MetastoreRelation default, p2
+- Project [col#6 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project [col#9 AS col#6]
+- MetastoreRelation default, p2

after first RemoveAliasOnlyProject:

Project [col#6 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- Project [col#9 AS col#6]
: +- MetastoreRelation default, p2
+- Project [col#6 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project [col#9 AS col#6]
+- MetastoreRelation default, p2
`

after second ColumnPruning:

Project [col#6 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- Project [col#9 AS col#6]
: +- MetastoreRelation default, p2
+- Project [col#6 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project [col#9 AS col#6]
+- MetastoreRelation default, p2

after second RemoveAliasOnlyProject:

Project [col#9 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- MetastoreRelation default, p2
+- !Project [col#9 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project [col#9 AS col#6]
+- MetastoreRelation default, p2

here

+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project [col#9 AS col#6]
+- MetastoreRelation default, p2

col#6 should not be existed after using col#9 to replace col#6 in this RemoveAliasOnlyProject.
Project [col#9 AS col#6] can be processed to Project [col#9]

after third ColumnPruning:

Project [col#9 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- MetastoreRelation default, p2
+- !Project [col#9 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project
+- Project
+- MetastoreRelation default, p2

here

+- !Project [col#9 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project
+- Project
+- MetastoreRelation default, p2

col#9 has no attribute to bind when do BindRefference

after CollapseProject:

Project [col#9 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- MetastoreRelation default, p2
+- !Project [col#9 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project
+- MetastoreRelation default, p2

when optimizer finished, the logic plan is

Project [col#9 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- MetastoreRelation default, p2
+- !Project [col#9 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project
+- MetastoreRelation default, p2

SparkQA · 2016-12-13T03:08:05Z

Test build #70048 has finished for PR 16255 at commit f8d602a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

windpiger · 2016-12-13T15:01:17Z

@hvanhovell please help to review this, thanks a lot!

hvanhovell · 2016-12-13T15:05:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-    case p @ Project(_, child) if sameOutput(child.output, p.output) => child
-
+    case p @ Project(_, child) if sameOutput(child.output, p.output) =>
+      if (child.isInstanceOf[CatalogRelation]) p else child


Why is a CatalogRelation different?

This works in your example case, but will fail for a more complex query.

The problem is that the transform on L199 also modifies parts of the tree, that are totally unrelated to the projection at hand. In this case right side of the join. This can currently only happen in case of a self-join on a CTE, such a tree can contain duplicate attribute ids because we expand the tree after we have analyzed the CTE.

yes, it will transform the right side of the join, and the problem happened when replace the right side of the join. Above step 5, it apply RemoveAliasOnlyProject rule on step 4's plan, which replace col#6 with col#9, but it does not have effect on Project [col#9 AS col#6] of the right side of the join. I think the right side Project [col#9 AS col#6] should be changed to Project [col#9]

Project [col#9 AS c1#4, col#10 AS c2#5] +- Join Cross :- Join Cross : :- Project : : +- MetastoreRelation default, p1 : +- MetastoreRelation default, p2 +- !Project [col#9 AS col#10] +- Join Cross :- Project : +- MetastoreRelation default, p1 +- Project [col#9 AS col#6] +- MetastoreRelation default, p2

SparkQA · 2016-12-14T04:24:06Z

Test build #70114 has finished for PR 16255 at commit 0413f9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-12-14T09:15:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

      }
      val attrMap = AttributeMap(attributesToReplace)
      plan transform {
        case plan: Project if plan eq proj => plan.child


The problem is: this rule assumes that, if we find an alias-only project, e.g. alias a#1 to a#2, it's safe to remove this project and replace all a#2 with a#1 in this plan. However, this is not true for complex cases like https://github.com/apache/spark/pull/16255/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R2023 .

Let's see if there is a way to fix this problem entirely.

a naive way to do this is, make sure we only replace attributes in the ancestor nodes of the alias-only project:

plan transform { case plan: Project if plan eq proj => plan.child case plan if plan.collect { case p if p eq project }.nonEmpty => // do the replace }

It's very inefficient, maybe we can improve TreeNode to maintain the parent-child relationship between nodes.

it is safe to only replace attributes in the ancestor nodes.
Alias with the same exprId but not the same object, replace the alias with it's child. it is not safe ,right?
Project [col#9 AS col#6] -> Project [col#9]

what do you mean?

sorry, describe it clearly,this is not safe? https://github.com/windpiger/spark/blob/0413f9dad4ad1294e3400dc0f42f66529b1b055b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L203-L204

cloud-fan · 2017-01-03T05:53:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

        case plan => plan transformExpressions {
          case a: Attribute if attrMap.contains(a) => attrMap(a)
+          case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
+            && b.child.isInstanceOf[NamedExpression] => b.child


How do you reason about this? why we treat Alias differently here?

As you said, if we find an alias-only project, e.g. alias a#1 to a#2, it's safe to remove this project and replace all a#2 with a#1 in this plan. So another Alias which is also alias a#1 to a#2, but not the same object with the first one, it will not be processed.

here, the logic shows that we process the situation.

I don't get it, for alias a#1 to a#2, we wanna replace all a#2 with a#1, so we will do nothing for alias a#1 to a#2, because we can't find an attribute a#2

RemoveAliasOnlyProject will remove alias a#1 to a#2, and replace all a#2 with a#1, so there is no a#2 exists, If we do nothing for alias a#1 to a#2(not the same object with the removed one), it will cause the exception situation from step 5 to step 6 showed on the above comment.
@cloud-fan

I know how the failure happens and this can fix it, but I think it's too hacky and does not catch the root cause. https://github.com/apache/spark/pull/16255/files#r92348878 easily explains why the failure happens and how to fix it, can you make other people understand your fix easily?

…timizer ## What changes were proposed in this pull request? The optimizer tries to remove redundant alias only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias only project, in this case the rewrite will break the tree. This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan. The current tree transformation infrastructure works very well if the transformation at hand requires little or a global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involves transversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan`to make this much more straightforward; this basically allows you to manually traverse the tree. This PR subsumes the following PRs by windpiger: Closes apache#16267 Closes apache#16255 ## How was this patch tested? I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases. Author: Herman van Hovell <[email protected]> Closes apache#16757 from hvanhovell/SPARK-18609.

[SPARK-18609][SQL][WIP]Fix when CTE with Join between two table with …

a85db5e

…same column name

windpiger changed the title ~~[SPARK-18609][SQL][WIP]Fix when CTE with Join between two table with same column~~ [SPARK-18609][SQL]Fix when CTE with Join between two table with same column Dec 12, 2016

windpiger changed the title ~~[SPARK-18609][SQL]Fix when CTE with Join between two table with same column~~ [SPARK-18609][SQL]Fix when CTE with Join between two table with same column name Dec 12, 2016

fix when alias's child is not a NamedExpression

f8d602a

hvanhovell reviewed Dec 13, 2016

View reviewed changes

remove a relation condition in RemoveAliasOnlyProject

0413f9d

cloud-fan reviewed Dec 14, 2016

View reviewed changes

cloud-fan reviewed Jan 3, 2017

View reviewed changes

hvanhovell mentioned this pull request Feb 1, 2017

[SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer #16757

Closed

asfgit closed this in 73ee739 Feb 7, 2017

[SPARK-18609][SQL]Fix when CTE with Join between two table with same column name #16255

[SPARK-18609][SQL]Fix when CTE with Join between two table with same column name #16255

Uh oh!

Conversation

windpiger commented Dec 12, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Dec 12, 2016

Uh oh!

hvanhovell commented Dec 12, 2016

Uh oh!

windpiger commented Dec 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SparkQA commented Dec 13, 2016

Uh oh!

windpiger commented Dec 13, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Dec 14, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

windpiger Dec 14, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

windpiger Jan 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

windpiger commented Dec 13, 2016 •

edited

Loading

windpiger Dec 14, 2016 •

edited

Loading

windpiger Jan 13, 2017 •

edited

Loading