Closed
@@ -26,7 +26,7 @@ import org.apache.spark.api.java.function.FilterFunction
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.{CatalystConf, SimpleCatalystConf}
import org.apache.spark.sql.catalyst.analysis._
-import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate._
import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
case plan: Project if plan eq proj => plan.child
Contributor

The problem is: this rule assumes that, if we find an alias-only project, e.g. alias a#1 to a#2, it's safe to remove this project and replace all a#2 with a#1 in this plan. However, this is not true for complex cases like https://github.com/apache/spark/pull/16255/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R2023 .

Let's see if there is a way to fix this problem entirely.

Contributor

A naive way to do this is to make sure we only replace attributes in the ancestor nodes of the alias-only project:

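plan transform {
  // Remove the alias-only project itself.
  case plan: Project if plan eq proj => plan.child
  // Replace attributes only in nodes that have the alias-only project as a descendant.
  case plan if plan.collect { case p if p eq proj => p }.nonEmpty => // do the replace
}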

This is very inefficient, because every node has to search its whole subtree for the alias-only project. Maybe we can improve TreeNode to maintain the parent-child relationship between nodes.
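
To make the ancestor-only idea concrete, here is a minimal sketch on a toy tree. It is not Spark code: `Plan`, `Relation`, `Project`, `Join`, and `removeAndRewrite` are hypothetical stand-ins, and attributes are plain strings. Instead of calling `collect` at every node, it does one bottom-up pass that reports whether the target project was found in each subtree, which is one possible way to avoid the repeated traversal.

```scala
object AncestorOnlyRewrite {
  // Toy stand-ins for Catalyst plan nodes (hypothetical, not Spark's classes).
  sealed trait Plan
  final case class Relation(output: Seq[String]) extends Plan
  final case class Project(refs: Seq[String], child: Plan) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan

  /** Removes `target` (matched by reference, not by equality) and applies
    * `rename` only to nodes that have `target` as a descendant.
    * Returns the rewritten subtree and whether `target` was found in it. */
  def removeAndRewrite(
      plan: Plan,
      target: Project,
      rename: Map[String, String]): (Plan, Boolean) =
    if (plan eq target) {
      (target.child, true)
    } else plan match {
      case r: Relation => (r, false)
      case Project(refs, child) =>
        val (newChild, found) = removeAndRewrite(child, target, rename)
        // Only ancestors of the removed project get their references renamed.
        val newRefs = if (found) refs.map(a => rename.getOrElse(a, a)) else refs
        (Project(newRefs, newChild), found)
      case Join(left, right) =>
        val (newLeft, foundLeft) = removeAndRewrite(left, target, rename)
        val (newRight, foundRight) = removeAndRewrite(right, target, rename)
        (Join(newLeft, newRight), foundLeft || foundRight)
    }

  def main(args: Array[String]): Unit = {
    val aliasOnly = Project(Seq("a#1 AS a#2"), Relation(Seq("a#1")))
    // The right side holds an equal but distinct alias-only project.
    val right = Project(Seq("a#2"), Project(Seq("a#1 AS a#2"), Relation(Seq("a#1"))))
    val plan = Join(Project(Seq("a#2"), aliasOnly), right)
    val (rewritten, found) = removeAndRewrite(plan, aliasOnly, Map("a#2" -> "a#1"))
    println(s"found=$found, plan=$rewritten")
  }
}
```

In this sketch the left branch, which sits above the removed project, ends up reading `a#1`, while the right branch keeps both its own alias and its `a#2` reference, because it is not an ancestor of the removed node.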

Contributor Author

@windpiger Dec 14, 2016

It is safe to replace attributes only in the ancestor nodes.
But what about an Alias with the same exprId that is not the same object? Replacing that alias with its child is not safe, right? For example:
Project [col#9 AS col#6] -> Project [col#9]

Contributor

what do you mean?

Contributor Author

case plan => plan transformExpressions {
  case a: Attribute if attrMap.contains(a) => attrMap(a)
  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
    && b.child.isInstanceOf[NamedExpression] => b.child
Contributor

How do you reason about this? Why do we treat Alias differently here?

Contributor Author

As you said, the rule assumes that if we find an alias-only project, e.g. alias a#1 to a#2, it's safe to remove this project and replace all a#2 with a#1 in this plan. But another Alias that also aliases a#1 to a#2, and is not the same object as the first one, will not be processed.

The logic here handles that situation.

Contributor

I don't get it. For alias a#1 to a#2, we want to replace all a#2 with a#1, so we will do nothing for the alias a#1 to a#2 itself, because we can't find an attribute a#2 in it.

Contributor Author

@windpiger Jan 13, 2017

RemoveAliasOnlyProject will remove the alias a#1 to a#2 and replace all a#2 with a#1, so no reference to a#2 remains. If we do nothing for another alias a#1 to a#2 (not the same object as the removed one), it will cause the exception shown from step 5 to step 6 in the comment above.
@cloud-fan
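
The scenario can be illustrated with a toy model. This is only a sketch of one reading of the problem, not Spark code and not the exact plan from the failing query: `Plan`, `Relation`, `Project`, `unresolved`, and `renameEverywhere` are hypothetical names, and each projection entry is an (input attribute, output attribute) pair, so a plain attribute reference has equal input and output while an alias does not.

```scala
object GlobalRenameSketch {
  // Toy stand-ins for Catalyst plan nodes (hypothetical, not Spark's classes).
  sealed trait Plan
  final case class Relation(output: Set[String]) extends Plan
  final case class Project(exprs: Seq[(String, String)], child: Plan) extends Plan

  /** Attributes a node produces. */
  def output(p: Plan): Set[String] = p match {
    case Relation(out)     => out
    case Project(exprs, _) => exprs.map(_._2).toSet
  }

  /** Attributes that some Project reads but its child does not produce. */
  def unresolved(p: Plan): Set[String] = p match {
    case Relation(_) => Set.empty
    case Project(exprs, child) =>
      (exprs.map(_._1).toSet -- output(child)) ++ unresolved(child)
  }

  /** Rewrites plain attribute references in every node, leaving aliases alone. */
  def renameEverywhere(p: Plan, rename: Map[String, String]): Plan = p match {
    case r: Relation => r
    case Project(exprs, child) =>
      val rewritten = exprs.map {
        case (in, out) if in == out && rename.contains(in) => (rename(in), rename(in))
        case other => other
      }
      Project(rewritten, renameEverywhere(child, rename))
  }

  def main(args: Array[String]): Unit = {
    // A second "a#1 AS a#2" alias that was not the project being removed.
    val branch = Project(Seq("a#2" -> "a#2"),
      Project(Seq("a#1" -> "a#2"), Relation(Set("a#1"))))
    val renamed = renameEverywhere(branch, Map("a#2" -> "a#1"))
    println(s"before: ${unresolved(branch)}, after: ${unresolved(renamed)}")
  }
}
```

In this model the branch resolves cleanly before the rewrite. After the plan-wide rename, the outer project reads `a#1` while its child still outputs only `a#2`, which is the kind of dangling reference described above.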

Contributor

I know how the failure happens and that this can fix it, but I think it's too hacky and does not address the root cause. https://github.com/apache/spark/pull/16255/files#r92348878 explains clearly why the failure happens and how to fix it. Can you make your fix just as easy for other people to understand?

}
}
}.getOrElse(plan)
@@ -2011,6 +2011,22 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
    }
  }

  test("test CTE with join between two tables with the same column name") {
    sql("DROP TABLE IF EXISTS p1")
    sql("DROP TABLE IF EXISTS p2")
    sql("CREATE TABLE p1 (col String)")
    sql("CREATE TABLE p2 (col String)")

    assert(
      sql(
        """
          | WITH CTE AS
          | (SELECT s2.col as col FROM p1
          | CROSS JOIN (SELECT e.col as col FROM p2 E) s2)
          | SELECT T1.col as c1,T2.col as c2 FROM CTE T1 CROSS JOIN CTE T2
        """.stripMargin).collect.isEmpty)
  }

  def testCommandAvailable(command: String): Boolean = {
    val attempt = Try(Process(command).run(ProcessLogger(_ => ())).exitValue())
    attempt.isSuccess && attempt.get == 0