
Conversation

@dongjoon-hyun (Member) commented Jul 29, 2016

What changes were proposed in this pull request?

This PR changes the CTE resolution rule to use only forward-declared tables, in order to prevent infinite loops. More specifically, the new logic is as follows (see the examples after the list).

  • Resolve the CTEs in WITH clauses first, before substituting them into the main SQL body.
  • When resolving a CTE, only forward-declared CTEs or base tables may be referenced.
    • Self-referencing is no longer allowed.
    • Cross-referencing is no longer allowed.
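
For illustration, the intended behavior looks like the following spark-shell session (a sketch, not output from the new test suite; the temporary view mirrors the one created in the new test cases, and the CTE names cte1/cte2 are hypothetical):

scala> sql("CREATE TEMPORARY VIEW t AS SELECT * FROM VALUES 0, 1, 2 AS t(id)")

scala> // the CTE body may reference the base table t
scala> sql("WITH cte1 AS (SELECT id FROM t) SELECT * FROM cte1")

scala> // self-reference: the inner t resolves to the base table, the outer t to the CTE
scala> sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")

scala> // cte2 may reference cte1 because cte1 is declared (and resolved) before it
scala> sql("WITH cte1 AS (SELECT id FROM t), cte2 AS (SELECT * FROM cte1) SELECT * FROM cte2")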

Reported Error Scenarios

scala> sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
java.lang.StackOverflowError
...
scala> sql("WITH t1 AS (SELECT * FROM t2), t2 AS (SELECT 2 FROM t1) SELECT * FROM t1, t2")
java.lang.StackOverflowError
...

Note that t, t1, and t2 are not declared in the database. Spark falls into an infinite loop before it even resolves the table names.
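
With this change, both queries are expected to fail cleanly during analysis instead of overflowing the stack, roughly as follows (a behavior sketch; the exact error messages may differ):

scala> sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
// analysis error: the inner t is no longer expanded with the CTE itself, so it is
// looked up as a base table, which does not exist

scala> sql("WITH t1 AS (SELECT * FROM t2), t2 AS (SELECT 2 FROM t1) SELECT * FROM t1, t2")
// analysis error: t2 is not forward-declared when t1 is resolved, so it is looked
// up as a (non-existent) base table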

How was this patch tested?

Pass the Jenkins tests with the two new test cases.

@SparkQA commented Jul 29, 2016

Test build #62995 has finished for PR 14397 at commit 5bca528.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @rxin.
Could you review this PR?

@hvanhovell (Contributor)

@dongjoon-hyun I think this has merit. I do have one question: what do other databases do, like PostgreSQL, MySQL, SQL Server, and others?

Contributor (inline review comment)

This integrates some of the functionality of the ResolveRelations rule, but not all of it (for instance, file-based data sources). Shouldn't we be consistent and (perhaps) integrate these rules?

@dongjoon-hyun (Member Author)

Thank you for the review, @hvanhovell.

For recursive CTE queries, traditional DBMSs support an optional RECURSIVE keyword that computes a closure, i.e., it iterates the subquery until the result converges. For example, PostgreSQL, MSSQL, and SQLite officially support the WITH RECURSIVE syntax.

Regarding the overlap with the ResolveRelations rule: this rule calls lookupRelation to check for name conflicts between the base relations and CTEs. As a side effect, it resolves the relations, too. Here are the reasons:

  • This is the simplest way to check for the conflict.
  • For performance, this patch calls lookupRelation only once per UnresolvedRelation.
    • The ResolveRelations rule does not try to resolve it again and will not complain about it.
  • The remaining general and exceptional cases, such as file-based data sources, should stay in ResolveRelations. We can add more data sources there, not here.

As for Hive, recursive queries and the RECURSIVE syntax are not supported.

@hvanhovell (Contributor)

@dongjoon-hyun

New behavior versus existing systems

I was not talking about recursive CTEs (which can be very useful in some cases). We are changing the behavior of the Analyzer, and this might surprise users. I would like to know whether there is a common approach to this among other systems, so we can justify the change in behavior.

Resolve Relations

The change you propose almost subsumes the ResolveRelations rule. I think it is correct to do this. However, I am not a big fan of duplicating the behavior in two rules. Do you think there is a way to get around this? Could we, for instance, centralize the logic in ResolveRelations?

@dongjoon-hyun (Member Author)

Oh, sorry for the misunderstanding, @hvanhovell. I think we can integrate ResolveRelations and this one. I'll be back after some testing this evening. As for the common approach, I will bring some references, too.

@dongjoon-hyun (Member Author) commented Aug 2, 2016

Hi, @hvanhovell. This does not seem to be clearly documented, so I did some comparisons.

First of all, in the main SQL body, CTE query names take precedence. However, table resolution in the main SQL body also depends on the CTE subqueries. The general Analyzer approach seems to

  • use a forward-declared (not self-referencing) base table or CTE name
    • In particular, within a WITH clause, resolution depends on the order of the CTE declarations (see the example below).
  • prevent the execution of cyclic queries by raising exceptions
    • Hive and Oracle raise exceptions; PostgreSQL uses the base tables if they exist.
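
As a concrete illustration of the order dependence (hypothetical CTE names; assuming no base tables named s1 or s2 exist):

scala> sql("WITH s1 AS (SELECT 1), s2 AS (SELECT * FROM s1) SELECT * FROM s2")
// s1 is forward-declared, so s2 reads the CTE s1

scala> sql("WITH s2 AS (SELECT * FROM s1), s1 AS (SELECT 1) SELECT * FROM s2")
// s1 is declared after s2, so the reference falls back to a base table named s1,
// and analysis fails because no such table exists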

The root cause of the previous Spark problems is using self- or backward-declared CTEs for table name resolution. I will revise this PR according to the above rules. Please comment on them.

@dongjoon-hyun (Member Author)

The above approach can also remove the duplicated scope issue between CTESubstitution and ResolveRelations.

@dongjoon-hyun (Member Author)

Hi, @hvanhovell.
Sorry for the late update. I have updated the PR description and code.
Could you review this PR again?

@SparkQA commented Aug 8, 2016

Test build #63371 has finished for PR 14397 at commit b10ee10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class With(child: LogicalPlan, cteRelations: Seq[(String, SubqueryAlias)]) extends UnaryNode

@dongjoon-hyun (Member Author)

Could you review this PR again, @hvanhovell?
Last time I made a big mistake, but I think I have now fixed it correctly, following your advice.

Member Author (inline review comment)

Here, the CTEs of the WITH clause are resolved sequentially before being applied to the main SQL body.

@hvanhovell (Contributor)

@dongjoon-hyun I'll take a look in the morning (CET)

@dongjoon-hyun (Member Author)

Thank you so much! Then, see you later. :)

Contributor (inline review comment)

relations is a map. The iteration order of a map is not necessarily the order in which the elements were added (use a map with more than 4 entries to see this in action, e.g. Seq.tabulate(5)(i => ('a' + i).toChar.toString -> i).toMap).
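
For instance, in the REPL (the exact hash-based key order may vary by Scala version, but it is generally not the insertion order):

scala> val pairs = Seq.tabulate(5)(i => ('a' + i).toChar.toString -> i)
scala> pairs.map(_._1)     // List(a, b, c, d, e): declaration order is preserved
scala> pairs.toMap.keys    // five entries produce a HashMap, so the iteration
                           // order is generally not a, b, c, d, e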

So I think we need to change the data type in With to accommodate this.

Member Author (inline review comment)

So I changed it to a Seq. The relations of With are now a sequence.

Contributor (inline review comment)

Never mind, you have already done that :)

Contributor (inline review comment)

This does not replace With in all cases. Could you remove the relations.nonEmpty guard?

Member Author (inline review comment)

For a WITH without relations, I thought we could skip it entirely, since there is nothing to be replaced.

Contributor (inline review comment)

Yeah, you are right about that. But the current code does not remove the With node if there are no relations.
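
To make the point concrete, the shape under discussion is roughly (a sketch of the pattern, not the exact diff):

case With(child, relations) if relations.nonEmpty =>
  // ...substitute the CTE definitions into child...

With the guard in place, a With node whose relations list is empty is never matched and therefore survives in the plan; removing the guard lets the rule rewrite every With into its (substituted) child.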

Member Author (inline review comment)

Oh, I see!

@hvanhovell (Contributor)

@dongjoon-hyun this is looking very promising. I left two small comments.

@dongjoon-hyun (Member Author)

Thank you, @hvanhovell!

Contributor (inline review comment)

resolved :+ r._1 -> ResolveRelations(substituteCTE(r._2, resolved))?

You could also deconstruct the r tuple for easier reading.
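
Spelled out, the suggestion slots into a fold over the declared CTEs, so that each definition is substituted and resolved against only the pairs accumulated so far (a sketch around the suggested line; the surrounding match and the substituteCTE signature are assumed, not the exact merged code):

case With(child, relations) =>
  substituteCTE(child, relations.foldLeft(Seq.empty[(String, LogicalPlan)]) {
    case (resolved, (name, relation)) =>
      resolved :+ name -> ResolveRelations(substituteCTE(relation, resolved))
  })

Deconstructing the tuple as (name, relation), as suggested, also avoids the r._1 / r._2 accessors.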

@hvanhovell (Contributor)

LGTM pending Jenkins.

@SparkQA commented Aug 10, 2016

Test build #63526 has finished for PR 14397 at commit d896afc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2016

Test build #63529 has finished for PR 14397 at commit f3a2cd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Rebased just to resolve conflicts.

@SparkQA commented Aug 11, 2016

Test build #63577 has finished for PR 14397 at commit 624bb3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 11, 2016

Test build #63578 has finished for PR 14397 at commit 178813e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor (inline review comment)

can you create a test in SQLQueryTestSuite instead?

Member Author (inline review comment)

Sure. I'll move this.

Member Author (inline review comment)

Hmm, @rxin.
SQLQueryTestSuite does not seem to support exception cases yet.

Member Author (inline review comment)

In this PR, exceptions are important and should be checked.

@dongjoon-hyun (Member Author)

Oh, exceptions are now supported as of 0db373a. Thanks, @rxin. I will try again.

@dongjoon-hyun (Member Author)

Hi, @rxin.
The test cases have now been moved into sql-tests with SQLQueryTestSuite.

@SparkQA commented Aug 11, 2016

Test build #63595 has finished for PR 14397 at commit 7f74ec7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @rxin. I moved the test case into the new suite.
Please let me know if there is anything more to do.
The new SQL test suite is great!

@@ -0,0 +1,14 @@
create temporary view t as select * from values 0, 1, 2 as t(id);
Contributor (inline review comment)

maybe CTE instead of "with"?

@dongjoon-hyun (Member Author), Aug 11, 2016 (inline review comment)

Here, t and t2 are base tables.
The following queries use CTEs and check whether the base table or the previous CTE is used correctly.

Member Author (inline review comment)

Ah, I see. You mean renaming with.sql to cte.sql.
No problem.

Member Author (inline review comment)

I will use cte.sql instead of with.sql.

@dongjoon-hyun (Member Author)

Thank you, @rxin. It's updated.

@SparkQA commented Aug 11, 2016

Test build #63637 has finished for PR 14397 at commit f1f4a83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @hvanhovell.
Could you review this again? Since you last looked, only the test case has been moved into a new test suite.

@hvanhovell (Contributor)

LGTM (again) - merging to master. Thanks!

@dongjoon-hyun (Member Author)

Thank you, @hvanhovell!

@asfgit closed this in 2a10513 on Aug 12, 2016
@dongjoon-hyun deleted the SPARK-16771-TREENODE branch on January 17, 2018, 09:41