[SPARK-36447][SQL] Avoid inlining non-deterministic With-CTEs #33671
Conversation
Maybe we could use substituted.resolveOperatorsWithPruning(_ => !done) { to break out?
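For context, a minimal sketch of what that pruning-based early exit could look like; the `done` flag, the `substituteOnce` wrapper and the `substitute` placeholder are illustrative assumptions, not the PR's actual code:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, With}

// Placeholder so the sketch is self-contained; the real logic lives in CTESubstitution.
def substitute(w: With): LogicalPlan = w.child

// Hypothetical sketch: once the first With has been handled, flip `done` so the
// pruning condition `_ => !done` keeps the rule from being applied to the rest of
// the tree.
def substituteOnce(substituted: LogicalPlan): LogicalPlan = {
  var done = false
  substituted.resolveOperatorsWithPruning(_ => !done) {
    case w: With =>
      done = true
      substitute(w)
  }
}
```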
Thanks @maryannxue for pinging me. Unfortunately, I can take a closer look at this PR only the week after next...
Never mind, @peter-toth! Hopefully the added logical nodes make sense to you. We can always improve the implementation later on.
sigmod left a comment:
LGTM.
      legacyTraverseAndSubstituteCTE(plan)
    case LegacyBehaviorPolicy.CORRECTED =>
      traverseAndSubstituteCTE(plan)
      val isCommand = plan.find {
plan.collectFirst would be quicker and less memory-hungry, wouldn't it?
Maybe I'm missing something, but I don't see how.
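For reference, a sketch of the two forms under discussion; on Catalyst `TreeNode`s both stop descending at the first matching node, which is presumably why neither is obviously cheaper here (the `Command` predicate is only an illustration, not the PR's exact check):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

def containsCommand(plan: LogicalPlan): Boolean = {
  // TreeNode.find: returns the first node satisfying the predicate.
  val viaFind = plan.find {
    case _: Command => true
    case _ => false
  }.isDefined

  // TreeNode.collectFirst: returns the first node the partial function accepts.
  val viaCollectFirst = plan.collectFirst { case c: Command => c }.isDefined

  // Both traversals short-circuit once a match is found.
  assert(viaFind == viaCollectFirst)
  viaFind
}
```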
case other =>
  other.transformExpressionsWithPruning(_.containsAllPatterns(PLAN_EXPRESSION, CTE)) {
    case e: SubqueryExpression => e.withNewPlan(resolveWithCTE(e.plan, cteDefMap))
If the main query has more than one subquery, then when resolving the second subquery the cteDefMap will contain CTE defs from the first subquery. I think we should clone the map here?
It's not necessary, right? The real resolution has already happened in CTESubstitution. Now there's a strict 1-to-1 ID mapping, so at worst the map contains some unrelated CTE defs.
oh I see, makes sense
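To illustrate the point, a hypothetical sketch of the kind of ID-based lookup being discussed (simplified, not the actual rule): a ref is only rewritten when its own `cteId` is present in the map, so defs collected while resolving an earlier subquery are simply never matched.

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.plans.logical.{CTERelationDef, CTERelationRef}

def resolveRef(
    ref: CTERelationRef,
    cteDefMap: mutable.HashMap[Long, CTERelationDef]): CTERelationRef = {
  cteDefMap.get(ref.cteId) match {
    case Some(cteDef) if cteDef.resolved =>
      // Strictly ID-based: only the def registered under this ref's own id is used.
      ref.copy(_resolved = true, output = cteDef.output)
    case _ =>
      ref // id unknown (or def not resolved yet): leave the ref untouched
  }
}
```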
.transformExpressionsWithPruning(_.containsAllPatterns(PLAN_EXPRESSION, CTE)) {
  case e: SubqueryExpression =>
    val forceInline =
      e.plan.find(_.expressions.exists(_.isInstanceOf[OuterReference])).nonEmpty
There is an easier way to check for a correlated subquery: `e.outerAttrs.nonEmpty`
 * @param statsOpt The optional statistics inferred from the corresponding CTE definition.
 */
case class CTERelationRef(
  cteId: Long,
nit: 4 spaces indentation
val newLogicalPlan = logicalPlan.transformDown {
  case p if p.eq(logicalNode) => newLogicalNode
}
assert(newLogicalPlan != logicalPlan,
Why is this assert removed?
Because stages in CTE defs with multiple references can be "replaced" more than once, but those are just the reuses of the same exchange, so we can ignore them.
@@ -4215,6 +4215,255 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
      }
    }
  }

  test("SPARK-36447: non-deterministic CTE dedup") {
Shall we create a new CTEInlineSuite?
and run it twice with AQE on and off
Good idea
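A sketch of what that suite layout could look like, with shared cases in an abstract base run twice under AQE off and on; class names and the test body are assumptions following common Spark test conventions, not the final code:

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.execution.adaptive.{DisableAdaptiveExecutionSuite, EnableAdaptiveExecutionSuite}
import org.apache.spark.sql.test.SharedSparkSession

// Shared test cases live in an abstract base class...
abstract class CTEInlineSuiteBase extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("SPARK-36447: non-deterministic CTE dedup") {
    withTempView("t") {
      Seq((0, 1), (1, 2)).toDF("c1", "c2").createOrReplaceTempView("t")
      val df = sql(
        """with v as (
          |  select c1, c2, rand() c3 from t
          |)
          |select * from v except select * from v""".stripMargin)
      // With the CTE materialized only once, both references see identical rows,
      // so EXCEPT must return nothing even though rand() is non-deterministic.
      checkAnswer(df, Nil)
    }
  }
}

// ...and are executed twice, with adaptive query execution disabled and enabled.
class CTEInlineSuiteAEOff extends CTEInlineSuiteBase with DisableAdaptiveExecutionSuite
class CTEInlineSuiteAEOn extends CTEInlineSuiteBase with EnableAdaptiveExecutionSuite
```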
// an Exchange reuse at runtime.
// TODO create a new identity partitioning instead of using RoundRobinPartitioning.
exchange.ShuffleExchangeExec(
  RoundRobinPartitioning(conf.numShufflePartitions),
RoundRobin sorts data before shuffling, right? That will slow things down a lot.
Yes, that's why we put a TODO there.
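For context on the cost question (an assumption about current behavior, not something stated in this PR): a round-robin shuffle sorts each partition locally before writing shuffle data when `spark.sql.execution.sortBeforeRepartition` is enabled, which is the default, so that the output stays deterministic across task retries.

```scala
// Assuming a running SparkSession named `spark`; shown only to point at the knob
// that controls the local sort mentioned above.
spark.conf.get("spark.sql.execution.sortBeforeRepartition") // "true" by default
```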
abstract class SubqueryExpression(
    plan: LogicalPlan,
-   outerAttrs: Seq[Expression],
+   val outerAttrs: Seq[Expression],
nit: this needs to touch all the subclasses, which is a bit messy. How about we just add a new method in this class?
def isCorrelated: Boolean = outerAttrs.nonEmpty
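A minimal sketch of that suggestion, with the constructor signature abbreviated (the real `SubqueryExpression` has more parameters and members):

```scala
import org.apache.spark.sql.catalyst.expressions.{ExprId, Expression, PlanExpression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

abstract class SubqueryExpression(
    plan: LogicalPlan,
    outerAttrs: Seq[Expression],
    exprId: ExprId) extends PlanExpression[LogicalPlan] {
  // Constructor parameters are visible inside the class body, so the helper works
  // without turning outerAttrs into a val on every subclass.
  def isCorrelated: Boolean = outerAttrs.nonEmpty
}
```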
cloud-fan left a comment:
LGTM except one minor comment
thanks, merging to master/3.2 (since it's a correctness fix)!
Closes #33671 from maryannxue/spark-36447.

Authored-by: Maryann Xue <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 29b1e39)
Signed-off-by: Wenchen Fan <[email protected]>
 * SELECT * FROM t
 * )
 * @param plan the plan to be traversed
 * @return the plan where CTE substitution is applied
As the return value of traverseAndSubstituteCTE has changed a bit, these docs could use an update.
Late LGTM. Just a very minor comment.
@maryannxue I find that after this change, the eventually expanded query now contains duplicated expression IDs. Is this expected? I am thinking of the assumption that expression IDs are globally unique.
From a follow-up PR that adds recursive queries and references this change:

### What changes were proposed in this pull request?
This PR adds the recursive query feature to Spark SQL. A recursive query is defined using the `WITH RECURSIVE` keywords and referring to the name of the common table expression within the query. The implementation complies with the SQL standard and follows similar rules to other relational databases:
- A query is made of an anchor followed by a recursive term.
- The anchor term doesn't contain a self reference and is used to initialize the query.
- The recursive term contains a self reference and is used to expand the current set of rows with new ones.
- The anchor and recursive terms must be joined with each other by `UNION` or `UNION ALL` operators.
- New rows can only be derived from the newly added rows of the previous iteration (or from the initial set of rows of the anchor term). This limitation implies that recursive references can't be used with some of the joins, aggregations or subqueries.

Please see `cte-recursive.sql` for some examples. The implementation has the same limitation that [SPARK-36447](https://issues.apache.org/jira/browse/SPARK-36447) / apache#33671 has:
> With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.

which means that recursive queries are not supported in SQL commands and DMLs. With apache#42036 this restriction is lifted and a recursive CTE only fails to work when the CTE is force-inlined (`spark.sql.legacy.inlineCTEInCommands=true` or the command is a multi-insert statement).

### Why are the changes needed?
Recursive query is an ANSI SQL feature that is useful to process hierarchical data.

### Does this PR introduce _any_ user-facing change?
Yes, adds the recursive query feature.

### How was this patch tested?
Added new UTs and tests in `cte-recursion.sql`.
What changes were proposed in this pull request?
This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances.
The major changes include:
1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined.
2. Changed the `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference.
3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined.
4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse.

A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.
Why are the changes needed?
This is a correctness issue. Non-deterministic CTEs should produce the same output regardless of how many times it is referenced/used in query, while under the current implementation there is no such guarantee and would lead to incorrect query results.
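As an illustration (a hypothetical query, not one of the PR's tests): with the CTE inlined into both references, `rand()` is evaluated independently on each side and the self-join below can lose rows non-deterministically; with the CTE materialized once, both references see identical values and all five rows survive.

```scala
// Assuming a running SparkSession named `spark`.
val df = spark.sql(
  """WITH v AS (SELECT id, rand() AS r FROM range(5))
    |SELECT * FROM v AS v1 JOIN v AS v2
    |ON v1.id = v2.id AND v1.r = v2.r
    |""".stripMargin)
df.count() // expected: 5 once the CTE is evaluated only once
```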
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added UTs.
Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files; the only differences are expression IDs.