[SPARK-43199][SQL] Make InlineCTE idempotent #40856

peter-toth · 2023-04-19T16:55:45Z

What changes were proposed in this pull request?

This PR fixes InlineCTE's idempotence. E.g. the following query:

WITH
  x(r) AS (SELECT random()),
  y(r) AS (SELECT * FROM x),
  z(r) AS (SELECT * FROM x)
SELECT * FROM z

currently breaks idempotence because, in the first round, the reference to x from y is counted when deciding not to inline x:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
 WithCTE                                                        WithCTE
 :- CTERelationDef 0, false                                     :- CTERelationDef 0, false
 :  +- Project [rand()#218 AS r#219]                            :  +- Project [rand()#218 AS r#219]
 :     +- Project [random(2957388522017368375) AS rand()#218]   :     +- Project [random(2957388522017368375) AS rand()#218]
 :        +- OneRowRelation                                     :        +- OneRowRelation
!:- CTERelationDef 1, false                                     +- Project [r#222]
!:  +- Project [r#219 AS r#221]                                    +- Project [r#220 AS r#222]
!:     +- Project [r#219]                                             +- Project [r#220]
!:        +- CTERelationRef 0, true, [r#219]                             +- CTERelationRef 0, true, [r#220]
!:- CTERelationDef 2, false                                     
!:  +- Project [r#220 AS r#222]                                 
!:     +- Project [r#220]                                       
!:        +- CTERelationRef 0, true, [r#220]                    
!+- Project [r#222]                                             
!   +- CTERelationRef 2, true, [r#222]

But in the next round we inline x because y was removed due to lack of references:

Once strategy's idempotence is broken for batch Inline CTE
!WithCTE                                                        Project [r#222]
!:- CTERelationDef 0, false                                     +- Project [r#220 AS r#222]
!:  +- Project [rand()#218 AS r#219]                               +- Project [r#220]
!:     +- Project [random(2957388522017368375) AS rand()#218]         +- Project [r#225 AS r#220]
!:        +- OneRowRelation                                              +- Project [rand()#218 AS r#225]
!+- Project [r#222]                                                         +- Project [random(2957388522017368375) AS rand()#218]
!   +- Project [r#220 AS r#222]                                                +- OneRowRelation
!      +- Project [r#220]                                       
!         +- CTERelationRef 0, true, [r#220]
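
The two traces above can be reproduced in a toy model. The sketch below is not Spark code: Cte, Plan, naiveCounts and inlineOnce are hypothetical names, and it assumes a rule that drops a CTE with no references, inlines a CTE with exactly one reference, and keeps anything else, which is how the non-deterministic x behaves in the traces.

```scala
object NaiveInlineSketch {
  // Toy model: a CTE is a name plus the names of the CTEs its body references.
  final case class Cte(name: String, refs: List[String])
  // A plan is the remaining CTE definitions plus the references from the main query.
  final case class Plan(ctes: List[Cte], mainRefs: List[String])

  // Naive incoming count: every reference counts, even one from an unused CTE.
  def naiveCounts(plan: Plan): Map[String, Int] = {
    val allRefs = plan.mainRefs ++ plan.ctes.flatMap(_.refs)
    plan.ctes.map(c => c.name -> allRefs.count(_ == c.name)).toMap
  }

  // One round of the toy rule (assumption: all CTEs are non-deterministic):
  // count 0 drops the definition, count 1 inlines it, a higher count keeps it.
  def inlineOnce(plan: Plan): Plan = {
    val counts = naiveCounts(plan)
    val byName = plan.ctes.map(c => c.name -> c).toMap
    val kept = plan.ctes.filter(c => counts(c.name) > 1)
    // A reference to an inlined CTE is replaced by that CTE's own references.
    def expand(ref: String): List[String] =
      if (counts(ref) > 1) List(ref) else byName(ref).refs.flatMap(expand)
    Plan(kept.map(c => c.copy(refs = c.refs.flatMap(expand))), plan.mainRefs.flatMap(expand))
  }

  // The query from the description: y and z both read x, the main query reads z.
  val query: Plan =
    Plan(List(Cte("x", Nil), Cte("y", List("x")), Cte("z", List("x"))), List("z"))
}
```

Applying inlineOnce to query keeps only the definition of x (its naive count is 2). Applying it a second time removes x as well, because y is gone and the count has dropped to 1, so the two results differ.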

Why are the changes needed?

We use InlineCTE as an idempotent rule in the Optimizer, CheckAnalysis and ProgressReporter.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new unit test.

peter-toth · 2023-04-20T07:14:23Z

cc @cloud-fan, @maryannxue

cloud-fan · 2023-04-20T14:01:35Z

@peter-toth can you briefly explain the idea of fixing it?

peter-toth · 2023-04-20T16:58:47Z

> @peter-toth can you briefly explain the idea of fixing it?

I've updated the PR recently; the main change is that the CTE accumulator map argument of buildCTEMap() changed from
mutable.HashMap.empty[Long, (CTERelationDef, Int)]
to
mutable.SortedMap.empty[Long, (CTERelationDef, Int, mutable.Map[Long, Int])].
The new mutable.Map[Long, Int] part tracks the outgoing references of a CTE: which other CTEs it references and how many times. (The old Int part still tracks the "count of incoming references".)

With this extended outer map we can correct the "count of incoming references" in cleanCTEMap(). We iterate over the CTEs in reverse order (which is why the outer map is now a SortedMap), and whenever we encounter a CTE whose "count of incoming references" is 0, we subtract its outgoing reference counts from the "count of incoming references" of each CTE it references.

To build the new inner map, buildCTEMap() has a new optional outerCTEId argument.
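
The correction pass described above can be sketched as follows. This is a simplified illustration, not the code from the PR: the CTERelationDef element is replaced by a plain name, and the Entry alias and the example() helper are hypothetical. The ids follow the plans in the description (x = 0, y = 1, z = 2).

```scala
import scala.collection.mutable

object CleanCteMapSketch {
  // Simplified stand-in for the map value in the PR:
  // (CTE name, count of incoming references, outgoing references: CTE id -> count)
  type Entry = (String, Int, mutable.Map[Long, Int])

  // Visit the CTEs from the highest id to the lowest. When a CTE has no incoming
  // references, subtract its outgoing references from the CTEs it points to.
  def cleanCTEMap(cteMap: mutable.SortedMap[Long, Entry]): Unit = {
    cteMap.keys.toSeq.reverse.foreach { currentId =>
      val (_, currentRefCount, outgoing) = cteMap(currentId)
      if (currentRefCount == 0) {
        outgoing.foreach { case (referencedId, uselessRefCount) =>
          val (name, refCount, refs) = cteMap(referencedId)
          cteMap(referencedId) = (name, refCount - uselessRefCount, refs)
        }
      }
    }
  }

  // Counts for the query in the PR description, before and after the correction.
  def example(): mutable.SortedMap[Long, Entry] = {
    val cteMap = mutable.SortedMap.empty[Long, Entry]
    cteMap(0L) = ("x", 2, mutable.Map.empty[Long, Int]) // referenced by y and z
    cteMap(1L) = ("y", 0, mutable.Map(0L -> 1))         // unreferenced, points to x
    cteMap(2L) = ("z", 1, mutable.Map(0L -> 1))         // referenced by the main query
    cleanCTEMap(cteMap)
    cteMap
  }
}
```

After the pass, the count for x drops from 2 to 1, so the decision about x no longer depends on whether the unreferenced y has already been removed.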

cloud-fan · 2023-04-21T09:07:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala

-      }
+        if (plan.containsPattern(CTE)) {
+          plan.children.foreach { child =>
+            buildCTEMap(child, cteMap, outerCTEId)


Is this duplicated? If plan is WithCTE, we should have already invoked buildCTEMap for the CTE relations in https://github.com/apache/spark/pull/40856/files#diff-1c15413e5d63f78fff1db3dec9df4a671e78b76d086104d81f4a967eb2800805R82

nvm, this is the case _ branch.

cloud-fan · 2023-04-21T09:21:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala

+      cteRefMap: mutable.SortedMap[Long, (CTERelationDef, Int, mutable.Map[Long, Int])]
+    ) = {
+    cteRefMap.keys.toSeq.reverse.foreach { currentCTEId =>
+      val (_, currentRefCount, refMap) = cteRefMap(currentCTEId)


So the idea is: if a CTE relation A is referenced by another CTE relation B, and relation B has no references, we should update relation A's reference count to exclude the references from relation B? Can we add a comment that writes this down?

peter-toth · 2023-04-21

Yes. Added scaladoc in ecbc4b8; let me know if it needs more detailed comments.

cloud-fan · 2023-04-21T09:31:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala

+   *               ids. The value of the map is tuple whose elements are:
+   *               - The CTE definition
+   *               - The number of incoming references to the CTE. This includes references from
+   *                 outer CTEs and regular places.


Suggested change

-   *                 outer CTEs and regular places.
+   *                 inner CTEs and regular places.

peter-toth · 2023-04-21

I actually wanted to write "other CTEs" rather than inner/outer. E.g. in

WITH cte1 AS (SELECT 1), cte2 AS (SELECT * FROM cte1) SELECT * FROM cte1 JOIN cte2 ON ...

the reference count of cte1 is 2: one reference comes from an "other" CTE (cte2, which is neither "inner" nor "outer"), and one comes from a "regular place" (the main query).
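
To make the counting in this example concrete, here is a minimal sketch. It is not the buildCTEMap() from the PR: addRef, Counts and example() are hypothetical names, the CTE definition is left out of the map value, and the ids 1 and 2 are assumed for cte1 and cte2. Only the outerCTEId idea is taken from the discussion above.

```scala
import scala.collection.mutable

object RefCountSketch {
  // (count of incoming references, outgoing references: referenced CTE id -> count)
  type Counts = (Int, mutable.Map[Long, Int])

  // Record one reference to `target`. `outerCTEId` is the CTE whose body contains
  // the reference, or None when it sits in the main query (a "regular place").
  def addRef(
      cteMap: mutable.SortedMap[Long, Counts],
      target: Long,
      outerCTEId: Option[Long]): Unit = {
    val (incoming, outgoing) = cteMap(target)
    cteMap(target) = (incoming + 1, outgoing)
    outerCTEId.foreach { outer =>
      val outerOutgoing = cteMap(outer)._2
      outerOutgoing(target) = outerOutgoing.getOrElse(target, 0) + 1
    }
  }

  def example(): mutable.SortedMap[Long, Counts] = {
    val cteMap = mutable.SortedMap.empty[Long, Counts]
    cteMap(1L) = (0, mutable.Map.empty[Long, Int]) // cte1
    cteMap(2L) = (0, mutable.Map.empty[Long, Int]) // cte2
    addRef(cteMap, target = 1L, outerCTEId = Some(2L)) // cte2's body selects from cte1
    addRef(cteMap, target = 1L, outerCTEId = None)     // main query: FROM cte1
    addRef(cteMap, target = 2L, outerCTEId = None)     // main query: JOIN cte2
    cteMap
  }
}
```

In this sketch cte1 ends up with an incoming count of 2 and cte2 with an incoming count of 1 plus one recorded outgoing reference to cte1.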

fixed in 806c2de

peter-toth · 2023-04-25

@cloud-fan, please let me know if anything else is needed here.

cloud-fan · 2023-04-26T07:32:06Z

thanks, merging to master!

peter-toth · 2023-04-26T07:38:27Z

Thanks @cloud-fan!

Closes #40856 from peter-toth/SPARK-43199-make-inlinecte-idempotent.

Authored-by: Peter Toth
Signed-off-by: Wenchen Fan

[SPARK-43199][SQL] Make InlineCTE idempotent

7dce656

github-actions bot added the SQL label Apr 19, 2023

peter-toth mentioned this pull request Apr 20, 2023

[SPARK-24497][SQL] Support recursive SQL #40744

Closed

improved version that stores the outgoing references and counts to other CTEs from a CTE (the previous version stored the incoming references and counts to a CTE)

8765bf7

use outerCTEId instead of outerRefMap

fccda08

cloud-fan reviewed Apr 21, 2023

cloud-fan reviewed Apr 21, 2023

add scaladoc, rename param

ecbc4b8

cloud-fan reviewed Apr 21, 2023

cloud-fan approved these changes Apr 21, 2023

fix

806c2de

cloud-fan approved these changes Apr 26, 2023

cloud-fan closed this in 8970415 Apr 26, 2023
