Skip to content

Conversation

@peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Apr 19, 2023

What changes were proposed in this pull request?

This PR fixes InlineCTE's idempotence. E.g. the following query:

WITH
  x(r) AS (SELECT random()),
  y(r) AS (SELECT * FROM x),
  z(r) AS (SELECT * FROM x)
SELECT * FROM z

currently breaks it because we take into account the reference to x from y when deciding about not inlining x in the first round:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
 WithCTE                                                        WithCTE
 :- CTERelationDef 0, false                                     :- CTERelationDef 0, false
 :  +- Project [rand()#218 AS r#219]                            :  +- Project [rand()#218 AS r#219]
 :     +- Project [random(2957388522017368375) AS rand()#218]   :     +- Project [random(2957388522017368375) AS rand()#218]
 :        +- OneRowRelation                                     :        +- OneRowRelation
!:- CTERelationDef 1, false                                     +- Project [r#222]
!:  +- Project [r#219 AS r#221]                                    +- Project [r#220 AS r#222]
!:     +- Project [r#219]                                             +- Project [r#220]
!:        +- CTERelationRef 0, true, [r#219]                             +- CTERelationRef 0, true, [r#220]
!:- CTERelationDef 2, false                                     
!:  +- Project [r#220 AS r#222]                                 
!:     +- Project [r#220]                                       
!:        +- CTERelationRef 0, true, [r#220]                    
!+- Project [r#222]                                             
!   +- CTERelationRef 2, true, [r#222]    

But in the next round we inline x because y was removed due to lack of references:

Once strategy's idempotence is broken for batch Inline CTE
!WithCTE                                                        Project [r#222]
!:- CTERelationDef 0, false                                     +- Project [r#220 AS r#222]
!:  +- Project [rand()#218 AS r#219]                               +- Project [r#220]
!:     +- Project [random(2957388522017368375) AS rand()#218]         +- Project [r#225 AS r#220]
!:        +- OneRowRelation                                              +- Project [rand()#218 AS r#225]
!+- Project [r#222]                                                         +- Project [random(2957388522017368375) AS rand()#218]
!   +- Project [r#220 AS r#222]                                                +- OneRowRelation
!      +- Project [r#220]                                       
!         +- CTERelationRef 0, true, [r#220]    

Why are the changes needed?

We use InlineCTE as an idempotent rule in the Optimizer, CheckAnalysis and ProgressReporter.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UT.

@github-actions github-actions bot added the SQL label Apr 19, 2023
@peter-toth
Copy link
Contributor Author

cc @cloud-fan, @maryannxue

@cloud-fan
Copy link
Contributor

@peter-toth can you briefly explain the idea of fixing it?

…her CTEs from a CTE (the previous version stored the incoming references and counts to a CTE)
@peter-toth
Copy link
Contributor Author

peter-toth commented Apr 20, 2023

@peter-toth can you briefly explain the idea of fixing it?

I've updated the PR recently, but the main change is that the CTE accumulator map argument of buildCTEMap() changed from
mutable.HashMap.empty[Long, (CTERelationDef, Int)]
to
mutable.SortedMap.empty[Long, (CTERelationDef, Int, mutable.Map[Long, Int])].
The new mutable.Map[Long, Int] part tracks where the references are pointing to from a CTE. (The old Int part tracks the "count of incoming references".)

Once we have this extended outer map we can correct the "count of incoming references" in cleanCTEMap(). We just need to iterate the CTEs in reverse order (that's why the outer map is now a SortedMap) and if we encounter a CTE whose "count of incoming references" is 0 then we decrease the referenced CTE's "count of incoming references".

To build the new inner map buildCTEMap() has a new outerCTEId optional argument.

}
if (plan.containsPattern(CTE)) {
plan.children.foreach { child =>
buildCTEMap(child, cteMap, outerCTEId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this duplicated? If plan is WithCTE, we should have already invoked buildCTEMap for CTE relations in https://github.com/apache/spark/pull/40856/files#diff-1c15413e5d63f78fff1db3dec9df4a671e78b76d086104d81f4a967eb2800805R82

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, this is the case _ branch.

cteRefMap: mutable.SortedMap[Long, (CTERelationDef, Int, mutable.Map[Long, Int])]
) = {
cteRefMap.keys.toSeq.reverse.foreach { currentCTEId =>
val (_, currentRefCount, refMap) = cteRefMap(currentCTEId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the idea is, if a CTE relation A is referenced by another CTE relation B, and relation B has no references, we should update the relation A reference count to exclude the references from relation B? Can we add comments to write it down?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Added scaladoc in ecbc4b8, let me know if it needs more detailed comments.

* ids. The value of the map is tuple whose elements are:
* - The CTE definition
* - The number of incoming references to the CTE. This includes references from
* outer CTEs and regular places.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* outer CTEs and regular places.
* inner CTEs and regular places.

Copy link
Contributor Author

@peter-toth peter-toth Apr 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually wanted to write other CTEs and not inner/outer. E.g. in

WITH (
  cte1 AS (SELECT 1),
  cte2 AS (SELECT * FROM cte1)
)
SELECT * FROM cte1 JOIN cte2 ON ...

the reference count of cte1 is 2. 1 is from an "other" CTE (but not "inner"/"outer") (and 1 is from a "regular place").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 806c2de

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan, please let me know if anything else is needed here.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8970415 Apr 26, 2023
@peter-toth
Copy link
Contributor Author

Thanks @cloud-fan!

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
### What changes were proposed in this pull request?
This PR fixes `InlineCTE`'s idempotence. E.g. the following query:
```
WITH
  x(r) AS (SELECT random()),
  y(r) AS (SELECT * FROM x),
  z(r) AS (SELECT * FROM x)
SELECT * FROM z
```
currently breaks it because we take into account the reference to `x` from `y` when deciding about not inlining `x` in the first round:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
 WithCTE                                                        WithCTE
 :- CTERelationDef 0, false                                     :- CTERelationDef 0, false
 :  +- Project [rand()apache#218 AS r#219]                            :  +- Project [rand()apache#218 AS r#219]
 :     +- Project [random(2957388522017368375) AS rand()apache#218]   :     +- Project [random(2957388522017368375) AS rand()apache#218]
 :        +- OneRowRelation                                     :        +- OneRowRelation
!:- CTERelationDef 1, false                                     +- Project [r#222]
!:  +- Project [r#219 AS r#221]                                    +- Project [r#220 AS r#222]
!:     +- Project [r#219]                                             +- Project [r#220]
!:        +- CTERelationRef 0, true, [r#219]                             +- CTERelationRef 0, true, [r#220]
!:- CTERelationDef 2, false
!:  +- Project [r#220 AS r#222]
!:     +- Project [r#220]
!:        +- CTERelationRef 0, true, [r#220]
!+- Project [r#222]
!   +- CTERelationRef 2, true, [r#222]
```
But in the next round we inline `x` because `y` was removed due to lack of references:
```
Once strategy's idempotence is broken for batch Inline CTE
!WithCTE                                                        Project [r#222]
!:- CTERelationDef 0, false                                     +- Project [r#220 AS r#222]
!:  +- Project [rand()apache#218 AS r#219]                               +- Project [r#220]
!:     +- Project [random(2957388522017368375) AS rand()apache#218]         +- Project [r#225 AS r#220]
!:        +- OneRowRelation                                              +- Project [rand()apache#218 AS r#225]
!+- Project [r#222]                                                         +- Project [random(2957388522017368375) AS rand()apache#218]
!   +- Project [r#220 AS r#222]                                                +- OneRowRelation
!      +- Project [r#220]
!         +- CTERelationRef 0, true, [r#220]
```

### Why are the changes needed?
We use `InlineCTE` as an idempotent rule in the `Optimizer`, `CheckAnalysis` and `ProgressReporter`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes apache#40856 from peter-toth/SPARK-43199-make-inlinecte-idempotent.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants