-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-43199][SQL] Make InlineCTE idempotent #40856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43199][SQL] Make InlineCTE idempotent #40856
Conversation
|
@peter-toth can you briefly explain the idea of fixing it? |
…her CTEs from a CTE (the previous version stored the incoming references and counts to a CTE)
I've updated the PR recently, but the main change is that the CTE accumulator map argument of Once we have this extended outer map we can correct the "count of incoming references" in To build the new inner map |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala
Show resolved
Hide resolved
| } | ||
| if (plan.containsPattern(CTE)) { | ||
| plan.children.foreach { child => | ||
| buildCTEMap(child, cteMap, outerCTEId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this duplicated? If plan is WithCTE, we should have already invoked buildCTEMap for CTE relations in https://github.com/apache/spark/pull/40856/files#diff-1c15413e5d63f78fff1db3dec9df4a671e78b76d086104d81f4a967eb2800805R82
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, this is the case _ branch.
| cteRefMap: mutable.SortedMap[Long, (CTERelationDef, Int, mutable.Map[Long, Int])] | ||
| ) = { | ||
| cteRefMap.keys.toSeq.reverse.foreach { currentCTEId => | ||
| val (_, currentRefCount, refMap) = cteRefMap(currentCTEId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the idea is, if a CTE relation A is referenced by another CTE relation B, and relation B has no references, we should update the relation A reference count to exclude the references from relation B? Can we add comments to write it down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Added scaladoc in ecbc4b8, let me know if it needs more detailed comments.
| * ids. The value of the map is tuple whose elements are: | ||
| * - The CTE definition | ||
| * - The number of incoming references to the CTE. This includes references from | ||
| * outer CTEs and regular places. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * outer CTEs and regular places. | |
| * inner CTEs and regular places. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually wanted to write other CTEs and not inner/outer. E.g. in
WITH (
cte1 AS (SELECT 1),
cte2 AS (SELECT * FROM cte1)
)
SELECT * FROM cte1 JOIN cte2 ON ...
the reference count of cte1 is 2. 1 is from an "other" CTE (but not "inner"/"outer") (and 1 is from a "regular place").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 806c2de
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan, please let me know if anything else is needed here.
|
thanks, merging to master! |
|
Thanks @cloud-fan! |
### What changes were proposed in this pull request? This PR fixes `InlineCTE`'s idempotence. E.g. the following query: ``` WITH x(r) AS (SELECT random()), y(r) AS (SELECT * FROM x), z(r) AS (SELECT * FROM x) SELECT * FROM z ``` currently breaks it because we take into account the reference to `x` from `y` when deciding about not inlining `x` in the first round: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE === WithCTE WithCTE :- CTERelationDef 0, false :- CTERelationDef 0, false : +- Project [rand()apache#218 AS r#219] : +- Project [rand()apache#218 AS r#219] : +- Project [random(2957388522017368375) AS rand()apache#218] : +- Project [random(2957388522017368375) AS rand()apache#218] : +- OneRowRelation : +- OneRowRelation !:- CTERelationDef 1, false +- Project [r#222] !: +- Project [r#219 AS r#221] +- Project [r#220 AS r#222] !: +- Project [r#219] +- Project [r#220] !: +- CTERelationRef 0, true, [r#219] +- CTERelationRef 0, true, [r#220] !:- CTERelationDef 2, false !: +- Project [r#220 AS r#222] !: +- Project [r#220] !: +- CTERelationRef 0, true, [r#220] !+- Project [r#222] ! +- CTERelationRef 2, true, [r#222] ``` But in the next round we inline `x` because `y` was removed due to lack of references: ``` Once strategy's idempotence is broken for batch Inline CTE !WithCTE Project [r#222] !:- CTERelationDef 0, false +- Project [r#220 AS r#222] !: +- Project [rand()apache#218 AS r#219] +- Project [r#220] !: +- Project [random(2957388522017368375) AS rand()apache#218] +- Project [r#225 AS r#220] !: +- OneRowRelation +- Project [rand()apache#218 AS r#225] !+- Project [r#222] +- Project [random(2957388522017368375) AS rand()apache#218] ! +- Project [r#220 AS r#222] +- OneRowRelation ! +- Project [r#220] ! +- CTERelationRef 0, true, [r#220] ``` ### Why are the changes needed? We use `InlineCTE` as an idempotent rule in the `Optimizer`, `CheckAnalysis` and `ProgressReporter`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new UT. Closes apache#40856 from peter-toth/SPARK-43199-make-inlinecte-idempotent. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR fixes
InlineCTE's idempotence. E.g. the following query:currently breaks it because we take into account the reference to
xfromywhen deciding about not inliningxin the first round:But in the next round we inline
xbecauseywas removed due to lack of references:Why are the changes needed?
We use
InlineCTEas an idempotent rule in theOptimizer,CheckAnalysisandProgressReporter.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new UT.