[SPARK-31607][SQL] Improve the perf of CTESubstitution #28407
Thank you always, @cloud-fan.
I like this approach and LGTM.
Test build #122074 has finished for PR 28407 at commit
Yea I thought about it as well. It's still doable if I change the map type to
    isLegacy: Boolean): Seq[(String, LogicalPlan)] = {
  val resolvedCTERelations = new mutable.ArrayBuffer[(String, LogicalPlan)](relations.size)
  for ((name, relation) <- relations) {
    val innerCTEResolved = if (isLegacy) {
@cloud-fan Just trying to understand. Does innerCTEResolved indicate an already resolved CTE, or the one we are going to resolve in the subsequent call to substituteCTE?
"resolved" here means the With is resolved inside this relation. The relation needs further processing to substitute UnresolvedRelation with the previous CTE relations.
The naming is not very accurate when legacy = true, but this probably doesn't matter.
@cloud-fan OK. sounds good.
      traverseAndSubstituteCTE(relation)
    }
    // CTE definition can reference a previous one
    resolvedCTERelations += (name -> substituteCTE(innerCTEResolved, resolvedCTERelations))
For the legacy case, innerCTEResolved might contain an inner WITH, but it seems substituteCTE doesn't remove WITH.
Then, in later substituteCTE calls, will we end up with some untouched WITHs in the final query plan?
The rule CTESubstitution runs in a batch many times (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L208-L212) so those Withs will be removed in the end because we substitute child here: https://github.com/apache/spark/pull/28407/files#diff-d0bfa3367c63988ad7cf33397e643e75R91
gatorsmile
left a comment
LGTM
      // In legacy mode, outer CTE relations take precedence, so substitute relations later.
      relation
    } else {
      // A CTE definition might contain an inner CTE that has priority, so traverse and
"has priority" -> "has a higher priority"
The last commit just updates a comment, and it already passes compilation. I'm merging to master, thanks for the review!
Test build #122126 has finished for PR 28407 at commit
What changes were proposed in this pull request?
In CTESubstitution, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.
Why are the changes needed?
Currently we will traverse the main query many times (if there are many CTE relations), which can be pretty slow if the main query is large.
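The idea described here (resolve the CTE definitions first, then walk the main plan a single time) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Plan`, `Table`, `Ref`, `Project`, `Join` and `CteSketch` are hypothetical stand-ins invented for illustration, not Spark's Catalyst classes, and the sketch ignores legacy mode and nested `With` nodes.

```scala
import scala.collection.mutable

// Toy plan nodes. These are hypothetical stand-ins, NOT Spark's LogicalPlan classes.
sealed trait Plan
case class Table(name: String) extends Plan          // a concrete table
case class Ref(name: String) extends Plan            // an unresolved relation name
case class Project(child: Plan) extends Plan
case class Join(left: Plan, right: Plan) extends Plan

object CteSketch {
  // One traversal of `plan`: every Ref that names a known CTE is replaced
  // by that CTE's (already resolved) definition.
  def substitute(plan: Plan, ctes: Seq[(String, Plan)]): Plan = plan match {
    case Ref(n)     => ctes.find(_._1 == n).map(_._2).getOrElse(plan)
    case Project(c) => Project(substitute(c, ctes))
    case Join(l, r) => Join(substitute(l, ctes), substitute(r, ctes))
    case t: Table   => t
  }

  // Resolve the definitions in order (a definition may reference an earlier one),
  // then traverse the main plan exactly once.
  def resolveThenSubstitute(main: Plan, relations: Seq[(String, Plan)]): Plan = {
    val resolved = mutable.ArrayBuffer.empty[(String, Plan)]
    for ((name, relation) <- relations) {
      resolved += ((name, substitute(relation, resolved.toSeq)))
    }
    substitute(main, resolved.toSeq)
  }
}
```

For example, with the definitions `a -> Table("t1")` and `b -> Project(Ref("a"))`, calling `CteSketch.resolveThenSubstitute(Join(Ref("a"), Project(Ref("b"))), relations)` walks the main `Join` once and replaces both references with their fully resolved definitions. The cost of the main-plan traversal then no longer grows with the number of CTE relations; only the (typically smaller) definitions are visited during resolution.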
Does this PR introduce any user-facing change?
No
How was this patch tested?
local perf test
The result before this patch
The result after this patch
About 2 times faster.