-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-37670][SQL] Support predicate pushdown and column pruning for de-duped CTEs #34929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #146315 has finished for PR 34929 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
| val (cteDef, refCount) = cteMap(ref.cteId) | ||
| val newRef = if (forceInline || shouldInline(cteDef, refCount)) { | ||
| if (shouldInline(cteDef, refCount)) { | ||
| if (ref.outputSet == cteDef.outputSet) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we compare output set, not output? is it possible that the ref and def have different output column order?
...ala/org/apache/spark/sql/catalyst/optimizer/PushdownPredicatesAndPruneColumnsForCTEDef.scala
Outdated
Show resolved
Hide resolved
...ala/org/apache/spark/sql/catalyst/optimizer/PushdownPredicatesAndPruneColumnsForCTEDef.scala
Outdated
Show resolved
Hide resolved
...ala/org/apache/spark/sql/catalyst/optimizer/PushdownPredicatesAndPruneColumnsForCTEDef.scala
Outdated
Show resolved
Hide resolved
...ala/org/apache/spark/sql/catalyst/optimizer/PushdownPredicatesAndPruneColumnsForCTEDef.scala
Show resolved
Hide resolved
...st/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceCTERefWithRepartition.scala
Outdated
Show resolved
Hide resolved
| traverseAndSubstituteCTE(relation, isCommand, cteDefs)._1 | ||
| } | ||
|
|
||
| if (cteDefs.length > lastCTEDefCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider #36146 as an alternative to substituting and changing accumulated cteDefs so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @peter-toth , this PR contains all the CTE bug fixes we have found so far internally. Can you rebase #36146 after this one gets merged if you think your fix is cleaner? thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR will be merged to 3.2 because of the fixes for the correctness bugs and performance regressions. It doesn't really improve the performance compared to Spark 3.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But can we merge it after #32298?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#32298 won't go to 3.2, right? If we really need a different CTE handling in master/3.3 for the merging scalar subqueries feature, we should still merge this PR first and make a followup PR to change CTE in master/3.3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I though you just made a typo and want to merge it 3.3+ only and maybe backport the bugfix parts to 3.2...
|
@maryannxue do you think we can merge #32298 as it is targeted to Spark 3.3 and rebease this PR on the top of that? cc @sigmod, @cloud-fan, @tgravescs |
|
thanks, merging to master/3.3/3.2! @peter-toth hopefully #32298 answers your question. I can work with you together to adjust your PR with the new CTE changes. |
…de-duped CTEs This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues: 1) Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to `WithCTE` and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP. 2) Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations. 3) Make CTE IDs more deterministic by starting from 0 for each query. Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues. No. Added UTs. Closes #34929 from maryannxue/cte-followup. Lead-authored-by: Maryann Xue <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 175e429) Signed-off-by: Wenchen Fan <[email protected]>
…de-duped CTEs This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues: 1) Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to `WithCTE` and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP. 2) Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations. 3) Make CTE IDs more deterministic by starting from 0 for each query. Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues. No. Added UTs. Closes #34929 from maryannxue/cte-followup. Lead-authored-by: Maryann Xue <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 175e429) Signed-off-by: Wenchen Fan <[email protected]>
Thanks, I will try to rebase mine today. |
…s an outer CTE ### What changes were proposed in this pull request? Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with #34929. This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions. ### Why are the changes needed? Small improvement + additional tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test case. Closes #36146 from peter-toth/SPARK-38404-nested-cte-references-outer-cte. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…terSuite ### What changes were proposed in this pull request? To remove unnecessary changes from `InjectRuntimeFilterSuite` after apache#32298. These are not needed after apache#34929 as the final optimized plan does'n contain any `WithCTE` nodes. ### Why are the changes needed? No need for those changes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test. Closes apache#36361 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up-2. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…terSuite To remove unnecessary changes from `InjectRuntimeFilterSuite` after #32298. These are not needed after #34929 as the final optimized plan does'n contain any `WithCTE` nodes. No need for those changes. No. Added new test. Closes #36361 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up-2. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit d05e01d) Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @maryannxue and @cloud-fan .
This seems to break branch-3.2 with two failures.
TPCDSV1_4_PlanStabilitySuite.check simplified (tpcds-v1.4/q4)
TPCDSV1_4_PlanStabilitySuite.check simplified (tpcds-v1.4/q5)
|
Here is a followup. |
| protected def planner = sparkSession.sessionState.planner | ||
|
|
||
| // The CTE map for the planner shared by the main query and all subqueries. | ||
| private val cteMap = mutable.HashMap.empty[Long, CTERelationDef] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This move (from QueryExecutor to the other location) could be the root cause of the UT failure in GitHub Action.
As I posted on #36815 , this PR seems to generate the golden answer files with the following correctly.
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
However, unfortunately, we are experiencing GitHub Action failures. In addition, the result is also different from the one when we run individual query.
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q4)"
Given that, this PR seems to introduce indeterministic logic in terms of expression IDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you take a look together with me, @maryannxue and @cloud-fan ?
If there is no easy solution, I'd like to recommend to revert this from branch-3.2 first because it has been broken for 2 months already. How do you want to proceed this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the result generation status.
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite" &> /dev/null
$ git status
On branch branch-3.2
Your branch is up to date with 'apache/branch-3.2'.
nothing to commit, working tree clean
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q4)" &> /dev/null
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q5)" &> /dev/null
$ git diff --stat
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q4/explain.txt | 238 ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt | 212 ++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------
2 files changed, 225 insertions(+), 225 deletions(-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out to be a bug in the plan stability test suite.
The test suite normalizes the expr IDs, by using regex "#\\d+L?".r to match the explain string. However, The exchange node has a special string arg id=#...: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala#L40
The regex can't distinguish between expr ID and exchange plan id, and may normalize the plan wrongly.
I'll try to fix it tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Thank you for your investigation, @cloud-fan .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like a general bug on all branches. In that case, do you know why only branch-3.2 is so flaky like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the plans are different between 3.2 and 3.3, then the plan ids are different and we may not trigger the bug.
|
As a backup, I created a reverting PR too. Since it's been two month already, we need to check if there is no other dependent PRs. |
|
Any updates, @cloud-fan ? We also can revert this first and land it back after fixing the root cause inside |
|
Just created: #36827 |
…s an outer CTE ### What changes were proposed in this pull request? Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with apache#34929. This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions. ### Why are the changes needed? Small improvement + additional tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test case. Closes apache#36146 from peter-toth/SPARK-38404-nested-cte-references-outer-cte. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…rences an outer CTE ### What changes were proposed in this pull request? Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with #34929. This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions. ### Why are the changes needed? Small improvement + additional tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test case. Closes #37760 from peter-toth/SPARK-38404-nested-cte-references-outer-cte-3.3. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…de-duped CTEs This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues: 1) Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to `WithCTE` and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP. 2) Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations. 3) Make CTE IDs more deterministic by starting from 0 for each query. Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues. No. Added UTs. Closes apache#34929 from maryannxue/cte-followup. Lead-authored-by: Maryann Xue <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 175e429) Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 1a35685) Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues:
WithCTEand lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP.Why are the changes needed?
Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added UTs.