Skip to content

Conversation

@maryannxue
Copy link
Contributor

@maryannxue maryannxue commented Dec 17, 2021

What changes were proposed in this pull request?

This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues:

  1. Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to WithCTE and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP.
  2. Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations.
  3. Make CTE IDs more deterministic by starting from 0 for each query.

Why are the changes needed?

Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added UTs.

@github-actions github-actions bot added the SQL label Dec 17, 2021
@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50788/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50788/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Test build #146315 has finished for PR 34929 at commit 841aa2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val (cteDef, refCount) = cteMap(ref.cteId)
val newRef = if (forceInline || shouldInline(cteDef, refCount)) {
if (shouldInline(cteDef, refCount)) {
if (ref.outputSet == cteDef.outputSet) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we compare output set, not output? is it possible that the ref and def have different output column order?

traverseAndSubstituteCTE(relation, isCommand, cteDefs)._1
}

if (cteDefs.length > lastCTEDefCount) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider #36146 as an alternative to substituting and changing accumulated cteDefs so far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @peter-toth , this PR contains all the CTE bug fixes we have found so far internally. Can you rebase #36146 after this one gets merged if you think your fix is cleaner? thanks!

Copy link
Contributor

@peter-toth peter-toth Apr 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rebase my #36146, no problem with that.

But I'm more concerned about #32298. This PR seems to contain a mix of improvements and bugfixes and changes a lot in CTE handling and conflicts with my PR. As mine is on the 3.3 whitelist do you think we can merge that first and rebase this on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR will be merged to 3.2 because of the fixes for the correctness bugs and performance regressions. It doesn't really improve the performance compared to Spark 3.1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But can we merge it after #32298?

Copy link
Contributor

@cloud-fan cloud-fan Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#32298 won't go to 3.2, right? If we really need a different CTE handling in master/3.3 for the merging scalar subqueries feature, we should still merge this PR first and make a followup PR to change CTE in master/3.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I though you just made a typo and want to merge it 3.3+ only and maybe backport the bugfix parts to 3.2...

@peter-toth
Copy link
Contributor

peter-toth commented Apr 12, 2022

@maryannxue do you think we can merge #32298 as it is targeted to Spark 3.3 and rebease this PR on the top of that?

cc @sigmod, @cloud-fan, @tgravescs

@cloud-fan
Copy link
Contributor

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.3/3.2!

@peter-toth hopefully #32298 answers your question. I can work with you together to adjust your PR with the new CTE changes.

@cloud-fan cloud-fan closed this in 175e429 Apr 19, 2022
cloud-fan added a commit that referenced this pull request Apr 19, 2022
…de-duped CTEs

This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues:
  1) Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to `WithCTE` and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP.
  2) Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations.
  3) Make CTE IDs more deterministic by starting from 0 for each query.

Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues.

No.

Added UTs.

Closes #34929 from maryannxue/cte-followup.

Lead-authored-by: Maryann Xue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 175e429)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 19, 2022
…de-duped CTEs

This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues:
  1) Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to `WithCTE` and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP.
  2) Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations.
  3) Make CTE IDs more deterministic by starting from 0 for each query.

Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues.

No.

Added UTs.

Closes #34929 from maryannxue/cte-followup.

Lead-authored-by: Maryann Xue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 175e429)
Signed-off-by: Wenchen Fan <[email protected]>
@peter-toth
Copy link
Contributor

@peter-toth hopefully #32298 answers your question. I can work with you together to adjust your PR with the new CTE changes.

Thanks, I will try to rebase mine today.

peter-toth added a commit to peter-toth/spark that referenced this pull request Apr 19, 2022
cloud-fan pushed a commit that referenced this pull request Apr 20, 2022
…s an outer CTE

### What changes were proposed in this pull request?
Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with #34929.
This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions.

### Why are the changes needed?
Small improvement + additional tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new test case.

Closes #36146 from peter-toth/SPARK-38404-nested-cte-references-outer-cte.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Lukas-Grasmann pushed a commit to Lukas-Grasmann/spark-fix-unresolved-aggregates-order that referenced this pull request Apr 27, 2022
…terSuite

### What changes were proposed in this pull request?
To remove unnecessary changes from `InjectRuntimeFilterSuite` after apache#32298. These are not needed after apache#34929 as the final optimized plan does'n contain any `WithCTE` nodes.

### Why are the changes needed?
No need for those changes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new test.

Closes apache#36361 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up-2.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Apr 27, 2022
…terSuite

To remove unnecessary changes from `InjectRuntimeFilterSuite` after #32298. These are not needed after #34929 as the final optimized plan does'n contain any `WithCTE` nodes.

No need for those changes.

No.

Added new test.

Closes #36361 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up-2.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d05e01d)
Signed-off-by: Wenchen Fan <[email protected]>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @maryannxue and @cloud-fan .

This seems to break branch-3.2 with two failures.

TPCDSV1_4_PlanStabilitySuite.check simplified (tpcds-v1.4/q4)
TPCDSV1_4_PlanStabilitySuite.check simplified (tpcds-v1.4/q5)

@dongjoon-hyun
Copy link
Member

protected def planner = sparkSession.sessionState.planner

// The CTE map for the planner shared by the main query and all subqueries.
private val cteMap = mutable.HashMap.empty[Long, CTERelationDef]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This move (from QueryExecutor to the other location) could be the root cause of the UT failure in GitHub Action.

As I posted on #36815 , this PR seems to generate the golden answer files with the following correctly.

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"

However, unfortunately, we are experiencing GitHub Action failures. In addition, the result is also different from the one when we run individual query.

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q4)"

Given that, this PR seems to introduce indeterministic logic in terms of expression IDs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you take a look together with me, @maryannxue and @cloud-fan ?

If there is no easy solution, I'd like to recommend to revert this from branch-3.2 first because it has been broken for 2 months already. How do you want to proceed this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the result generation status.

$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite" &> /dev/null
$ git status
On branch branch-3.2
Your branch is up to date with 'apache/branch-3.2'.

nothing to commit, working tree clean

$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q4)" &> /dev/null
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q5)" &> /dev/null
$ git diff --stat
 sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q4/explain.txt | 238 ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
 sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt | 212 ++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------
 2 files changed, 225 insertions(+), 225 deletions(-)

Copy link
Contributor

@cloud-fan cloud-fan Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out to be a bug in the plan stability test suite.

The test suite normalizes the expr IDs, by using regex "#\\d+L?".r to match the explain string. However, The exchange node has a special string arg id=#...: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala#L40

The regex can't distinguish between expr ID and exchange plan id, and may normalize the plan wrongly.

I'll try to fix it tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thank you for your investigation, @cloud-fan .

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like a general bug on all branches. In that case, do you know why only branch-3.2 is so flaky like this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the plans are different between 3.2 and 3.3, then the plan ids are different and we may not trigger the bug.

@dongjoon-hyun
Copy link
Member

As a backup, I created a reverting PR too. Since it's been two month already, we need to check if there is no other dependent PRs.

@dongjoon-hyun
Copy link
Member

Any updates, @cloud-fan ? We also can revert this first and land it back after fixing the root cause inside PlanStabilitySuite.

@cloud-fan
Copy link
Contributor

Just created: #36827

peter-toth added a commit to peter-toth/spark that referenced this pull request Sep 1, 2022
…s an outer CTE

### What changes were proposed in this pull request?
Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with apache#34929.
This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions.

### Why are the changes needed?
Small improvement + additional tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new test case.

Closes apache#36146 from peter-toth/SPARK-38404-nested-cte-references-outer-cte.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Sep 6, 2022
…rences an outer CTE

### What changes were proposed in this pull request?
Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with #34929.
This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions.

### Why are the changes needed?
Small improvement + additional tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new test case.

Closes #37760 from peter-toth/SPARK-38404-nested-cte-references-outer-cte-3.3.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…de-duped CTEs

This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues:
  1) Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to `WithCTE` and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP.
  2) Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations.
  3) Make CTE IDs more deterministic by starting from 0 for each query.

Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues.

No.

Added UTs.

Closes apache#34929 from maryannxue/cte-followup.

Lead-authored-by: Maryann Xue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 175e429)
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1a35685)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants