[SPARK-31607][SQL] Improve the perf of CTESubstitution #28407

cloud-fan · 2020-04-29T16:15:57Z

What changes were proposed in this pull request?

In CTESubstitution, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.
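The idea can be illustrated with a minimal, self-contained sketch. It uses a toy plan model rather than Spark's actual `LogicalPlan` classes; the names `Plan`, `Ref`, `Leaf`, `Join`, `substitute`, and `resolveAll` are all hypothetical and exist only for illustration. CTE definitions are resolved in order into a map, and the main plan is then traversed a single time.

```scala
// Toy plan model: illustrative stand-ins, NOT Spark's LogicalPlan classes.
object CteSketch {
  sealed trait Plan
  case class Ref(name: String) extends Plan            // unresolved relation reference
  case class Leaf(desc: String) extends Plan           // already-concrete relation
  case class Join(left: Plan, right: Plan) extends Plan

  // One traversal of `plan`, replacing every Ref whose name is found in `ctes`.
  def substitute(plan: Plan, ctes: Map[String, Plan]): Plan = plan match {
    case Ref(n)     => ctes.getOrElse(n, plan)
    case Join(l, r) => Join(substitute(l, ctes), substitute(r, ctes))
    case leaf       => leaf
  }

  // Resolve definitions in order (a later one may reference earlier ones),
  // then traverse the main plan exactly once.
  def resolveAll(defs: Seq[(String, Plan)], main: Plan): Plan = {
    val resolved = defs.foldLeft(Map.empty[String, Plan]) { case (acc, (name, p)) =>
      acc + (name -> substitute(p, acc))
    }
    substitute(main, resolved)
  }
}
```

In this sketch the cost of the main-plan traversal no longer grows with the number of CTE definitions; only the (usually small) definitions themselves are traversed once each.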

Why are the changes needed?

Currently the main query is traversed many times when there are many CTE relations, which can be quite slow if the main query is large.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Local perf test:

scala> :pa
// Entering paste mode (ctrl-D to finish)

def test(i: Int): Unit = 1.to(i).foreach { _ =>
  spark.sql("""
    with
    t1 as (select 1),
    t2 as (select 1),
    t3 as (select 1),
    t4 as (select 1),
    t5 as (select 1),
    t6 as (select 1),
    t7 as (select 1),
    t8 as (select 1),
    t9 as (select 1)
    select * from t1, t2, t3, t4, t5, t6, t7, t8, t9""").queryExecution.assertAnalyzed()
}

// Exiting paste mode, now interpreting.

test: (i: Int)Unit

scala> test(10000)

scala> println(org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent)

The result before this patch

Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            3328796344 / 3924576425                         10000 / 20000

The result after this patch

Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            1503085936 / 2091992092                         10000 / 20000

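About 2 times faster: total time drops from 3924576425 to 2091992092 (a ratio of roughly 1.9), and effective time from 3328796344 to 1503085936 (roughly 2.2).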

cloud-fan · 2020-04-29T16:21:19Z

cc @peter-toth @hvanhovell @HyukjinKwon

dongjoon-hyun · 2020-04-29T17:12:04Z

Thank you always, @cloud-fan.

peter-toth · 2020-04-29T17:39:24Z

I like this approach and LGTM.
Just a side note that due to its eager way of substitution it can also cause performance degradation with queries where a CTE is defined but never actually used. But the performance gain that the PR can bring with realistic queries is worth the change.

SparkQA · 2020-04-29T21:37:34Z

Test build #122074 has finished for PR 28407 at commit c906218.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-04-30T06:37:48Z

> Just a side note that due to its eager way of substitution it can also cause performance degradation with queries where a CTE is defined but never actually used.

Yea, I thought about it as well. It's still doable if I change the map type to Map[String, PlanHolder], where PlanHolder can lazily calculate the plan. However, I feel it's too rare to have CTE relations defined but not used, so it may not be worth it. Also, a CTE relation itself should not be very complex, so even if we do a substitution unnecessarily, it mostly doesn't matter.
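A rough sketch of what such a lazy holder could look like, assuming a by-name constructor argument and a `lazy val`. `PlanHolder` is only named in the comment above and is not part of the patch; its shape here, and the helper `countEvaluations`, are hypothetical.

```scala
object LazyHolderSketch {
  // Hypothetical holder: the by-name argument runs at most once, on first access.
  final class PlanHolder[A](compute: => A) {
    lazy val plan: A = compute
  }

  // Builds one holder per name, looks up only `used`, and reports how many
  // of the held computations actually ran.
  def countEvaluations(names: Seq[String], used: String): Int = {
    var evaluated = 0
    val holders: Map[String, PlanHolder[String]] =
      names.map(n => n -> new PlanHolder[String]({ evaluated += 1; s"plan for $n" })).toMap
    val first = holders(used).plan
    val second = holders(used).plan // second access reuses the cached value
    require(first == second)
    evaluated
  }
}
```

With this shape, CTE relations that are defined but never referenced would never pay for substitution, at the cost of an extra wrapper per relation.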

dilipbiswal · 2020-04-30T07:29:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

+      isLegacy: Boolean): Seq[(String, LogicalPlan)] = {
+    val resolvedCTERelations = new mutable.ArrayBuffer[(String, LogicalPlan)](relations.size)
+    for ((name, relation) <- relations) {
+      val innerCTEResolved = if (isLegacy) {


@cloud-fan Just trying to understand: does innerCTEResolved indicate an already resolved CTE, or the one we are going to resolve in the subsequent call to substituteCTE?

"resolved" here means the With is resolved inside this relation. The relation needs further processing to substitute UnresolvedRelation with the previous CTE relations.

The naming is not very accurate when legacy = true, but this probably doesn't matter.

@cloud-fan OK. sounds good.

viirya · 2020-04-30T07:57:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

+        traverseAndSubstituteCTE(relation)
+      }
+      // CTE definition can reference a previous one
+      resolvedCTERelations += (name -> substituteCTE(innerCTEResolved, resolvedCTERelations))


For the legacy case, innerCTEResolved might contain an inner WITH, but it seems substituteCTE doesn't remove WITH.

Then, in later substituteCTE calls, will we end up with some untouched WITHs in the final query plan?

The rule CTESubstitution runs many times in a batch (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L208-L212), so those Withs will be removed in the end, because we substitute the child here: https://github.com/apache/spark/pull/28407/files#diff-d0bfa3367c63988ad7cf33397e643e75R91
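To illustrate why repeated application eventually removes leftover nodes, here is a toy sketch of re-running a rule until the plan stops changing. It does not use Spark's `RuleExecutor`; the types `Plan`, `With`, `Leaf` and the functions `rule` and `runToFixedPoint` are hypothetical stand-ins, and the toy rule deliberately strips only one level per pass.

```scala
object FixedPointSketch {
  sealed trait Plan
  case class With(child: Plan) extends Plan   // stand-in for a not-yet-substituted WITH node
  case class Leaf(desc: String) extends Plan

  // A toy rule that strips only the outermost With per invocation.
  def rule(plan: Plan): Plan = plan match {
    case With(child) => child
    case other       => other
  }

  // Re-apply `rule` until the plan stops changing, running it at most maxIterations times.
  @annotation.tailrec
  def runToFixedPoint(plan: Plan, maxIterations: Int = 100): Plan = {
    val next = rule(plan)
    if (next == plan || maxIterations <= 1) next
    else runToFixedPoint(next, maxIterations - 1)
  }
}
```

A nested `With(With(Leaf(...)))` survives a single pass of this rule but is fully unwrapped once the loop reaches a fixed point.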

gatorsmile

LGTM

gatorsmile · 2020-04-30T07:53:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

+        // In legacy mode, outer CTE relations take precedence, so substitute relations later.
+        relation
+      } else {
+        // A CTE definition might contain an inner CTE that has priority, so traverse and


"has priority" -> "has a higher priority"

cloud-fan · 2020-04-30T12:11:16Z

The last commit just updates comments, and it already passes compilation.

I'm merging to master, thanks for the review!

SparkQA · 2020-04-30T14:05:29Z

Test build #122126 has finished for PR 28407 at commit b022e35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

fix perf regression in CTESubstitution

c906218

probot-autolabeler bot added the SQL label Apr 29, 2020

cloud-fan changed the title ~~[SPARK-31607][SQL] Fix perf regression in CTESubstitution~~ [SPARK-31607][SQL] Improve the perf of CTESubstitution Apr 29, 2020

dilipbiswal reviewed Apr 30, 2020

viirya approved these changes Apr 30, 2020

viirya reviewed Apr 30, 2020

gatorsmile approved these changes Apr 30, 2020

update comments

b022e35

cloud-fan closed this in 636119c Apr 30, 2020
