Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Apr 29, 2020

What changes were proposed in this pull request?

In CTESubstitution, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.

Why are the changes needed?

Currently we will traverse the main query many times (if there are many CTE relations), which can be pretty slow if the main query is large.

Does this PR introduce any user-facing change?

No

How was this patch tested?

local perf test

scala> :pa
// Entering paste mode (ctrl-D to finish)

def test(i: Int): Unit = 1.to(i).foreach { _ =>
  spark.sql("""
    with
    t1 as (select 1),
    t2 as (select 1),
    t3 as (select 1),
    t4 as (select 1),
    t5 as (select 1),
    t6 as (select 1),
    t7 as (select 1),
    t8 as (select 1),
    t9 as (select 1)
    select * from t1, t2, t3, t4, t5, t6, t7, t8, t9""").queryExecution.assertAnalyzed()
}

// Exiting paste mode, now interpreting.

test: (i: Int)Unit

scala> test(10000)

scala> println(org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent)

The result before this patch

Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            3328796344 / 3924576425                         10000 / 20000

The result after this patch

Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            1503085936 / 2091992092                         10000 / 20000

About 2 times faster.

@cloud-fan cloud-fan changed the title [SPARK-31607][SQL] Fix perf regression in CTESubstitution [SPARK-31607][SQL] Improve the perf of CTESubstitution Apr 29, 2020
@cloud-fan
Copy link
Contributor Author

@dongjoon-hyun
Copy link
Member

Thank you always, @cloud-fan .

@peter-toth
Copy link
Contributor

I like this approach and LGTM.
Just a side note that due to its eager way of substitution it can also cause performance degradation with queries where a CTE is defined but never actually used. But the performance gain that the PR can bring with realistic queries is worth the change.

@SparkQA
Copy link

SparkQA commented Apr 29, 2020

Test build #122074 has finished for PR 28407 at commit c906218.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Apr 30, 2020

Just a side note that due to its eager way of substitution it can also cause performance degradation with queries where a CTE is defined but never actually used.

Yea I thought about it as well. It's still doable if I change the map type to Map[String, PlanHolder] where PlanHolder can lazily calculate the plan. However, I feel it's too rare to have CTE relations defined but not used, and may not worth it. And CTE relation itself should not be very complex, so even if we do a substitution unnecessarily, mostly it doesn't matter.

isLegacy: Boolean): Seq[(String, LogicalPlan)] = {
val resolvedCTERelations = new mutable.ArrayBuffer[(String, LogicalPlan)](relations.size)
for ((name, relation) <- relations) {
val innerCTEResolved = if (isLegacy) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Just trying to understand. innerCTEResolved indicates a already resolved CTE or the one we are going to resolve in the subsequent call to substituteCTE ?

Copy link
Contributor Author

@cloud-fan cloud-fan Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"resolved" here means the With is resolved inside this relation. The relation needs further processing to substitute UnresolvedRelation with the previous CTE relations.

The naming is not very accurate when legacy = true, but this probably doesn't matter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan OK. sounds good.

traverseAndSubstituteCTE(relation)
}
// CTE definition can reference a previous one
resolvedCTERelations += (name -> substituteCTE(innerCTEResolved, resolvedCTERelations))
Copy link
Member

@viirya viirya Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For legacy case, innerCTEResolved might contain an inner WITH, but seems substituteCTE doesn't remove WITH.

Then in later substituteCTEs, will we result some untouched WITHs in the final query plan ?

Copy link
Contributor

@peter-toth peter-toth Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// In legacy mode, outer CTE relations take precedence, so substitute relations later.
relation
} else {
// A CTE definition might contain an inner CTE that has priority, so traverse and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"has priority" -> "has a higher priority"

@cloud-fan
Copy link
Contributor Author

the last commit just updates comment, and it already passes compilation.

I'm merging to master, thanks for review!

@cloud-fan cloud-fan closed this in 636119c Apr 30, 2020
@SparkQA
Copy link

SparkQA commented Apr 30, 2020

Test build #122126 has finished for PR 28407 at commit b022e35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants