[SPARK-20392][SQL] Set barrier to prevent re-entering a tree #17770
Conversation
Test build #76174 has finished for PR 17770 at commit
```scala
case class Barrier(node: Option[TreeNode[_]] = None)
```
Why not just create a logical plan node and override the transformUp/transformDown functions?
My original thought was: if we use a barrier node, we need to modify many places where we create a new logical plan and wrap it with the barrier node.

I will revamp it with a barrier node.
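For reference, a minimal sketch of what such a barrier node could look like, assuming it lives alongside the other catalyst logical operators (the exact node added by this PR may differ):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

// Sketch only: a wrapper that hides an already-analyzed subtree. Because it
// is a LeafNode, transformUp/transformDown never recurse into `child`, so
// analyzer rules cannot re-enter the analyzed plan beneath the barrier.
case class AnalysisBarrier(child: LogicalPlan) extends LeafNode {
  // Expose the child's schema so the barrier is transparent to resolution.
  override def output: Seq[Attribute] = child.output
}
```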
Test build #76182 has finished for PR 17770 at commit
Can we fix the description? It is really confusing since it uses the word exchange. Also, can we just skip a plan in transform if it is already resolved?
@hvanhovell @rxin Thanks for the comments. OK. Based on your suggestion, it looks like we have two options:

May I ask which one is preferred?
Option 2 may not work because transformUp/transformDown is still used in the Optimizer.
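To spell out the concern, here is a hedged sketch of an optimizer-style rule (the rule itself is made up): any node that disabled transformDown on itself would also block rewrites like this during optimization, not just during analysis.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Illustrative rule: drop filters whose condition is literally true.
// Optimizer rules are written in terms of transformDown/transformUp, so a
// node that overrode those methods to no-ops would hide its whole subtree
// from rules like this one as well.
object RemoveTrivialFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    case Filter(Literal(true, BooleanType), child) => child
  }
}
```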
```diff
   * `fromRow` method later.
   */
- private val boundEnc =
+ private lazy val boundEnc =
```
We can't make boundEnc a lazy val because we need an early exception when the encoder can't be resolved.
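A self-contained sketch of the timing difference being argued here (names are illustrative, not Spark internals):

```scala
// Illustrative: `resolve` stands in for encoder resolution that may fail.
class EagerHolder(resolve: () => Int) {
  private val bound = resolve()      // fails fast, at construction time
  def use(): Int = bound
}

class LazyHolder(resolve: () => Int) {
  private lazy val bound = resolve() // defers the failure to first access
  def use(): Int = bound
}

object LazyValDemo extends App {
  val failing: () => Int = () => throw new IllegalStateException("unresolved")
  // new EagerHolder(failing)           // would throw right here
  val holder = new LazyHolder(failing)  // constructs fine...
  // holder.use()                       // ...and only throws when first used
}
```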
For self-join de-duplication, we only set the barrier for the left side.
I am wondering if we should check whether there's duplication between the right and left sides, and decide based on that whether to use a barrier for the right side.
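For context, a hedged illustration of the self-join case under discussion, assuming an active SparkSession named spark:

```scala
// Both sides of the join originate from the same analyzed plan, so their
// attributes initially share expression IDs. The analyzer de-duplicates one
// side; a barrier on both sides could prevent that rewrite, which is
// presumably why only the left side is wrapped here.
val df = spark.range(10).toDF("key")
val joined = df.join(df, "key")
joined.explain(true) // the analyzed plan shows de-duplicated attribute IDs
```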
Test build #76219 has finished for PR 17770 at commit

Test build #76222 has finished for PR 17770 at commit
@hvanhovell @rxin I've updated this accordingly. Do you have more comments on this? Thanks.

Test build #76324 has finished for PR 17770 at commit

Test build #76326 has finished for PR 17770 at commit
cc @cloud-fan

we can use

There are many

after we have this, can we remove the

It is possible as I think

I think we should have a single way to stop analyzing already analyzed plans. We should either apply

Using

Is the analysis barrier applicable for all the cases?

Currently I think it can cover all cases of
retest this please.

Test build #77295 has finished for PR 17770 at commit

Test build #77331 has finished for PR 17770 at commit

Test build #77332 has finished for PR 17770 at commit

@cloud-fan @gatorsmile Please let me know if you have more comments on this change. Thanks.
```diff
   */
  private val boundEnc =
-   exprEnc.resolveAndBind(logicalPlan.output, sparkSession.sessionState.analyzer)
+   exprEnc.resolveAndBind(planWithBarrier.output, sparkSession.sessionState.analyzer)
```
I think we should only use planWithBarrier where necessary; this place is obviously unnecessary.
```diff
      s"New column names (${colNames.size}): " + colNames.mkString(", "))

-   val newCols = logicalPlan.output.zip(colNames).map { case (oldAttribute, newName) =>
+   val newCols = planWithBarrier.output.zip(colNames).map { case (oldAttribute, newName) =>
```
ditto
```diff
  @Experimental
  @InterfaceStability.Evolving
- def isStreaming: Boolean = logicalPlan.isStreaming
+ def isStreaming: Boolean = planWithBarrier.isStreaming
```
ditto
```diff
    sparkSession,
    LogicalRDD(
-     logicalPlan.output,
+     planWithBarrier.output,
```
ditto
```diff
  val doubleRepartitioned = testData.repartition(10).repartition(20).coalesce(5)
  def countRepartitions(plan: LogicalPlan): Int = plan.collect { case r: Repartition => r }.length
- assert(countRepartitions(doubleRepartitioned.queryExecution.logical) === 3)
+ assert(countRepartitions(doubleRepartitioned.queryExecution.analyzed) === 3)
```
unnecessary change?
queryExecution.logical is the raw logical plan, before the analysis barriers are eliminated. The extra barrier nodes make this test fail on the raw plan.
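A short sketch of the distinction, for anyone reading along (output shapes are illustrative):

```scala
// Illustrative: compare the two plans. The raw plan keeps whatever the DSL
// built, barrier wrappers included; the analyzed plan has them eliminated,
// so node-counting assertions are only stable against the analyzed plan.
val qe = doubleRepartitioned.queryExecution
println(qe.logical.treeString)  // raw plan, barrier nodes still present
println(qe.analyzed.treeString) // after analysis, barriers removed
```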
ah i see
LGTM except some minor comments

Test build #77400 has finished for PR 17770 at commit

thanks, merging to master!

I think this will cause conflicts if we backport new PRs to branch 2.2. @viirya can you send a new PR to backport it to branch 2.2? Thanks!
Whoa, I do not think we should backport a large change to the inner workings of the analyzer.

Ok, I won't backport this until we reach a consensus.

Hi @viirya, as this PR already missed the Spark 2.2 release, I'd like to revert it and re-merge it at the end of Spark 2.3, so that future analyzer-related PRs won't get conflicted when backporting to 2.2. I'm pretty sorry about this; I'll mark it as a blocker for Spark 2.3 so that we don't forget. What do you think?

@cloud-fan Ok. No problem for me. Thanks.

reverted, thanks for your understanding!

Hi @viirya, since it's close to Spark 2.3, would you like to reopen this PR? Thanks!

@cloud-fan Sure. It seems there is no option to reopen it since it was merged before. Should I create another PR for it?

yea, a new PR sounds good, thanks!
What changes were proposed in this pull request?

It is reported that there is a performance regression when applying an ML pipeline to a dataset with many columns but few rows.

A big part of the regression comes from operations (e.g., select) on DataFrame/Dataset that re-create a new DataFrame/Dataset with a new LogicalPlan. In typical SQL usage this cost can be ignored. However, it is not rare to chain dozens of pipeline stages in ML, and as the query plan grows incrementally while running those stages, the total cost of re-creating DataFrames grows too. In particular, the Analyzer walks the entire big query plan even though most of it is already analyzed.

By eliminating part of this cost, the time to run the example code locally is reduced from about 1 min to about 30 secs.

In particular, the time spent applying the pipeline locally is mostly in calling transform on the 137 Bucketizers. Before the change, each call to a Bucketizer's transform can cost about 0.4 sec, so the total time spent on all Bucketizers' transform is about 50 secs. After the change, each call only costs about 0.1 sec. We also make boundEnc a lazy val to reduce unnecessary running time.
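To make the growth concrete, here is a hedged sketch of the access pattern that triggers the problem (the column names and loop bound are illustrative, assuming an active SparkSession named spark):

```scala
// Illustrative only: each transform-like step derives a new Dataset whose
// logical plan contains the previous plan as a subtree. Without a barrier,
// the analyzer re-walks the entire accumulated plan at every step, so the
// total analysis work grows roughly quadratically with the stage count.
var df = spark.range(100).toDF("c0")
for (i <- 1 to 137) { // ~ the 137 Bucketizers from the report
  df = df.withColumn(s"c$i", df("c0") + i) // plan depth grows by one per step
}
```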
Performance improvement

The code and datasets provided by Barry Becker to reproduce this issue and benchmark it can be found on the JIRA.

Before this patch: about 1 min
After this patch: about 20 secs
How was this patch tested?
Existing tests.