Conversation
@viirya viirya commented Apr 26, 2017

What changes were proposed in this pull request?

It has been reported that there is a performance regression when applying an ML pipeline to a dataset with many columns but few rows.

A large part of the regression comes from operations (e.g., select) on DataFrame/Dataset that re-create a new DataFrame/Dataset with a new LogicalPlan. In normal SQL usage this cost is negligible.

However, it is not rare to chain dozens of pipeline stages in ML. As the query plan grows incrementally while running those stages, the total cost of re-creating DataFrames grows too. In particular, the Analyzer walks the whole query plan even though most of it is already analyzed.

By eliminating part of this cost, the time to run the example code locally is reduced from about 1 min to about 30 secs.

In particular, the time spent applying the pipeline locally is mostly spent calling transform on the 137 Bucketizers. Before the change, each call to a Bucketizer's transform could cost about 0.4 sec, so the total time spent in all the Bucketizers' transform calls was about 50 secs. After the change, each call costs only about 0.1 sec.
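As a rough illustration of why the cost compounds, here is a self-contained toy model (not Spark code; all names are illustrative): if each pipeline stage adds one node on top of the previous plan and a naive analyzer re-walks the whole accumulated plan at every stage, the total analysis work grows quadratically in the number of stages.

```scala
// Toy model of the compounding analysis cost (not Spark code): each "stage"
// adds one node to the plan, and a naive analyzer re-visits every node of
// the accumulated plan on every stage.
case class Plan(depth: Int)            // stand-in for a logical plan
def analyze(p: Plan): Int = p.depth    // "work" = number of nodes visited

var plan = Plan(1)
var totalWork = 0
for (_ <- 1 to 137) {                  // 137 stages, as with the Bucketizers
  plan = Plan(plan.depth + 1)          // each stage adds one node
  totalWork += analyze(plan)          // analyzer walks the whole plan
}
println(totalWork)                     // 2 + 3 + … + 138 = 9590
```

With a barrier shielding the already-analyzed prefix, the analyzer would only visit the newly added part at each stage, making the total work linear in the number of stages instead.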

We also make boundEnc a lazy val to reduce unnecessary running time.

Performance improvement

The code and datasets provided by Barry Becker to reproduce this issue and benchmark it can be found on the JIRA.

Before this patch: about 1 min
After this patch: about 20 secs

How was this patch tested?

Existing tests.


SparkQA commented Apr 26, 2017

Test build #76174 has finished for PR 17770 at commit fe44832.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Barrier(node: Option[TreeNode[_]] = None)

case class Barrier(node: Option[TreeNode[_]] = None)
Contributor

Why not just create a logical plan node and override the transformUp/transformDown functions?

Member Author

My original thought was that if we use a barrier node, we need to modify the many places where we create a new logical plan so that they wrap it with the barrier node.

I will revamp it with a barrier node.
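The barrier-node idea can be sketched with a minimal, self-contained toy tree (not Spark's TreeNode; the names are illustrative): transformUp normally visits every node bottom-up, but returns immediately when it hits a Barrier, so an already-analyzed subtree is never re-entered.

```scala
// Toy sketch of a barrier node (illustrative, not Spark's implementation).
sealed trait Node {
  // Bottom-up transform that refuses to re-enter a Barrier-wrapped subtree.
  def transformUp(rule: PartialFunction[Node, Node]): Node = this match {
    case b: Barrier => b               // barrier: skip the wrapped subtree
    case Op(name, children) =>
      val rebuilt = Op(name, children.map(_.transformUp(rule)))
      rule.applyOrElse(rebuilt, identity[Node])
  }
}
case class Op(name: String, children: Seq[Node] = Nil) extends Node
case class Barrier(child: Node) extends Node

val plan = Op("Project", Seq(Barrier(Op("Filter", Seq(Op("Relation"))))))
var visited = 0
plan.transformUp { case n => visited += 1; n }
println(visited)                       // 1: only the outer Project is visited
```

The AnalysisBarrier that this PR eventually introduces (case class AnalysisBarrier(child: LogicalPlan) extends UnaryNode) follows the same principle, with the barrier removed from the plan once analysis finishes.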

SparkQA commented Apr 26, 2017

Test build #76182 has finished for PR 17770 at commit 82978d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Barrier(node: Option[TreeNode[_]] = None)

rxin commented Apr 26, 2017

Can we fix the description? It is really confusing since it uses the word exchange. Also, can we just skip a plan in transform if it is already resolved?

viirya commented Apr 26, 2017

@hvanhovell @rxin Thanks for the comments.

OK. Based on your suggestions, it looks like we have two options:

  1. Create a logical plan node and override its transformUp/transformDown
  2. Override transformUp/transformDown in LogicalPlan and skip resolved plans

May I ask which one is preferred?

viirya commented Apr 27, 2017

Option 2 may not work because transformUp/transformDown is still used in the Optimizer.

 * `fromRow` method later.
 */
-private val boundEnc =
+private lazy val boundEnc =
Member Author

We can't make boundEnc a lazy val because we need an early exception when the encoder can't be resolved.
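The trade-off here can be shown with a self-contained sketch (illustrative names, not the Dataset code): a strict val is evaluated at construction time, so a resolution failure surfaces immediately, while a lazy val defers both the work and the failure until first access.

```scala
// Sketch of exception timing for strict vs lazy vals (illustrative names).
class StrictHolder {
  val enc: Int = sys.error("encoder cannot be resolved") // fails in constructor
}
class LazyHolder {
  lazy val enc: Int = sys.error("encoder cannot be resolved") // fails on first use
}

val failsEarly =
  try { new StrictHolder; false } catch { case _: RuntimeException => true }
val holder = new LazyHolder          // construction succeeds
val failsLate =
  try { holder.enc; false } catch { case _: RuntimeException => true }
println(s"$failsEarly $failsLate")   // true true
```

Keeping boundEnc strict keeps the failure at Dataset creation time, which is easier to diagnose than a failure deep inside a later action.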

Member Author

For self-join de-duplication, we only set the barrier on the left side.

Member Author

I am wondering if we should check whether there is duplication between the left and right sides, and use that to decide whether to set the barrier on the right side.

SparkQA commented Apr 27, 2017

Test build #76219 has finished for PR 17770 at commit 81db205.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AnalysisBarrier(child: LogicalPlan) extends UnaryNode

SparkQA commented Apr 27, 2017

Test build #76222 has finished for PR 17770 at commit 24905e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AnalysisBarrier(child: LogicalPlan) extends UnaryNode

viirya commented Apr 29, 2017

@hvanhovell @rxin I've updated this accordingly. Do you have more comments on this? Thanks.

@viirya changed the title from "[SPARK-20392][SQL][WIP] Set barrier to prevent re-entering a tree" to "[SPARK-20392][SQL] Set barrier to prevent re-entering a tree" on Apr 30, 2017

SparkQA commented Apr 30, 2017

Test build #76324 has finished for PR 17770 at commit a76f225.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 30, 2017

Test build #76326 has finished for PR 17770 at commit e15b001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya commented May 1, 2017

cc @cloud-fan

@cloud-fan

We can use resolveOperators to avoid transforming already resolved plans, but I think we still need to traverse the plan tree for the Optimizer.

viirya commented May 1, 2017

There are many transform/transformUp/transformDown usages across the Analyzer that still traverse resolved plans. We insert this analysis barrier when we are confident that the wrapped plan is resolved and shouldn't be traversed again.

@cloud-fan

After we have this, can we remove resolveOperators?

viirya commented May 2, 2017

It should be possible, as I think resolveOperators works the same as this analysis barrier plus transformUp. However, resolveOperators is widely used now, so there may be no urgent need to remove it.

@cloud-fan

I think we should have a single way to stop analyzing already analyzed plans: either apply resolveOperators more widely, or switch to this analysis barrier.

@gatorsmile

Using resolveOperators looks cleaner to me.

viirya commented May 2, 2017

resolveOperators can't cover all usages of this analysis barrier. There are also rules in the analyzer that can't be replaced with resolveOperators.

@cloud-fan

Is the analysis barrier applicable to all the cases?

viirya commented May 2, 2017

Currently I think it can cover all cases of resolveOperators. It is also more flexible: for example, we can choose to wrap only a specific sub-plan in the barrier. (Though this doesn't really differentiate it, as we can set _analyzed on a sub-plan too.)

resolveOperators actually transforms bottom-up. The analysis barrier works with both transformUp and transformDown.
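For comparison, here is a toy sketch of the resolveOperators approach (not Spark's implementation; names are illustrative): instead of a wrapper node, each node carries an analyzed flag, and the bottom-up traversal skips subtrees already marked analyzed.

```scala
// Toy sketch of flag-based skipping, the resolveOperators idea (illustrative).
case class TNode(name: String, children: Seq[TNode] = Nil, analyzed: Boolean = false) {
  // Bottom-up transform that skips subtrees already marked as analyzed.
  def resolveOperators(rule: PartialFunction[TNode, TNode]): TNode =
    if (analyzed) this                 // already analyzed: do not re-visit
    else {
      val rebuilt = copy(children = children.map(_.resolveOperators(rule)))
      rule.applyOrElse(rebuilt, identity[TNode])
    }
}

val tree = TNode("Project", Seq(TNode("Filter", analyzed = true)))
var visits = 0
tree.resolveOperators { case n => visits += 1; n }
println(visits)                        // 1: the analyzed Filter is skipped
```

Functionally this matches barrier + transformUp; the barrier is just more explicit about which sub-plan is shielded, whereas the flag lives on every node.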

viirya commented May 24, 2017

retest this please.

SparkQA commented May 24, 2017

Test build #77295 has finished for PR 17770 at commit cba784b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 25, 2017

Test build #77331 has finished for PR 17770 at commit b82b018.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 25, 2017

Test build #77332 has finished for PR 17770 at commit 8314cc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya commented May 25, 2017

@cloud-fan @gatorsmile Please let me know if you have more comments on this change. Thanks.

 */
 private val boundEnc =
-  exprEnc.resolveAndBind(logicalPlan.output, sparkSession.sessionState.analyzer)
+  exprEnc.resolveAndBind(planWithBarrier.output, sparkSession.sessionState.analyzer)
Contributor

I think we should only use planWithBarrier where necessary; it is obviously unnecessary here.

   s"New column names (${colNames.size}): " + colNames.mkString(", "))

-val newCols = logicalPlan.output.zip(colNames).map { case (oldAttribute, newName) =>
+val newCols = planWithBarrier.output.zip(colNames).map { case (oldAttribute, newName) =>
Contributor

ditto

 @Experimental
 @InterfaceStability.Evolving
-def isStreaming: Boolean = logicalPlan.isStreaming
+def isStreaming: Boolean = planWithBarrier.isStreaming
Contributor

ditto

   sparkSession,
   LogicalRDD(
-    logicalPlan.output,
+    planWithBarrier.output,
Contributor

ditto

 val doubleRepartitioned = testData.repartition(10).repartition(20).coalesce(5)
 def countRepartitions(plan: LogicalPlan): Int = plan.collect { case r: Repartition => r }.length
-assert(countRepartitions(doubleRepartitioned.queryExecution.logical) === 3)
+assert(countRepartitions(doubleRepartitioned.queryExecution.analyzed) === 3)
@cloud-fan commented May 25, 2017

unnecessary change?

Member Author

queryExecution.logical is the raw logical plan, without the analysis barrier eliminated. It fails this test since there is an additional barrier node.

Contributor

ah i see

@cloud-fan

LGTM except some minor comments

SparkQA commented May 26, 2017

Test build #77400 has finished for PR 17770 at commit 6add9ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

thanks, merging to master!

@asfgit closed this in 8ce0d8f on May 26, 2017
@cloud-fan

I think this will cause conflicts when we backport new PRs to branch-2.2. @viirya, can you send a new PR to backport it to branch-2.2? Thanks!

@marmbrus

Whoa, I do not think we should backport a large change to the inner workings of the analyzer.

viirya commented May 30, 2017

Ok, I won't backport this until we reach a consensus.

@cloud-fan

Hi @viirya, since this PR already missed the Spark 2.2 release, I'd like to revert it and re-merge it near the end of the Spark 2.3 cycle, so that future analyzer-related PRs won't conflict when backported to 2.2. I'm sorry about this; I'll mark it as a blocker for Spark 2.3 so that we don't forget.

What do you think?

viirya commented May 31, 2017

@cloud-fan Ok. No problem for me. Thanks.

@cloud-fan

reverted, thanks for your understanding!

@cloud-fan

Hi @viirya , since it's close to Spark 2.3, would you like to reopen this PR? Thanks!

viirya commented Dec 4, 2017

@cloud-fan Sure. It seems there is no option to reopen it since it was merged before. Should I create another PR for it?

@cloud-fan

yea, a new PR sounds good, thanks!
