[SPARK-8392] RDDOperationGraph: getting cached nodes is slow #6839

XuTingjun · 2015-06-16T09:18:13Z

def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) }

When _childClusters contains many nodes, the process hangs. I think we can improve the efficiency here.
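To make the concern concrete, here is a minimal sketch that contrasts expanding every node before filtering with filtering inside each cluster. `Node`, `Cluster`, `cachedViaExpand`, `cachedViaFilterInside`, and `sample` are hypothetical stand-ins written for this illustration; they are not the Spark classes or the code in this patch.

```scala
import scala.collection.mutable.ArrayBuffer

object CachedNodesSketch {
  // Hypothetical stand-in for RDDOperationNode; only the fields needed here.
  final case class Node(id: Int, cached: Boolean)

  // Hypothetical stand-in for the cluster class that owns getAllNodes.
  final class Cluster {
    private val _childNodes = new ArrayBuffer[Node]
    private val _childClusters = new ArrayBuffer[Cluster]

    def childNodes: Seq[Node] = _childNodes.toSeq
    def attachChildNode(node: Node): Unit = _childNodes += node
    def attachChildCluster(cluster: Cluster): Unit = _childClusters += cluster

    // Shape quoted in the description: one combined collection of every node.
    def getAllNodes: Seq[Node] =
      (_childNodes ++ _childClusters.flatMap(_.childNodes)).toSeq

    // Expand, then filter: all nodes are copied before any are discarded.
    def cachedViaExpand: Seq[Node] = getAllNodes.filter(_.cached)

    // Filter inside: only cached nodes reach the combined collection.
    def cachedViaFilterInside: Seq[Node] =
      (_childNodes.filter(_.cached) ++
        _childClusters.flatMap(_.childNodes.filter(_.cached))).toSeq
  }

  // Small fixture: two nodes at the root, two in a nested cluster.
  def sample(): Cluster = {
    val root = new Cluster
    root.attachChildNode(Node(1, cached = false))
    root.attachChildNode(Node(2, cached = true))
    val child = new Cluster
    child.attachChildNode(Node(3, cached = false))
    child.attachChildNode(Node(4, cached = true))
    root.attachChildCluster(child)
    root
  }

  def main(args: Array[String]): Unit = {
    val root = sample()
    // Both strategies select the same nodes; they differ in how much is copied.
    assert(root.cachedViaExpand == root.cachedViaFilterInside)
    println(root.cachedViaFilterInside.map(_.id).mkString(", "))
  }
}
```

Both methods return the same nodes; the second simply avoids materializing the uncached ones in the intermediate collection.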

srowen · 2015-06-16T11:56:28Z

core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala

This can be a val, and you can add the initial elements straight away with just ListBuffer(cachedNodes: _*). Below, you are missing a space before for; but rather than a for loop, why not

_childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached))

? Unless I'm overlooking something, that works.

Also, can you explain why you think this is slow to begin with? Is it because nodes are expanded, then filtered?

Your JIRAs have been missing this so please add clearer motivation.
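The suggestion above (seed a `ListBuffer` with the initial elements, then append per cluster with `foreach`) could look roughly like the sketch below. The `Node` and `Cluster` case classes and the `cachedNodes` helper are assumptions made for this example, not the code under review.

```scala
import scala.collection.mutable.ListBuffer

object ForeachAccumulateSketch {
  // Hypothetical, simplified stand-ins for the Spark UI types.
  final case class Node(id: Int, cached: Boolean)
  final case class Cluster(childNodes: Seq[Node], childClusters: Seq[Cluster])

  def cachedNodes(root: Cluster): List[Node] = {
    // A val is enough: the buffer is mutated in place and never reassigned.
    val acc = ListBuffer(root.childNodes.filter(_.cached): _*)
    // Append each child cluster's cached nodes instead of writing a for loop.
    root.childClusters.foreach(cluster => acc ++= cluster.childNodes.filter(_.cached))
    acc.toList
  }

  def main(args: Array[String]): Unit = {
    val root = Cluster(
      Seq(Node(1, cached = true), Node(2, cached = false)),
      Seq(Cluster(Seq(Node(3, cached = true)), Nil)))
    println(cachedNodes(root).map(_.id).mkString(", "))
  }
}
```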

XuTingjun · 2015-06-16T12:28:40Z

Yeah, I think expanding all the nodes and then filtering each one is slow and costs memory.
Also, in a test with the same case, the old code hangs, but the patch finishes quickly.

WangTaoTheTonic · 2015-06-16T13:07:23Z

core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala

Nit: "Return all the nodes"

JoshRosen · 2015-06-16T15:22:42Z

/cc @andrewor14 for review.

andrewor14 · 2015-06-17T18:01:50Z

Hi @XuTingjun can you update the title to something more specific:
"RDDOperationGraph: getting cached nodes is slow" or something?

andrewor14 · 2015-06-17T18:01:57Z

add to whitelist

andrewor14 · 2015-06-17T18:09:11Z

core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala

maybe I'm missing something, but why is this faster? You're still iterating through all the nodes in the end so the complexity doesn't change.

I see, is it because we clone fewer nodes? AFAIK ++ on ArrayBuffer actually clones the entire thing first

yeah, I think so.
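As a small illustration of the copying point discussed above: on a Scala `ArrayBuffer`, `++` returns a new collection and leaves the receiver untouched, while `++=` appends in place and returns the receiver. This sketch only demonstrates that difference in the standard library; it does not measure the Spark code path, and the object name is made up for the example.

```scala
import scala.collection.mutable.ArrayBuffer

object ConcatVsAppendSketch {
  def main(args: Array[String]): Unit = {
    val base = ArrayBuffer(1, 2, 3)

    // ++ builds a new collection holding the elements of both operands.
    val combined = base ++ Seq(4, 5)
    assert(!(combined eq base))
    assert(base == ArrayBuffer(1, 2, 3))
    assert(combined == ArrayBuffer(1, 2, 3, 4, 5))

    // ++= appends to the existing buffer and returns that same buffer.
    val same = base ++= Seq(4, 5)
    assert(same eq base)
    assert(base == ArrayBuffer(1, 2, 3, 4, 5))
  }
}
```

Filtering before concatenating keeps the operands of `++` small, so less is copied into the new collection.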

andrewor14 · 2015-06-17T18:14:01Z

Approach looks fine to me. Once you address the comments I'll merge this.

SparkQA · 2015-06-17T19:52:48Z

Test build #35049 has finished for PR 6839 at commit f98728b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

XuTingjun · 2015-06-18T02:22:23Z

@andrewor14, I have updated the title and code, please have a look again, thanks.

SparkQA · 2015-06-18T03:39:35Z

Test build #35078 has finished for PR 6839 at commit 53b03ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-06-18T05:30:18Z

Thanks, I'm merging this into master and 1.4.

```def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) }``` when the ```_childClusters``` has so many nodes, the process will hang on. I think we can improve the efficiency here.

Author: xutingjun <[email protected]>

Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:

53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside

(cherry picked from commit e2cdb05)
Signed-off-by: Andrew Or <[email protected]>

put the filter inside

81f9fd2

srowen reviewed Jun 16, 2015

put the filter inside

f87c663

WangTaoTheTonic reviewed Jun 16, 2015

fix words: node -> nodes

f98728b

andrewor14 reviewed Jun 17, 2015

XuTingjun changed the title ~~[SPARK-8392] Improve the efficiency~~ [SPARK-8392] RDDOperationGraph: getting cached nodes is slow Jun 18, 2015

change code to more concise and easier to read

53b03ea

asfgit closed this in e2cdb05 Jun 18, 2015
