@XuTingjun
Contributor

def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) }

When _childClusters contains many nodes, the process hangs. I think we can improve the efficiency here.
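For readers following the thread, here is a minimal sketch of the expand-then-filter pattern the description refers to. The `Node` and `Cluster` classes, the `attach*` helpers, and `cachedViaGetAllNodes` are simplified, hypothetical stand-ins and not Spark's actual classes. Only the body of `getAllNodes` mirrors the line quoted above.

```scala
import scala.collection.mutable.ArrayBuffer

object ExpandThenFilterSketch {
  // Hypothetical stand-in for RDDOperationNode: only the fields needed here.
  case class Node(id: Int, cached: Boolean)

  // Hypothetical stand-in for the cluster class under discussion.
  class Cluster {
    private val _childNodes = new ArrayBuffer[Node]
    private val _childClusters = new ArrayBuffer[Cluster]

    def childNodes: Seq[Node] = _childNodes.toSeq
    def attachChildNode(node: Node): Unit = _childNodes += node
    def attachChildCluster(cluster: Cluster): Unit = _childClusters += cluster

    // Same shape as the quoted method: `++` builds a new collection holding
    // this cluster's nodes plus the nodes of every direct child cluster.
    def getAllNodes: Seq[Node] =
      (_childNodes ++ _childClusters.flatMap(_.childNodes)).toSeq

    // Filtering afterwards: every node is gathered first, then most are dropped.
    def cachedViaGetAllNodes: Seq[Node] = getAllNodes.filter(_.cached)
  }

  // Small usage example: one top-level node and a child cluster with two nodes.
  def example(): Seq[Node] = {
    val root = new Cluster
    root.attachChildNode(Node(1, cached = true))
    val child = new Cluster
    child.attachChildNode(Node(2, cached = false))
    child.attachChildNode(Node(3, cached = true))
    root.attachChildCluster(child)
    root.cachedViaGetAllNodes
  }
}
```

In this sketch the intermediate sequence grows with the total number of nodes, even when only a few of them are cached.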

Member

This can be a val, and you can add the initial elements straight away with just ListBuffer(cachedNodes: _*). Below, you are missing a space before for, but rather than a for loop, why not use the following?

_childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached))

Unless I'm overlooking something, that works.

Also, can you explain why you think this is slow to begin with? Is it because the nodes are expanded, then filtered?

Your JIRAs have been missing this so please add clearer motivation.
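The suggestion above (a val ListBuffer seeded with the already-filtered top-level nodes, followed by a foreach that appends with ++=) might look roughly like the sketch below. The `Node` and `Cluster` case classes and the `getCachedNodes` signature are assumptions made for illustration and are not taken from the patch. Like the quoted snippet, it only looks one level into the child clusters.

```scala
import scala.collection.mutable.ListBuffer

object FilterInsideSketch {
  // Hypothetical stand-ins for RDDOperationNode and its enclosing cluster.
  case class Node(id: Int, cached: Boolean)
  case class Cluster(childNodes: Seq[Node], childClusters: Seq[Cluster])

  // Filter each cluster's nodes before appending, so uncached nodes are
  // never copied into the result buffer.
  def getCachedNodes(root: Cluster): Seq[Node] = {
    // Seed the buffer with the cached top-level nodes (the `: _*` form).
    val cachedNodes = ListBuffer(root.childNodes.filter(_.cached): _*)
    // `++=` appends in place; only cached nodes from each child are added.
    root.childClusters.foreach { cluster =>
      cachedNodes ++= cluster.childNodes.filter(_.cached)
    }
    cachedNodes.toList
  }
}
```

Every node is still visited once, so the difference, if any, would come from how many elements are copied into intermediate collections.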

@XuTingjun
Contributor Author

Yeah, I think expanding all the nodes and then filtering every node is slow and costs memory.
Also, when tested with the same case, the old code hangs, but the patch finishes quickly.

Contributor

Nit: "Return all the nodes"

@JoshRosen
Contributor

/cc @andrewor14 for review.

@andrewor14
Contributor

Hi @XuTingjun, can you update the title to something more specific, such as
"RDDOperationGraph: getting cached nodes is slow"?

@andrewor14
Contributor

add to whitelist

Contributor

Maybe I'm missing something, but why is this faster? You're still iterating through all the nodes in the end, so the complexity doesn't change.

Contributor

I see, is it because we clone fewer nodes? AFAIK ++ on ArrayBuffer actually clones the entire thing first.

Contributor Author

yeah, I think so.
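The point about ++ copying can be seen in isolation with a small sketch. The object and method names here are hypothetical; the operators themselves are the standard `scala.collection.mutable.ArrayBuffer` ones. `++` returns a new buffer containing the elements of both operands, while `++=` appends to the existing buffer and returns it.

```scala
import scala.collection.mutable.ArrayBuffer

object ConcatVersusAppend {
  // Returns (is the `++` result the same object as `base`?,
  //          is the `++=` result the same object as `base`?,
  //          length of `base` after the in-place append).
  def demo(): (Boolean, Boolean, Int) = {
    val base = ArrayBuffer(1, 2, 3)
    val extra = ArrayBuffer(4, 5)

    // `++` leaves `base` untouched and allocates a new buffer.
    val concatenated = base ++ extra
    val concatIsSameObject = concatenated eq base

    // `++=` mutates `base` in place and returns `base` itself.
    val appended = base ++= extra
    val appendIsSameObject = appended eq base

    (concatIsSameObject, appendIsSameObject, base.length)
  }
}
```

This does not change the number of nodes visited, which matches the observation above that the overall complexity stays the same. It only affects how many elements are copied along the way.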

@andrewor14
Contributor

Approach looks fine to me. Once you address the comments I'll merge this.

@SparkQA

SparkQA commented Jun 17, 2015

Test build #35049 has finished for PR 6839 at commit f98728b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@XuTingjun changed the title from "[SPARK-8392] Improve the efficiency" to "[SPARK-8392] RDDOperationGraph: getting cached nodes is slow" on Jun 18, 2015
@XuTingjun
Contributor Author

@andrewor14, I have updated the title and code, please have a look again, thanks.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35078 has finished for PR 6839 at commit 53b03ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Thanks, I'm merging this into master and 1.4.

asfgit pushed a commit that referenced this pull request Jun 18, 2015
```
def getAllNodes: Seq[RDDOperationNode] =
{ _childNodes ++ _childClusters.flatMap(_.childNodes) }
```

When `_childClusters` contains many nodes, the process hangs. I think we can improve the efficiency here.

Author: xutingjun <[email protected]>

Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:

53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside

(cherry picked from commit e2cdb05)
Signed-off-by: Andrew Or <[email protected]>
@asfgit closed this in e2cdb05 on Jun 18, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015