-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-8392] RDDOperationGraph: getting cached nodes is slow #6839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be a val, and can add the initial elements straight away with just ListBuffer(cachedNodes:_*). Below you have a missing space before for, but better than a for loop, why not
_childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached))
? Unless I overlook something that works.
Also, can you explain why you think this is slow to begin with? because nodes are expanded, then filtered?
Your JIRAs have been missing this so please add clearer motivation.
|
Yeah, I think expand all nodes then filter every node, is slow and cost memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "Return all the nodes"
|
/cc @andrewor14 for review. |
|
Hi @XuTingjun can you update the title to something more specific: |
|
add to whitelist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I'm missing something, but why is this faster? You're still iterating through all the nodes in the end so the complexity doesn't change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, is it because we clone fewer nodes? AFAIK ++ on ArrayBuffer actually clones the entire thing first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think so.
|
Approach looks fine to me. Once you address the comments I'll merge this. |
|
Test build #35049 has finished for PR 6839 at commit
|
|
@andrewor14, I have updated the title and code, please have a look again, thanks. |
|
Test build #35078 has finished for PR 6839 at commit
|
|
Thanks, I'm merging this into master 1.4. |
```def getAllNodes: Seq[RDDOperationNode] =
{ _childNodes ++ _childClusters.flatMap(_.childNodes) }```
when the ```_childClusters``` has so many nodes, the process will hang on. I think we can improve the efficiency here.
Author: xutingjun <[email protected]>
Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:
53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside
(cherry picked from commit e2cdb05)
Signed-off-by: Andrew Or <[email protected]>
```def getAllNodes: Seq[RDDOperationNode] =
{ _childNodes ++ _childClusters.flatMap(_.childNodes) }```
when the ```_childClusters``` has so many nodes, the process will hang on. I think we can improve the efficiency here.
Author: xutingjun <[email protected]>
Closes apache#6839 from XuTingjun/DAGImprove and squashes the following commits:
53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside
def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) }when the
_childClustershas so many nodes, the process will hang on. I think we can improve the efficiency here.