[SPARK-16719][ML] Random Forests should communicate fewer trees on each iteration #14359
Conversation
…ss only trees in each group
@hhbyyh This is an improvement I had implemented a while back, just a little too late for the 2.0 code freeze. Could you please help review it or find others? Thank you!
```scala
    q += ((treeIndex, node))
  }

/** Remove and return last inserted element. Linear time (unclear in Scala docs though) */
```
Note: This is not ideal, but it should be insignificant compared to the cost of communicating the trees. I am not aware of an existing solution for a stack with constant-time push/pop in Scala.
"a FILO queue" is an unconventional saying. I like it though.
Is there some reason not to use scala.collection.mutable.Stack[(Int, LearningNode)] here?
How critical is this for performance? I know that append is O(1) for Scala mutable lists, but dropRight(1) is actually implemented in a parent class [1] of MutableList and therefore does not take advantage of the list's reference to its last element. All in all, from its implementation it seems that dropRight is O(n).
Whoops, I missed that List.tail is O(1) time. I switched to a List. Thanks all!
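For reference, a minimal sketch of the List-as-stack pattern mentioned above; the element type is a stand-in here, whereas the real code uses (treeIndex, LearningNode) pairs:

```scala
// Sketch only: using an immutable List as a LIFO stack.
// (treeIndex, label) stands in for the PR's (treeIndex, LearningNode).
var stack: List[(Int, String)] = Nil

// Push is O(1): prepend with ::
stack = (0, "rootOfTree0") :: stack
stack = (1, "rootOfTree1") :: stack

// Pop is O(1): head gives the most recently inserted element, tail drops it.
val top = stack.head    // (1, "rootOfTree1")
stack = stack.tail
```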
I also somehow didn't find Stack via Google... Does anyone know of documentation for the performance of Stack? I don't see it in the Scala docs.
Ack. I'll review it and run tests tonight.
Test build #62858 has finished for PR 14359 at commit
If it is not urgent, I'd like to try some large scale training to understand more about the improvements.
Not urgent, but I'd like it to be in 2.1.
@jkbradley , you can find the scaladoc on stacks here http://www.scala-lang.org/api/current/index.html#scala.collection.mutable.Stack Also this document http://docs.scala-lang.org/overviews/collections/performance-characteristics gives a nice overview of the different collection types in Scala and their performance characteristics.
Test build #62900 has finished for PR 14359 at commit
Thanks @jodersky I saw those, but the first does not document computational cost & the latter does not really clarify what I need for stacks (push and pop).
Agree, it's not very obvious. In the latter document I think a …
Sorry for the long delay; I've been swamped by other things for a while. Re-emerging... I switched to Stack and then realized Stack has been deprecated in Scala 2.11, so I reverted to the original NodeQueue. But I renamed NodeQueue to NodeStack to be a bit clearer. @hhbyyh Any luck testing this at scale?
Btw, to give back-of-the-envelope estimates, we can look at 2 numbers: (1) how many trees have nodes split on each iteration, and (2) the cost of sending the forest. For (1), here's an example:

This implies that for trees of depth > 8 or so, many iterations will only split nodes from 1 or 2 trees. I.e., we should avoid communicating most trees. For (2), the forest can be pretty expensive to send.
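To make the depth argument concrete, here is a rough illustrative calculation; the per-iteration node budget below is a hypothetical number, not taken from Spark's actual memory-based grouping:

```scala
// A binary tree's frontier at depth d has up to 2^d nodes. If one
// iteration can split roughly `budget` nodes (limited by memory), then
// once a single tree's frontier reaches the budget, an iteration is
// filled by nodes from just one tree.
val budget = 256  // hypothetical per-iteration node budget
for (d <- 0 to 10) {
  val frontier = 1 << d                       // up to 2^d frontier nodes per tree
  val treesPerIteration = math.max(1, budget / frontier)
  println(s"depth=$d frontier=$frontier treesTouched<=$treesPerIteration")
}
// At depth 8, 2^8 = 256: one tree's frontier alone can fill the budget.
```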
Test build #63822 has finished for PR 14359 at commit
I think you probably read the immutable stack docs; the mutable stack is not deprecated AFAIK. I can imagine that having a custom stack implementation may allow for additional operations in the future, however we should also consider that using standard collections reduces the load for anyone who will maintain the code then. Btw, I highly recommend to use the milestone scaladocs over the current ones. Although 2.12 is not officially out yet, the changes to the library are minimal and the UI is much more pleasant to use.
Ahh, you're right; I was looking at immutable. I'll update to use the mutable stack. Thanks!
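For what it's worth, a minimal sketch of the scala.collection.mutable.Stack usage being discussed; LearningNode is replaced by a String stand-in here:

```scala
import scala.collection.mutable

// Stack of (treeIndex, node) pairs; push and pop are O(1).
val nodeStack = mutable.Stack.empty[(Int, String)]
nodeStack.push((0, "rootOfTree0"))
nodeStack.push((1, "rootOfTree1"))
val (treeIndex, node) = nodeStack.pop()  // LIFO: (1, "rootOfTree1")
```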
Done!
Test build #63886 has finished for PR 14359 at commit
```scala
val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
/*
  FILO queue of nodes to train: (treeIndex, node)
  We make this FILO by always inserting nodes by appending (+=) and removing with dropRight.
```
The methods described here still refer to the original queue data structure.
Some comments still refer to the use of queue and should be updated. Other than that, the data structure part now looks good to me.
Thanks @jodersky ! Updated.
Test build #64020 has finished for PR 14359 at commit
Hi Joseph, sorry for the late response. I was occupied by a customer Spark project for the past month. The idea looks reasonable, and I tested with the MNIST dataset: the overall run time decreased from 245 seconds to 225 seconds on average. LGTM.
```scala
val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
/*
  Stack of nodes to train: (treeIndex, node)
  The reason this is FILO is that we train many trees at once, but we want to focus on
```
"The reason this is FILO" -> "The reason we use a stack"
This is a really nice improvement. The communication overhead is reduced, based on some simple local tests. I wonder how we can add a test to verify that the algorithm focuses on completing whole trees at once. Potentially, we can add a test of …

Also, it might not be too hard to take this a step further. We could group the nodes to be trained by tree, and keep track of the amount of memory they require. Then, to select nodes to split, we can simply pick off the trees that require the most memory until we exceed the threshold. This way we truly minimize the number of trees while still staying within the memory budget. We could leave it for another JIRA.
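A sketch of this follow-up idea, with hypothetical types and costs; this is illustrative only, not Spark code:

```scala
// Hypothetical sketch of the suggested follow-up: group pending nodes
// by tree and greedily select whole trees, largest memory cost first,
// skipping trees that no longer fit the budget. All names here are
// made up for illustration.
case class PendingNode(treeIndex: Int, memUsage: Long)

def selectTrees(pending: Seq[PendingNode], memBudget: Long): Seq[Int] = {
  val perTreeMem: Seq[(Int, Long)] = pending
    .groupBy(_.treeIndex)
    .map { case (tree, nodes) => tree -> nodes.map(_.memUsage).sum }
    .toSeq
    .sortBy { case (_, mem) => -mem }  // most expensive trees first

  val picked = scala.collection.mutable.ArrayBuffer.empty[Int]
  var used = 0L
  for ((tree, mem) <- perTreeMem) {
    // Always take at least one tree so progress is guaranteed.
    if (picked.isEmpty || used + mem <= memBudget) {
      picked += tree
      used += mem
    }
  }
  picked.toSeq
}
```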
LGTM pending tests.
Test build #65798 has finished for PR 14359 at commit
Merging with master |
What changes were proposed in this pull request?
RandomForest currently sends the entire forest to each worker on each iteration. This is because (a) the node queue is FIFO and (b) the closure references the entire array of trees (topNodes). (a) causes RFs to handle splits in many trees, especially early on in learning. (b) sends all trees explicitly.
This PR:
(a) Changes the RF node queue to be FILO (a stack), so that RFs tend to focus on 1 or a few trees before focusing on others.
(b) Changes topNodes to pass only the trees required on that iteration.
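To illustrate the effect of (a), here is a toy comparison (not the Spark implementation) of FIFO vs. LIFO node processing across two small trees, where (treeIndex, depth) pairs stand in for (treeIndex, LearningNode):

```scala
import scala.collection.mutable

// Toy illustration: two trees whose nodes each expand into two children
// until depth 2. We record which tree each processed node belongs to,
// under FIFO (queue) vs. LIFO (stack) ordering.
def order(useStack: Boolean): Seq[Int] = {
  val visited = mutable.ArrayBuffer.empty[Int]
  var stack = List((0, 0), (1, 0))        // (treeIndex, depth)
  val queue = mutable.Queue((0, 0), (1, 0))
  def nonEmpty = if (useStack) stack.nonEmpty else queue.nonEmpty
  def next(): (Int, Int) =
    if (useStack) { val h = stack.head; stack = stack.tail; h }
    else queue.dequeue()
  def add(n: (Int, Int)): Unit =
    if (useStack) stack = n :: stack else queue.enqueue(n)
  while (nonEmpty) {
    val (tree, depth) = next()
    visited += tree
    if (depth < 2) { add((tree, depth + 1)); add((tree, depth + 1)) }
  }
  visited.toSeq
}
// order(useStack = false) interleaves nodes from trees 0 and 1;
// order(useStack = true) finishes all of one tree before the other.
```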
How was this patch tested?
Unit tests: