[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0 #15905
Conversation
CostFun used to send a dense vector of zeroes as part of a closure in a treeAggregate call. To avoid that, we replace treeAggregate with mapPartitions + treeReduce, creating the zero vector inside the mapPartitions block instead.
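The change can be sketched as follows. This is a minimal illustration, not the actual MLlib code: `addInPlace`, `gradientAndLoss`, and the way the per-point gradient and loss are computed are placeholders standing in for `CostFun`'s real `Gradient.compute` logic.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Placeholder for the real gradient update: accumulate a vector into an array.
def addInPlace(acc: Array[Double], v: Vector): Unit =
  v.foreachActive((i, x) => acc(i) += x)

// Before (problematic pattern): the zero value handed to treeAggregate is
// captured in the task closure, so a dense n-dimensional vector of zeroes
// is serialized and shipped with every task:
//
//   data.treeAggregate((Vectors.zeros(n), 0.0))(seqOp, combOp)
//
// After: the zero vector is allocated inside mapPartitions, on the
// executor, so the shipped closure contains no large objects.
def gradientAndLoss(data: RDD[(Double, Vector)], n: Int): (Vector, Double) =
  data.mapPartitions { it =>
    val grad = Array.ofDim[Double](n)  // allocated per task, never shipped
    var loss = 0.0
    it.foreach { case (label, features) =>
      addInPlace(grad, features)       // placeholder gradient update
      loss += label                    // placeholder loss update
    }
    Iterator.single((Vectors.dense(grad), loss))
  }.treeReduce { case ((g1, l1), (g2, l2)) =>
    // g1 is a partial result local to this reduce step, so reusing its
    // backing array for the merge is safe here.
    val merged = g1.toArray
    addInPlace(merged, g2)
    (Vectors.dense(merged), l1 + l2)
  }
```

Only the closure's captured variables travel to executors; since the new closures capture just `n` (an `Int`), the per-task serialization cost no longer scales with the feature dimension.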
Can one of the admins verify this patch?
srowen
left a comment
Oh, I get it: it's just that you found that the zero values accidentally get into those function closures. Yes, sounds good if it can be rewritten to avoid that. I think there are more instances of this pattern, though, in NaiveBayes, ALS, etc. It would be cool to break out these operations even just for code clarity, but especially if it avoids some silent overhead.
 * tuples, updates the current gradient and current loss
 */
val seqOp = (c: (Vector, Double), v: (Double, Vector)) => {
  (c, v) match { case ((grad, loss), (label, features)) =>
Nit: unindent 2 spaces and you can remove the outer braces? Same in the next function.
OK, will do it.
  }
}

val (gradientSum, lossSum) = data.mapPartitions(it => {
Nit: .mapPartitions { it =>
By the way, do you think this should be addressed in core, or just in each ML-specific use?
I personally think it's good to be consistent. I think it's more readable to break out these function definitions, and it seems like there's evidence it might avoid some unintended objects in a closure. Have a look for other instances of "seqOp = ..." etc. and see which ones look like the same pattern that could be refactored.
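The "break out the function definitions" style the reviewer suggests might look like the following toy sketch. The types and update logic here are purely illustrative (an `Array[Double]` accumulator and a trivial per-element update), not the real MLlib operators:

```scala
// Named operators instead of inline closures: what each function captures
// (nothing, here) is explicit, and treeAggregate call sites stay short.

// Fold one data point (label, index) into the running (gradient, loss).
// The update rule is a placeholder for a real gradient computation.
def seqOp(c: (Array[Double], Double), v: (Double, Int)): (Array[Double], Double) = {
  val (grad, loss) = c
  val (label, idx) = v
  grad(idx) += label
  (grad, loss + label)
}

// Merge two partial (gradient, loss) results.
def combOp(c1: (Array[Double], Double), c2: (Array[Double], Double)): (Array[Double], Double) = {
  val merged = c1._1.clone()
  var i = 0
  while (i < merged.length) { merged(i) += c2._1(i); i += 1 }
  (merged, c1._2 + c2._2)
}
```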
I missed part of my company guidelines. Closing this PR and creating a new one shortly from my company account. Sorry for the noise.
What changes were proposed in this pull request?
CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector inside the
mapPartitions block instead.
How was this patch tested?
Tests run by hand locally.
(Setting up local infrastructure to run the official Spark tests is in progress.)