[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0 #16037
Conversation
|
See #15963 for first feedback on those changes. |
|
OK, this is the fourth pull request though (not counting a not-quite-related 5th). You don't need to open a new PR to push more changes and it adds to the difficulty in reviewing. This still doesn't incorporate suggestions from the last review. |
|
Following #16038 I suggest this proceed by making the zero value a sparse vector, and then making it dense in the seqOp immediately. |
|
This is all a bit confusing - can we highlight which PR is actually to be reviewed? |
|
This should be the main PR @MLnick |
|
Right ok. So I think the approach of making the zero vector sparse and then converting it to dense in the seqOp should work; the gradient vector must currently be dense in MLlib anyway. Can we see if this works:
val zeroVector = Vectors.sparse(n, Seq())
val (gradientSum, lossSum) = data.treeAggregate((zeroVector, 0.0))(
  seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    val denseGrad = grad.toDense
    val l = localGradient.compute(features, label, bcW.value, denseGrad)
    (denseGrad, loss + l)
  },
  combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    axpy(1.0, grad2, grad1)
    (grad1, loss1 + loss2)
  }) |
|
What worries me more actually is that the initial vector, when sent in the closure, should be compressed. So why is this issue occurring? Is it a problem with serialization / compression? Or is it still too large even after compression? Would be good to understand that. |
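For a sense of scale, here is a quick sketch (not from the thread) comparing the in-memory footprint of the two zero representations that end up captured in the closure; the sizes in the comments are approximate:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.util.SizeEstimator

val n = 16 * 1000 * 1000                     // 16M doubles ~= 128 MB when dense
val denseZeros = Vectors.zeros(n)            // carries n * 8 bytes of zeros
val sparseZeros = Vectors.sparse(n, Seq())   // carries essentially only the dimension

println(SizeEstimator.estimate(denseZeros))  // on the order of 128 MB
println(SizeEstimator.estimate(sparseZeros)) // a few dozen bytes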
Force-pushed from 0a0571d to aab2422
|
Hello @MLnick, sorry that my push of a new version just crossed with your suggestion. I'll test your suggestion. Thanks for having written it out; I didn't get that this was what Sean's hint meant. |
|
@MLnick I had to add a cast of denseGrad from DenseVector to Vector in the expression returned by seqOp. The only ways I could find to do it with the API are: |
Force-pushed from 51b8dd5 to d7ebc7d
Is this really necessary?
I can compile with my example code...
|
ok to test |
|
Test build #69463 has finished for PR 16037 at commit
|
CostFun used to send a dense vector of zeroes as a closure in a treeAggregate call. To avoid that, we ensure the zero value is sent as a SparseVector but converted into a DenseVector immediately in seqOp, thus effectively allocating the (potentially huge) dense vector of zeros on the executors only.
Force-pushed from d7ebc7d to 0ce8c64
|
Scala style checks fixed, and spurious cast removed. |
srowen
left a comment
Yes this LGTM. Let's make sure it passes tests. CC maybe @dbtsai FYI
|
Test build #69466 has finished for PR 16037 at commit
|
|
This seems very related to the change. I know nothing about the way PySpark interfaces Python and Scala, and I'm surprised that changing an internal of the Scala lib causes this? Ready to learn and help though :-) |
I think grad1 here will need to be denseGrad1?
done, thanks for spotting.
Force-pushed from ec183a2 to 9b30c5c
|
Test build #69469 has finished for PR 16037 at commit
|
|
Test build #69470 has finished for PR 16037 at commit
|
|
I'm sure this will be net positive, and shouldn't cause any regression. Still, we must be certain. @AnthonyTruchet can you provide for posterity the detailed test results for the vector sizes you mentioned? And perhaps also some results for some smaller sizes (since I imagine the benefit of this change for that scenario is quite small and we should just check there is no unexpected overhead or regression we've somehow missed from the |
|
L-BFGS is the only optimizer I've used so far. I'm not sure how much time I can free to take care of the other ones, but I'll try :-) Regarding the bench, I'll check if we have archived the results, otherwise I'll relaunch it next week |
|
Yes I'm pretty OK with merging this. If you can dig up any results, that's all the better. Will check in with you next week. |
val zeroSparseVector = Vectors.sparse(n, Seq())
val (gradientSum, lossSum) = data.treeAggregate(zeroSparseVector, 0.0)(seqOp, combOp)
I'm slightly in favor of keeping the parentheses for the zero values. If you don't know the signature of treeAggregate and that the scala compiler evidently packs these into a tuple, you may be confused about what the second argument is. Not a strong preference.
You mean data.treeAggregate((zeroSparseVector, 0.0)), don't you?
Yes :)
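In other words, the form being settled on here passes the zero value as an explicit tuple (which also avoids the auto-tupling compiler warning mentioned later in the thread); a short sketch using the names from the hunk above:
val (gradientSum, lossSum) =
  data.treeAggregate((zeroSparseVector, 0.0))(seqOp, combOp)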
// Adds two (gradient, loss) tuples
val combOp = (c1: (Vector, Double), c2: (Vector, Double)) =>
  (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    val denseGrad1 = grad1.toDense
This is necessary because we may have empty partitions, right? Might be nice to add a test explicitly for this (it seems it failed in pyspark without it, but that was just a coincidence?) so someone doesn't remove these lines in the future.
Meaning, when would the args ever not be dense? I agree, shouldn't be sparse at this stage, but doing this defensively seems fine since it's a no-op for dense.
Yes. Actually, missing the handling of dense vectors was the cause of the PySpark unit test failure we observed.
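For clarity, a sketch of the combOp being discussed, consistent with the hunks quoted above (assumed to sit inside CostFun.calculate, where Vector and mllib's BLAS axpy are in scope); toDense is a no-op on an already-dense vector, so the defensive conversion only matters when a side still holds the sparse zero value, e.g. for an empty partition:
val combOp = (c1: (Vector, Double), c2: (Vector, Double)) =>
  (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    // Either side may still be the sparse zero value if its partition was empty.
    val denseGrad1 = grad1.toDense
    val denseGrad2 = grad2.toDense
    // denseGrad1 += 1.0 * denseGrad2
    axpy(1.0, denseGrad2, denseGrad1)
    (denseGrad1, loss1 + loss2)
  }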
    (denseGrad, loss + l)
  }

// Adds two (gradient, loss) tuples
I don't think this comment is necessary. The semantics of a combine op should be well understood.
OK
|
@MLnick Yeah, this is likely a problem with all the ML aggregators as well. We can probably take care of it using lazy evaluation. |
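For illustration, a minimal sketch of that lazy-evaluation idea (the class and member names here are invented, not the actual Spark aggregators): the gradient buffer is a lazy val, so it is not materialized on the driver when the freshly constructed aggregator is shipped to the executors, only on first use.
class ExampleGradientAggregator(numCoefficients: Int) extends Serializable {
  private var lossSum = 0.0
  private var weightSum = 0.0
  // Stays uninitialized when the empty aggregator is serialized from the driver;
  // allocated on the executor on the first add(), and shipped back with its contents.
  private lazy val gradientSumArray: Array[Double] = Array.ofDim[Double](numCoefficients)

  def add(gradient: Array[Double], loss: Double): this.type = {
    var i = 0
    while (i < numCoefficients) { gradientSumArray(i) += gradient(i); i += 1 }
    lossSum += loss
    weightSum += 1.0
    this
  }

  def merge(other: ExampleGradientAggregator): this.type = {
    if (other.weightSum > 0.0) {
      var i = 0
      while (i < numCoefficients) { gradientSumArray(i) += other.gradientSumArray(i); i += 1 }
      lossSum += other.lossSum
      weightSum += other.weightSum
    }
    this
  }
}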
srowen
left a comment
Agree with @sethah's two comments regarding the comment and the arg to treeAggregate (that generates a warning, I think). Then I think it's good to go.
|
Test build #69722 has finished for PR 16037 at commit
|
|
val combOp = (c1: (Vector, Double), c2: (Vector, Double)) =>
  (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    val denseGrad1 = grad1.toDense
I am still pretty strongly in favor of adding a test case explicitly for this. Just make an RDD with at least one empty partition, and be sure that LBFGS will run on it.
I'm sorry, I have no clue how to generate a non-empty RDD with an empty partition. Can you please point me at some entry points so that I can contribute the unit test you request?
What about just parallelizing n elements into more than n partitions? At least one partition must be empty.
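A sketch of what such a test could look like (examples, numFeatures, and sc are assumed to come from the existing LBFGSSuite fixtures; the parameter values are arbitrary): parallelizing into more slices than elements guarantees at least one empty partition, which exercises the sparse zero value in combOp.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

// More partitions than elements => at least one empty partition.
val rddWithEmptyPartitions = sc.parallelize(examples, examples.size + 2)

val (weights, lossHistory) = LBFGS.runLBFGS(
  rddWithEmptyPartitions,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,     // numCorrections
  1e-9,   // convergenceTol
  10,     // maxNumIterations
  0.0,    // regParam
  Vectors.dense(Array.fill(numFeatures)(0.0)))  // initial weights

assert(lossHistory.nonEmpty)  // the optimizer ran despite the empty partitions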
@AnthonyTruchet does that let you add a quick test case? then I think this can be merged.
I'll try to work on this in the next two days. I will not be able to relaunch the actual benchmark we did weeks ago; our internal codebase has changed too much.
I attempted to submit a PR to your branch, but with the combination of sketchy wifi and git, I'm not sure it worked.
I did, thanks, and I merged it. I was just coming back to this contribution when I saw it.
[SPARK-18471] add test
|
Test build #70081 has finished for PR 16037 at commit
|
|
LGTM |
|
Merged to master |
|
Thanks for keeping up with this merge request; I've learned a lot about the contribution process and good practices, and the next contribution will hopefully be much more straightforward. Thanks to the Spark committer team for this great piece of software :-) |
|
Sorry for the late review. Just came back to the US. LGTM too! Thanks. |
What changes were proposed in this pull request?
JIRA: SPARK-19745 (https://issues.apache.org/jira/browse/SPARK-19745)
Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a lazy val, which avoids materializing a large array on the driver before shipping the class to the executors. This improvement stems from #16037. Actually, probably all ML aggregators can benefit from this.
We can either: a) separate the gradient improvement into another patch, b) keep what's here plus add the lazy evaluation to all other aggregators in this patch, or c) keep it as is.
How was this patch tested?
This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and compare the size to what we know it should be: numCoefficients * 8 bytes. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking.
Author: sethah <[email protected]>
Closes #17076 from sethah/svc_agg.
What changes were proposed in this pull request?
CostFun used to send a dense vector of zeroes as a closure in a
treeAggregate call. To avoid that, we replace treeAggregate by
mapPartition + treeReduce, creating a zero vector inside the mapPartition
block in-place.
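For reference, a sketch of the mapPartitions + treeReduce pattern named here, written with the names from the CostFun code quoted earlier in the thread (n, data, localGradient, bcW, axpy); it illustrates the pattern and is not necessarily the exact merged diff. The dense zero gradient is allocated inside each partition, so it never travels in the task closure.
val (gradientSum, lossSum) = data
  .mapPartitions { iter =>
    // Allocated on the executor, once per partition; never serialized from the driver.
    val gradient = Vectors.zeros(n).toDense
    var loss = 0.0
    iter.foreach { case (label, features) =>
      loss += localGradient.compute(features, label, bcW.value, gradient)
    }
    Iterator.single((gradient, loss))
  }
  .treeReduce { case ((grad1, loss1), (grad2, loss2)) =>
    // grad1 += grad2 (both dense), summing the partial losses alongside.
    axpy(1.0, grad2, grad1)
    (grad1, loss1 + loss2)
  }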
How was this patch tested?
Unit tests for the mllib module were run locally for correctness.
As for performance, we ran a heavy optimization on our production data (50 iterations on 128 MB weight vectors) and saw a significant decrease both in runtime and in containers being killed for lack of off-heap memory.