[SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml #15413
Conversation
Test build #66625 has finished for PR 15413 at commit
Test build #66655 has finished for PR 15413 at commit
Force-pushed from a1e901b to 8b94909
Test build #67414 has finished for PR 15413 at commit
Test build #67448 has finished for PR 15413 at commit
Test build #67509 has finished for PR 15413 at commit
Do you plan to run performance tests to ensure there are no regressions?
@sethah I did some performance tests, and found this change improves performance by 1.5x ~ 2x depending on the dimension and the number of clusters. I will post the test results soon. Thanks.
sethah left a comment
Made a first pass. I haven't looked at the tests yet.
Also, about keeping the mllib code around: it would be really quite simple to get around the issues you mentioned. We can do as in `LogisticRegression` and make a private var `optInitialModel` for now. For `k`, we could make an alternate private `train` constructor which takes `k` as an argument. Still, I'm ok with leaving it for a future PR, but I don't think it's being blocked by those issues. Let me know what you think on this issue.
I think it would be nice to factor this out into an initialization method so we can just call `val gaussians = initRandom(...)` or similar.
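A minimal sketch of what such a factored-out initializer might look like (the name `initRandom` comes from the comment above; the signature and body are assumptions, and covariance initialization is omitted for brevity):

```scala
import org.apache.spark.ml.linalg.{DenseVector, Vector}
import org.apache.spark.rdd.RDD

// Sketch only: sample 2k points, start with uniform weights, and average
// each pair of samples to seed a cluster mean.
def initRandom(
    instances: RDD[Vector],
    k: Int,
    numFeatures: Int,
    seed: Long): (Array[Double], Array[DenseVector]) = {
  val samples = instances.takeSample(withReplacement = true, k * 2, seed)
  val weights = Array.fill(k)(1.0 / k)
  val means = Array.tabulate(k) { i =>
    val mean = new Array[Double](numFeatures)
    var s = 0
    while (s < 2) {
      val v = samples(i * 2 + s)
      var j = 0
      while (j < numFeatures) {
        mean(j) += v(j) / 2.0
        j += 1
      }
      s += 1
    }
    new DenseVector(mean)
  }
  (weights, means)
}
```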
use local pointers to avoid calling virtual methods each iteration.
use `new Array[Double](size)` instead of `Array.fill(size)(0.0)`
use local pointers `localNewWeights`, `localNewMeans`, `localNewCovs`
I thought the number of iterations (equal to the number of clusters) was small enough to ignore the impact, compared with the dimension of the gradient in LiR/LoR, which is usually as large as millions or billions, but it's better if we can avoid the extra cost.
This will allocate an intermediate zipped array. Maybe we can use a while loop and also collect `pSum` inside it. We should use a `localGaussians` reference as well.
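A hedged sketch of that rewrite (the helper name is illustrative; `MultivariateGaussian.pdf` is the ml distribution API):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Sketch: compute the weighted densities and pSum in one while loop,
// avoiding the intermediate array that weights.zip(gaussians) allocates.
def weightedDensities(
    weights: Array[Double],
    gaussians: Array[MultivariateGaussian],
    instance: Vector): (Array[Double], Double) = {
  val localGaussians = gaussians  // local reference, as suggested
  val p = new Array[Double](weights.length)
  var pSum = 0.0
  var i = 0
  while (i < weights.length) {
    p(i) = weights(i) * localGaussians(i).pdf(instance)
    pSum += p(i)
    i += 1
  }
  (p, pSum)
}
```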
let's use `while` since this is called in several places
"Convert an n * (n + 1) / 2 dimension array representing the upper triangular part of a matrix into an n * n array representing the full symmetric matrix". I think that's more explicit about what is happening. Also very minor nit, can we call the array triangularValues instead? triangular sounds like it should be a boolean to me.
nit: `25.0`. Also, let's call it `numFeatures` instead of `d`
minor: why not `logLikelihood` and `logLikelihoodPrev`? It's nice to have descriptive variable names; then we can remove the comments.
We have typically used this documentation as a place to explain the math used to compute updates. It would be nice to have that here as well.
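For reference, the standard EM updates for a Gaussian mixture, which that documentation could spell out (textbook formulas, not copied from the PR):

```math
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}, \qquad
\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad
\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_i}{\sum_{i=1}^{N} \gamma_{ik}}, \qquad
\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^{N} \gamma_{ik}}
```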
As far as keeping the code around, I much prefer either the current approach (separate code) or having spark.mllib call into spark.ml. That will make it easier to deprecate and eventually remove spark.mllib code in 3.0. I like the upper-triangular matrix packing and unpacking! Could you please add a unit test for it?
Force-pushed from 9617076 to b2c2fa0
@sethah Yeah, I totally agree we can get around the issues I mentioned and make
Test build #68332 has finished for PR 15413 at commit
Test build #68996 has finished for PR 15413 at commit
Force-pushed from b2c2fa0 to 0bad9e7
@jkbradley @sethah Any more comments? Thanks.
Test build #70550 has finished for PR 15413 at commit
I'll take a look, thanks for pinging!
jkbradley left a comment
Thanks for the nice PR! I only found small things to comment on.
Document that the matrix is in column-major order.
Elsewhere too (e.g., in `ExpectationAggregator`)
style: space after "while"
And in several other places
Why do you need to make these local copies?
This is because we are actually invoking a getter method when we call `this.newWeights`. To improve the performance of the loop at L655, we should use explicit pointers to the values rather than call the getter each time. It's probably not a big deal in this case since k is usually not very big, but I don't think it's a bad idea.
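A small sketch of the pattern (class and method names are illustrative, not the PR's actual code):

```scala
// The getter is invoked on every access; hoisting its result into a local
// val means the hot loop reads a local slot instead.
class AggregatorSketch(k: Int) {
  private val weights = new Array[Double](k)
  def newWeights: Array[Double] = weights  // getter

  def scale(weightSum: Double): Unit = {
    val localNewWeights = newWeights  // single getter call, hoisted
    var i = 0
    while (i < k) {
      localNewWeights(i) /= weightSum
      i += 1
    }
  }
}
```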
Oh, I see. I'm not sure about this either. Does the JIT compiler adjust enough to make it efficient? The way it is seems fine though.
Yeah, I don't think it's necessary here because this is inside the merge operation, which will be called far less often than the add operation. It may be overkill in any event, but I'm also ok leaving it.
Below here, we call `logNumFeatures`. This isn't part of your PR, but could you move it earlier, since `numFeatures` is available before running the algorithm?
You always use this right away by converting it to a `DenseMatrix`, so how about just returning a `DenseMatrix`?
python/pyspark/ml/clustering.py (outdated)
I like the table for documentation though. Does using fewer digits stabilize it?
Set the seed in this and other tests
Simplify: `point.toSparse`
This and the previous tests are almost the same. How about combining them via a helper function?
Force-pushed from 0bad9e7 to b6e9d5f
Test build #70976 has finished for PR 15413 at commit
@jkbradley I addressed most of your comments. Thanks.
```scala
  modelEquals(expected, actual)
}

test("univariate sparse data with two clusters") {
```
This and the previous tests are almost the same. How about combining them via a helper function? (I see you abstracted out part of it.)
Actually, don't bother. This looks ready.
I'd be in favor of merging them since they are so nearly identical
Yeah, I went ahead and merged them together. Thanks.
This LGTM. @sethah Any further comments before we merge it?
I did a quick pass and it looks pretty good. I'll take a more thorough look at the tests this weekend, but if you want to merge it, I think any of those items could be addressed in follow-ups.
OK, I'll just wait so @sethah can make a final pass and @yanboliang can merge the two tests.
Test build #71014 has finished for PR 15413 at commit
sethah left a comment
Just a couple of comments on testing. Nice work on the performance improvement :)
```scala
  }
}

test("check distributed decomposition") {
```
This test only checks that when we distribute the computation, it produces a model, i.e. that it doesn't fail. So, AFAICT we don't have any test right now that checks that distributing the computation produces a correct model. I think it's a good idea to have that here.
This is because the model is big and it's tedious to construct it in advance. In this model, `gaussians` (the array of `MultivariateGaussian`) contains 5 elements, and each element contains a mean array of length 50 and a 50 x 50 covariance matrix.
I played with this a bit and wrote a test that generates two well-separated clusters and runs with distributed computation, so we could check that the model learns approximately the correct cluster means. However, I found that the algorithm seems incapable of learning even this very contrived example, due to the initialization method.
I still think it's a good test to have, but if you feel strongly against it then let's leave it. Otherwise it could be a follow-up (along with more investigation into the initialization method, which does not seem to be effective).
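For illustration, the kind of test being described might look like this sketch (the dimension threshold, tolerances, and data layout are assumptions; as noted above, random initialization may keep it from passing):

```scala
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors

// Two well-separated clusters in a dimension assumed high enough to take
// the distributed aggregation path; then roughly check the learned means.
val rng = new scala.util.Random(42L)
val numFeatures = 60
val points = Seq.tabulate(200) { i =>
  val center = if (i % 2 == 0) 0.0 else 100.0
  Vectors.dense(Array.fill(numFeatures)(center + rng.nextGaussian()))
}
val df = points.map(Tuple1.apply).toDF("features")  // assumes spark.implicits._ in scope

val model = new GaussianMixture().setK(2).setSeed(1L).fit(df)
val sortedMeans = model.gaussians.map(_.mean(0)).sorted
assert(math.abs(sortedMeans(0) - 0.0) < 1.0)
assert(math.abs(sortedMeans(1) - 100.0) < 1.0)
```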
Yeah, I also suffer from bad initialization in some of my use cases, so I think we should push to resolve SPARK-15785 first. It will be easier to add a correctness test once we support an initial model. I'll leave this as a follow-up and opened SPARK-19144 to track it. Thanks.
```scala
  testEstimatorAndModelReadWrite(gm, dataset,
    GaussianMixtureSuite.allParamSettings, checkModelData)
}
```
In most of the other test suites in ML, we have a test that checks the prediction/transform methods: for example, checking that the prediction always matches the highest probability, and that the probabilities sum to one. I don't see much reason to diverge from that pattern here. What do you think @yanboliang?
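A sketch of such a check (default column names assumed; the tolerance is illustrative):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Every prediction should be the argmax of its probability vector, and
// each probability vector should sum to one.
model.transform(dataset).select("prediction", "probability").collect().foreach {
  case Row(pred: Int, prob: Vector) =>
    assert(math.abs(prob.toArray.sum - 1.0) < 1e-6)
    assert(pred == prob.argmax)
}
```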
Sounds good, updated.
```scala
def modelEquals(m1: GaussianMixtureModel, m2: GaussianMixtureModel): Unit = {
  assert(m1.weights.length === m2.weights.length)
  for (i <- m1.weights.indices) {
    assert(m1.gaussians(i).mean ~== m2.gaussians(i).mean absTol 1E-3)
```
Why not also check the weights here?
Oops, forgot it, added. Thanks.
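The added assertion presumably mirrors the mean check above, using the same `~==` test helper, something like:

```scala
assert(m1.weights(i) ~== m2.weights(i) absTol 1E-3)
```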
Test build #71034 has finished for PR 15413 at commit
Test build #71035 has finished for PR 15413 at commit
Left one small comment which isn't a blocker. LGTM otherwise.
Merged into master. Thanks for all the reviews.
What changes were proposed in this pull request?
Copy the `GaussianMixture` implementation from mllib to ml, then we can add new features to it.

I left mllib `GaussianMixture` untouched, unlike some other algorithms that wrap the ml implementation, for the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting an initial model, but ml does not support this currently. (We will definitely add this feature to ml in the future.)

We can get around these issues to make mllib a wrapper calling into ml, but I'd prefer to leave mllib untouched, which keeps ml clean.
Meanwhile, there is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduced the shuffled data size by about 50% and accelerated job execution.

Before this PR:


After this PR:

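For illustration, packing the upper triangle before the shuffle could look like this sketch (column-major layout assumed; it is the inverse of the unpacking helper sketched earlier in the review):

```scala
// Keep only the n * (n + 1) / 2 upper-triangular values of a column-major
// symmetric n x n matrix, roughly halving the shuffled covariance data.
def packUpperTriangular(n: Int, symmetricValues: Array[Double]): Array[Double] = {
  val packed = new Array[Double](n * (n + 1) / 2)
  var idx = 0
  var j = 0
  while (j < n) {
    var i = 0
    while (i <= j) {
      packed(idx) = symmetricValues(j * n + i)
      idx += 1
      i += 1
    }
    j += 1
  }
  packed
}
```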
How was this patch tested?
Existing tests and added new tests.