
Conversation

@yanboliang
Contributor

@yanboliang yanboliang commented Oct 10, 2016

What changes were proposed in this pull request?

Copy the GaussianMixture implementation from mllib to ml so that we can add new features to it.
Unlike some other algorithms, I left the mllib GaussianMixture untouched rather than making it wrap the ml implementation, for the following reasons:

  • mllib GaussianMixture allows k == 1, but ml does not.
  • mllib GaussianMixture supports setting an initial model, but ml does not currently. (We will definitely add this feature to ml in the future.)

We could get around these issues and make mllib a wrapper calling into ml, but I'd prefer to leave mllib untouched, which keeps ml clean.

Meanwhile, there is a big performance improvement for GaussianMixture in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my tests, this change reduces the shuffled data size by about 50% and speeds up job execution.
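To illustrate the idea, here is a minimal sketch of the packing step (the helper name packUpperTriangular is made up for this example; it is not necessarily the code in this PR):

```scala
import org.apache.spark.ml.linalg.DenseMatrix

// Sketch: pack the upper triangular part (column-major) of a symmetric n x n
// covariance matrix into a flat array of length n * (n + 1) / 2, so roughly
// half as much data has to be shuffled during aggregation.
def packUpperTriangular(cov: DenseMatrix): Array[Double] = {
  val n = cov.numCols
  val packed = new Array[Double](n * (n + 1) / 2)
  var idx = 0
  var j = 0
  while (j < n) {
    var i = 0
    while (i <= j) {
      packed(idx) = cov(i, j)
      idx += 1
      i += 1
    }
    j += 1
  }
  packed
}
```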

Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)

How was this patch tested?

Existing tests and added new tests.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66625 has finished for PR 15413 at commit 5a8de4a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66655 has finished for PR 15413 at commit a1e901b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [SPARK-17847] [ML] Copy GaussianMixture implementation from mllib to ml [SPARK-17847][ML][WIP] Copy GaussianMixture implementation from mllib to ml Oct 23, 2016
@SparkQA

SparkQA commented Oct 23, 2016

Test build #67414 has finished for PR 15413 at commit 8b94909.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [SPARK-17847][ML][WIP] Copy GaussianMixture implementation from mllib to ml [SPARK-17847][ML] Copy GaussianMixture implementation from mllib to ml Oct 24, 2016
@yanboliang yanboliang changed the title [SPARK-17847][ML] Copy GaussianMixture implementation from mllib to ml [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml Oct 24, 2016
@SparkQA

SparkQA commented Oct 24, 2016

Test build #67448 has finished for PR 15413 at commit 15deb72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

cc @sethah @jkbradley

@SparkQA

SparkQA commented Oct 25, 2016

Test build #67509 has finished for PR 15413 at commit 9617076.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented Oct 25, 2016

Do you plan to run performance tests to ensure there are no regressions?

@yanboliang
Contributor Author

@sethah I actually did some performance tests and found this change improves performance by 1.5x ~ 2x, depending on the dimension and the number of clusters. I will post the test results soon. Thanks.

@sethah sethah left a comment

Made a first pass. I haven't looked at the tests yet.

Also, about keeping the mllib code around: it would be quite simple to get around the issues you mentioned. We can do as LogisticRegression does and add a private var optInitialModel for now. For k, we could make an alternate private train constructor which takes k as an argument. Still, I'm ok with leaving it for a future PR, but I don't think it's blocked by those issues. Let me know what you think on this issue.

Contributor

I think it would be nice to factor this out into an initialization method so we can just call val gaussians = initRandom(...) or similar.
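A hypothetical sketch of what such a helper could look like (the name, signature, and the diagonal-covariance seeding are illustrative assumptions; the PR's actual code may differ):

```scala
import org.apache.spark.ml.linalg.{DenseVector, Matrices, Vector}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
import org.apache.spark.rdd.RDD

// Hypothetical helper: seed each of the k clusters from a handful of random samples,
// using their per-feature mean and a diagonal covariance of per-feature variances.
def initRandom(
    instances: RDD[Vector],
    k: Int,
    numFeatures: Int,
    seed: Long): Array[MultivariateGaussian] = {
  val samplesPerCluster = 5
  val samples = instances.takeSample(withReplacement = true, k * samplesPerCluster, seed)
  Array.tabulate(k) { c =>
    val slice = samples.slice(c * samplesPerCluster, (c + 1) * samplesPerCluster)
    val mean = new Array[Double](numFeatures)
    val variance = new Array[Double](numFeatures)
    slice.foreach { v =>
      var j = 0
      while (j < numFeatures) { mean(j) += v(j) / samplesPerCluster; j += 1 }
    }
    slice.foreach { v =>
      var j = 0
      while (j < numFeatures) {
        val d = v(j) - mean(j)
        variance(j) += d * d / samplesPerCluster
        j += 1
      }
    }
    new MultivariateGaussian(new DenseVector(mean), Matrices.diag(new DenseVector(variance)))
  }
}
```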

Contributor

use local pointers to avoid calling virtual methods each iteration.

Contributor

use new Array[Double](size) instead of Array.fill(size)(0.0)

Contributor

use local pointers localNewWeights, localNewMeans, localNewCovs

Contributor Author

I thought the number of iterations (equal to the number of clusters) was small enough to ignore the impact, compared with the gradient dimension in LiR/LoR, which is usually as large as millions or billions, but it's better if we can avoid the extra cost.

Contributor

This will allocate an intermediate zipped array. Maybe we can use a while loop and also collect pSum inside it. We should use a localGaussians reference as well.
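A rough sketch of this suggestion as a standalone method (the names, and EPSILON being passed in as epsilon, are illustrative; they are not the PR's actual code):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Compute the weighted densities and their sum in a single while loop, instead of
// allocating the intermediate array produced by weights.zip(gaussians).map(...).
def weightedDensities(
    instance: Vector,
    weights: Array[Double],
    gaussians: Array[MultivariateGaussian],
    epsilon: Double): (Array[Double], Double) = {
  val p = new Array[Double](weights.length)
  var pSum = 0.0
  var i = 0
  while (i < weights.length) {
    p(i) = epsilon + weights(i) * gaussians(i).pdf(instance)
    pSum += p(i)
    i += 1
  }
  (p, pSum)
}
```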

Contributor

let's use while since this is called in several places

Contributor

"Convert an n * (n + 1) / 2 dimension array representing the upper triangular part of a matrix into an n * n array representing the full symmetric matrix". I think that's more explicit about what is happening. Also very minor nit, can we call the array triangularValues instead? triangular sounds like it should be a boolean to me.

Contributor

nit: 25.0. Also, let's call it numFeatures instead of d

Contributor

minor: why not logLikelihood and logLikelihoodPrev? It's nice to have descriptive variable names; then we can remove the comments.

Contributor

We have typically used this documentation as a place to explain the math used to compute updates. It would be nice to have that here as well.
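For reference, the standard EM updates for a K-component Gaussian mixture over points x_1, ..., x_N (textbook material, not text taken from this PR's docs) are: the E-step computes responsibilities

$$\gamma_{ik} = \frac{w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},$$

and the M-step re-estimates the parameters as

$$w_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad \mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_i}{\sum_{i=1}^{N} \gamma_{ik}}, \qquad \Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\mathsf{T}}}{\sum_{i=1}^{N} \gamma_{ik}}.$$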

@jkbradley
Member

As far as keeping the code around, I much prefer either the current approach (separate code) or having spark.mllib call into spark.ml. That will make it easier to deprecate and eventually remove spark.mllib code in 3.0.

I like the upper-triangular matrix packing and unpacking! Could you please add a unit test for it?

@yanboliang
Contributor Author

yanboliang commented Nov 8, 2016

@sethah Yeah, I totally agree we could get around the issues I mentioned and make mllib.GaussianMixture call into ml.GaussianMixture, but I'd still prefer the current approach, based on the experience of moving NaiveBayes. We can discuss this in a future task and let this PR focus on the improved implementation.
@jkbradley I added a test for upper-triangular matrix unpacking.

@SparkQA

SparkQA commented Nov 8, 2016

Test build #68332 has finished for PR 15413 at commit b2c2fa0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68996 has finished for PR 15413 at commit b2c2fa0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

@jkbradley @sethah Any more comments? Thanks.

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70550 has finished for PR 15413 at commit 0bad9e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

I'll take a look, thanks for pinging!

@jkbradley jkbradley left a comment

Thanks for the nice PR! I only found small things to comment on.

Member

Document that the matrix is in column-major order.

Member

Elsewhere too (e.g., in ExpectationAggregator)

Member

style: space after "while"

Member

And in several other places

Member

Why do you need to make these local copies?

Contributor Author

@yanboliang yanboliang Jan 6, 2017

This is because we are actually accessing a getter method when we call this.newWeights. To improve the performance of the loop at L655, we should use explicit pointers to the values rather than calling the getter each time. It's probably not a big deal in this case since k is usually not very big, but I don't think it's a bad idea.
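A minimal, self-contained illustration of the pattern being discussed (the class and field names are made up for the example):

```scala
// The while loop reads a plain local array reference once instead of going
// through the newWeights getter on every iteration.
class AggregatorSketch(k: Int) {
  private val newWeights: Array[Double] = new Array[Double](k)

  def normalize(weightSum: Double): Unit = {
    val localNewWeights = newWeights  // hoist the getter call out of the loop
    var i = 0
    while (i < k) {
      localNewWeights(i) /= weightSum
      i += 1
    }
  }
}
```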

Member

Oh, I see. I'm not sure about this either. Does the JIT compiler adjust enough to make it efficient? The way it is seems fine though.

Contributor

Yeah, I don't think it's necessary here because this is inside the merge operation, which will be called far less often than the add operation. It may be overkill in any event, but I'm also ok leaving it.

Member

Below here, we call logNumFeatures. This isn't part of your PR, but could you move it earlier since numFeatures is available before running the algorithm?

Member

You always use this right away by converting it to a DenseMatrix, so how about just returning a DenseMatrix?

Member

I like the table for documentation though. Does using fewer digits stabilize it?

Member

Set the seed in this and other tests

Member

Simplify: point.toSparse

Member

This and the previous tests are almost the same. How about combining them via a helper function?

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70976 has finished for PR 15413 at commit b6e9d5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

@jkbradley I addressed most of your comments. Thanks.

modelEquals(expected, actual)
}

test("univariate sparse data with two clusters") {
Member

@jkbradley jkbradley Jan 6, 2017

This and the previous tests are almost the same. How about combining them via a helper function? (I see you abstracted out part.)

Member

Actually, don't bother. This looks ready.

Contributor

I'd be in favor of merging them since they are so nearly identical

Contributor Author

Yeah, I went ahead and merged them together. Thanks.

@jkbradley
Member

This LGTM

@sethah Any further comments before we merge it?

@sethah
Contributor

sethah commented Jan 6, 2017

I did a quick pass and it looks pretty good. I'll take a more thorough look at the tests this weekend, but if you want to merge it, I think any of those items could be addressed in follow-ups.

@jkbradley
Member

OK, I'll just wait so @sethah can make a final pass and so @yanboliang can merge the 2 tests.

@SparkQA

SparkQA commented Jan 7, 2017

Test build #71014 has finished for PR 15413 at commit fa5b1ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah sethah left a comment

Just a couple comments on testing. Nice work on the performance improvement :)

}
}

test("check distributed decomposition") {
Contributor

This test only checks that, when we distribute the computation, it produces a model, i.e. that it doesn't fail. So, AFAICT we don't have any test right now that checks that the distributed computation produces a correct model. I think it's a good idea to have that here.

Contributor Author

This is because the model is big and it's tedious to construct the expected model in advance. In this model, gaussians (the array of MultivariateGaussian) contains 5 elements, and each element contains a mean vector of length 50 and a covariance matrix of size 50 * 50.

Contributor

I played with this a bit and wrote a test that generates two well-separated clusters and runs with distributed computation, so we could check that the model learns approximately the correct cluster means. However, I found that the algorithm seems incapable of learning even this very contrived example, due to the initialization method.

I still think it's a good test to have, but if you feel strongly against it then let's leave it. Otherwise it could be a follow-up (along with more investigation into the initialization method, which does not seem to be effective).

Contributor Author

@yanboliang yanboliang Jan 10, 2017

Yeah, I have also suffered from bad initialization in some of my use cases, so I think we should push to resolve SPARK-15785 first. It will be easier to add a correctness test once we support an initial model. I'll leave this as a follow-up and open SPARK-19144 to track it. Thanks.

testEstimatorAndModelReadWrite(gm, dataset,
GaussianMixtureSuite.allParamSettings, checkModelData)
}

Contributor

In most of the other test suites in ML we have a test that checks the prediction/transform methods, for example checking that the prediction always matches the highest probability and that the probabilities sum to one. I don't see much reason to diverge from that pattern here; what do you think, @yanboliang?

Contributor Author

Sounds good, updated.
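A rough sketch of that check as it might appear in the suite (model and dataset are assumed to come from the surrounding test; the column names are the estimator defaults):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// The prediction should be the index of the largest posterior probability,
// and each row's probabilities should sum to one.
model.transform(dataset).select("prediction", "probability").collect().foreach {
  case Row(prediction: Int, probability: Vector) =>
    assert(prediction === probability.argmax)
    assert(math.abs(probability.toArray.sum - 1.0) < 1e-8)
}
```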

def modelEquals(m1: GaussianMixtureModel, m2: GaussianMixtureModel): Unit = {
  assert(m1.weights.length === m2.weights.length)
  for (i <- m1.weights.indices) {
    assert(m1.gaussians(i).mean ~== m2.gaussians(i).mean absTol 1E-3)
Contributor

Why not also check the weights here?

Contributor Author

Oops, forgot it, added. Thanks.
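With the weight check added, modelEquals roughly becomes (tolerances mirror the existing assertion; the covariance comparison is assumed to follow the same pattern):

```scala
def modelEquals(m1: GaussianMixtureModel, m2: GaussianMixtureModel): Unit = {
  assert(m1.weights.length === m2.weights.length)
  for (i <- m1.weights.indices) {
    assert(m1.weights(i) ~== m2.weights(i) absTol 1E-3)
    assert(m1.gaussians(i).mean ~== m2.gaussians(i).mean absTol 1E-3)
    assert(m1.gaussians(i).cov ~== m2.gaussians(i).cov absTol 1E-3)
  }
}
```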

@SparkQA

SparkQA commented Jan 8, 2017

Test build #71034 has finished for PR 15413 at commit 45fa740.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2017

Test build #71035 has finished for PR 15413 at commit cba2906.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented Jan 9, 2017

Left one small comment which isn't a blocker. LGTM otherwise.

@yanboliang
Contributor Author

Merged into master. Thanks for all the reviews.

@asfgit asfgit closed this in 3ef6d98 Jan 10, 2017
@yanboliang yanboliang deleted the spark-17847 branch January 10, 2017 05:41
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017