[SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml #15413
Conversation
Test build #66625 has finished for PR 15413 at commit
Test build #66655 has finished for PR 15413 at commit
Force-pushed from a1e901b to 8b94909
Test build #67414 has finished for PR 15413 at commit
Test build #67448 has finished for PR 15413 at commit
Test build #67509 has finished for PR 15413 at commit
Do you plan to run performance tests to ensure there are no regressions?
@sethah I did some performance tests, and found this change improves performance by 1.5x ~ 2x depending on the dimension and the number of clusters. I will post the test results soon. Thanks.
sethah left a comment
Made a first pass. I haven't looked at the tests yet.
Also, about keeping the mllib code around: it would be really quite simple to get around the issues you mentioned. We can do as in `LogisticRegression` and make a private var `optInitialModel` for now. For `k`, we could make an alternate private `train` constructor which takes `k` as an argument. Still, I'm ok with leaving it for a future PR, but I don't think it's being blocked by those issues. Let me know what you think on this issue.
I think it would be nice to factor this out into an initialization method so we can just call `val gaussians = initRandom(...)` or similar.
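A minimal sketch of what such a factored-out initializer might look like (the name `initRandom` comes from the comment above; the signature and body are assumptions, and covariance initialization is omitted for brevity):

```scala
import org.apache.spark.ml.linalg.{DenseVector, Vector}
import org.apache.spark.rdd.RDD

// Sketch only: sample 2k points, start with uniform weights, and average
// each pair of samples to seed a cluster mean.
def initRandom(
    instances: RDD[Vector],
    k: Int,
    numFeatures: Int,
    seed: Long): (Array[Double], Array[DenseVector]) = {
  val samples = instances.takeSample(withReplacement = true, k * 2, seed)
  val weights = Array.fill(k)(1.0 / k)
  val means = Array.tabulate(k) { i =>
    val mean = new Array[Double](numFeatures)
    var s = 0
    while (s < 2) {
      val v = samples(i * 2 + s)
      var j = 0
      while (j < numFeatures) {
        mean(j) += v(j) / 2.0
        j += 1
      }
      s += 1
    }
    new DenseVector(mean)
  }
  (weights, means)
}
```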
use local pointers to avoid calling virtual methods each iteration.
use `new Array[Double](size)` instead of `Array.fill(size)(0.0)`
use local pointers `localNewWeights`, `localNewMeans`, `localNewCovs`
I thought the number of iterations (equal to the number of clusters) was small enough to ignore the impact, compared with the dimension of the gradient in LiR/LoR, which is usually as large as millions or billions, but it's better if we can avoid the extra cost.
This will allocate an intermediate zipped array. Maybe we can use a while loop and also collect `pSum` inside it. We should use a `localGaussians` reference as well.
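A hedged sketch of that rewrite (the helper name is illustrative; `MultivariateGaussian.pdf` is the ml distribution API):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Sketch: compute the weighted densities and pSum in one while loop,
// avoiding the intermediate array that weights.zip(gaussians) allocates.
def weightedDensities(
    weights: Array[Double],
    gaussians: Array[MultivariateGaussian],
    instance: Vector): (Array[Double], Double) = {
  val localGaussians = gaussians  // local reference, as suggested
  val p = new Array[Double](weights.length)
  var pSum = 0.0
  var i = 0
  while (i < weights.length) {
    p(i) = weights(i) * localGaussians(i).pdf(instance)
    pSum += p(i)
    i += 1
  }
  (p, pSum)
}
```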
let's use `while` since this is called in several places
"Convert an n * (n + 1) / 2 dimension array representing the upper triangular part of a matrix into an n * n array representing the full symmetric matrix". I think that's more explicit about what is happening. Also very minor nit, can we call the array triangularValues instead? triangular sounds like it should be a boolean to me.
nit: `25.0`. Also, let's call it `numFeatures` instead of `d`
minor: why not `logLikelihood` and `logLikelihoodPrev`? It's nice to have descriptive variable names; then we can remove the comments.
We have typically used this documentation as a place to explain the math used to compute updates. It would be nice to have that here as well.
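For reference, the standard EM updates for a Gaussian mixture, which that documentation could spell out (textbook formulas, not copied from the PR):

```math
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}, \qquad
\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad
\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_i}{\sum_{i=1}^{N} \gamma_{ik}}, \qquad
\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^{N} \gamma_{ik}}
```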
As far as keeping the code around, I much prefer either the current approach (separate code) or having spark.mllib call into spark.ml. That will make it easier to deprecate and eventually remove spark.mllib code in 3.0. I like the upper-triangular matrix packing and unpacking! Could you please add a unit test for it?
Force-pushed from 9617076 to b2c2fa0
@sethah Yeah, I totally agree we can get around the issues I mentioned and make
Test build #68332 has finished for PR 15413 at commit
Test build #68996 has finished for PR 15413 at commit
Force-pushed from b2c2fa0 to 0bad9e7
@jkbradley @sethah Any more comments? Thanks.
Test build #70550 has finished for PR 15413 at commit
I'll take a look, thanks for pinging!
jkbradley left a comment
Thanks for the nice PR! I only found small things to comment on.
Document that the matrix is in column-major order.
Elsewhere too (e.g., in `ExpectationAggregator`)
style: space after "while"
And in several other places
Why do you need to make these local copies?
This is because we are actually invoking a getter method when we call `this.newWeights`. To improve the performance of the loop at L655, we should use explicit pointers to the values rather than call the getter each time. It's probably not a big deal in this case since k is usually not very big, but I don't think it's a bad idea.
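A small sketch of the pattern (class and method names are illustrative, not the PR's actual code):

```scala
// The getter is invoked on every access; hoisting its result into a local
// val means the hot loop reads a local slot instead.
class AggregatorSketch(k: Int) {
  private val weights = new Array[Double](k)
  def newWeights: Array[Double] = weights  // getter

  def scale(weightSum: Double): Unit = {
    val localNewWeights = newWeights  // single getter call, hoisted
    var i = 0
    while (i < k) {
      localNewWeights(i) /= weightSum
      i += 1
    }
  }
}
```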
Oh, I see. I'm not sure about this either. Does the JIT compiler adjust enough to make it efficient? The way it is seems fine though.
Yeah, I don't think it's necessary here because this is inside the merge operation, which will be called far less often than the add operation. It may be overkill in any event, but I'm also ok leaving it.
Below here, we call `logNumFeatures`. This isn't part of your PR, but could you move it earlier, since `numFeatures` is available before running the algorithm?
You always use this right away by converting it to a `DenseMatrix`, so how about just returning a `DenseMatrix`?
python/pyspark/ml/clustering.py (outdated)
I like the table for documentation though. Does using fewer digits stabilize it?
Set the seed in this and other tests
Simplify: `point.toSparse`
This and the previous tests are almost the same. How about combining them via a helper function?
Force-pushed from 0bad9e7 to b6e9d5f
Test build #70976 has finished for PR 15413 at commit
@jkbradley I addressed most of your comments. Thanks.
```scala
  modelEquals(expected, actual)
}

test("univariate sparse data with two clusters") {
```
This and the previous tests are almost the same. How about combining them via a helper function? (I see you abstracted out part of it.)
Actually, don't bother. This looks ready.
I'd be in favor of merging them since they are so nearly identical
Yeah, I went ahead and merged them together. Thanks.
This LGTM. @sethah Any further comments before we merge it?
I did a quick pass and it looks pretty good. I'll take a more thorough look at the tests this weekend, but if you want to merge it, I think any of those items could be addressed in follow-ups.
OK, I'll just wait so @sethah can make a final pass and @yanboliang can merge the two tests.
Test build #71014 has finished for PR 15413 at commit
sethah left a comment
Just a couple of comments on testing. Nice work on the performance improvement :)
```scala
  }
}

test("check distributed decomposition") {
```
This test only checks that when we distribute the computation, it produces a model, i.e. that it doesn't fail. So, AFAICT we don't have any test right now that checks that distributing the computation produces a correct model. I think it's a good idea to have that here.
This is because the model is big and it's tedious to construct it in advance. In this model, `gaussians` (the array of `MultivariateGaussian`) contains 5 elements, and each element contains a mean array of length 50 and a 50 x 50 covariance matrix.
I played with this a bit and wrote a test that generates two well-separated clusters and runs with distributed computation, so we could check that the model learns approximately the correct cluster means. However, I found that the algorithm seems incapable of learning even this very contrived example, due to the initialization method.
I still think it's a good test to have, but if you feel strongly against it then let's leave it. Otherwise it could be a follow-up (along with more investigation into the initialization method, which does not seem to be effective).
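For illustration, the kind of test being described might look like this sketch (the dimension threshold, tolerances, and data layout are assumptions; as noted above, random initialization may keep it from passing):

```scala
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors

// Two well-separated clusters in a dimension assumed high enough to take
// the distributed aggregation path; then roughly check the learned means.
val rng = new scala.util.Random(42L)
val numFeatures = 60
val points = Seq.tabulate(200) { i =>
  val center = if (i % 2 == 0) 0.0 else 100.0
  Vectors.dense(Array.fill(numFeatures)(center + rng.nextGaussian()))
}
val df = points.map(Tuple1.apply).toDF("features")  // assumes spark.implicits._ in scope

val model = new GaussianMixture().setK(2).setSeed(1L).fit(df)
val sortedMeans = model.gaussians.map(_.mean(0)).sorted
assert(math.abs(sortedMeans(0) - 0.0) < 1.0)
assert(math.abs(sortedMeans(1) - 100.0) < 1.0)
```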
Yeah, I also suffer from bad initialization in some of my use cases, so I think we should push to resolve SPARK-15785 first. It will be easier to add a correctness test once we support an initial model. I'll leave this as a follow-up and opened SPARK-19144 to track it. Thanks.
```scala
  testEstimatorAndModelReadWrite(gm, dataset,
    GaussianMixtureSuite.allParamSettings, checkModelData)
}
```
In most of the other test suites in ML, we have a test that checks the prediction/transform methods: for example, checking that the prediction always matches the highest probability, and that the probabilities sum to one. I don't see much reason to diverge from that pattern here. What do you think @yanboliang?
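A sketch of such a check (default column names assumed; the tolerance is illustrative):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Every prediction should be the argmax of its probability vector, and
// each probability vector should sum to one.
model.transform(dataset).select("prediction", "probability").collect().foreach {
  case Row(pred: Int, prob: Vector) =>
    assert(math.abs(prob.toArray.sum - 1.0) < 1e-6)
    assert(pred == prob.argmax)
}
```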
Sounds good, updated.
```scala
def modelEquals(m1: GaussianMixtureModel, m2: GaussianMixtureModel): Unit = {
  assert(m1.weights.length === m2.weights.length)
  for (i <- m1.weights.indices) {
    assert(m1.gaussians(i).mean ~== m2.gaussians(i).mean absTol 1E-3)
```
Why not also check the weights here?
Oops, forgot it, added. Thanks.
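The added assertion presumably mirrors the mean check above, using the same `~==` test helper, something like:

```scala
assert(m1.weights(i) ~== m2.weights(i) absTol 1E-3)
```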
Test build #71034 has finished for PR 15413 at commit
Test build #71035 has finished for PR 15413 at commit
Left one small comment which isn't a blocker. LGTM otherwise.
Merged into master. Thanks for all the reviews.
What changes were proposed in this pull request?
Copy the `GaussianMixture` implementation from mllib to ml, then we can add new features to it.

I left mllib `GaussianMixture` untouched, unlike some other algorithms that wrap the ml implementation, for the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting an initial model, but ml does not support this currently. (We will definitely add this feature to ml in the future.)

We can get around these issues to make mllib a wrapper calling into ml, but I'd prefer to leave mllib untouched, which keeps ml clean.
Meanwhile, there is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduced the shuffled data size by about 50% and accelerated job execution.

Before this PR:


After this PR:

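For illustration, packing the upper triangle before the shuffle could look like this sketch (column-major layout assumed; it is the inverse of the unpacking helper sketched earlier in the review):

```scala
// Keep only the n * (n + 1) / 2 upper-triangular values of a column-major
// symmetric n x n matrix, roughly halving the shuffled covariance data.
def packUpperTriangular(n: Int, symmetricValues: Array[Double]): Array[Double] = {
  val packed = new Array[Double](n * (n + 1) / 2)
  var idx = 0
  var j = 0
  while (j < n) {
    var i = 0
    while (i <= j) {
      packed(idx) = symmetricValues(j * n + i)
      idx += 1
      i += 1
    }
    j += 1
  }
  packed
}
```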
How was this patch tested?
Existing tests and added new tests.