[SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features #16661
Conversation
Test build #71735 has finished for PR 16661 at commit

ping @yanboliang

Will review it tomorrow. Thanks.
```scala
object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = 46000
```
shouldn't this be in upper camel case according to scala style?
This is like a private static final field in Java, and when used for constants, CONSTANT_CASE is normal.
In #15413, the symmetry of the covariance matrix is taken into account and only the upper triangular part is stored. So this number seems to be 65535? (math.sqrt(Int.MaxValue.toDouble * 2))
We have to unpack the covariance matrix to a full covariance matrix before returning the model.
Is floor(sqrt(2^31-1)) = 46340 more accurate? or is there overhead that prevents this from being achievable? I know it's a corner case, but if 46000 is a number that's just "about" the real max, let's just use the real max.
+1 @srowen It's better to use the real max.
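As a quick check of the arithmetic (a standalone sketch, not code from the patch): 46340 is indeed the largest `Int` whose square still fits in an `Int`, so it is the "real max" being suggested here.

```scala
object SquareLimit {
  def main(args: Array[String]): Unit = {
    // Largest n such that n * n <= Int.MaxValue (2^31 - 1)
    val realMax = math.sqrt(Int.MaxValue).toInt
    println(realMax) // 46340

    // Verify in Long arithmetic, to avoid the very overflow being guarded against
    assert(realMax.toLong * realMax <= Int.MaxValue)
    assert((realMax + 1).toLong * (realMax + 1) > Int.MaxValue)
  }
}
```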
```scala
private[clustering] object GaussianMixture {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = 46000
```
it looks like the constant can be shared between the two GMM classes - I would recommend using the mllib one for now.
I can see benefits either way, but I think leaving ML GMM to be completely independent of MLlib is slightly preferable.
ok
ultimately long-term the plan is to deprecate mllib so keeping it separate is preferable
```scala
  test("gmm fails on high dimensional data") {
    val ctx = spark.sqlContext
    import ctx.implicits._
```
is there a way to remove this import? I'm not sure why you need it.
Removed.
I left a few minor comments, looks good to me!

LGTM, nice work! Who has the permissions to push the changes?

@imatiach-msft Spark committers must push the changes. As long as at least one committer is aware of the changes there is probably nothing left to do.

Thanks for the review @srowen and @imatiach-msft!

Test build #71880 has finished for PR 16661 at commit
yanboliang left a comment
LGTM except for very minor comments.
```scala
@Since("2.0.0")
object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
```
Nit: Integer.MaxValue is not a standard convention, it should be Int.MaxValue in Scala or Integer.MAX_VALUE in Java.
Test build #71937 has finished for PR 16661 at commit
```scala
private[clustering] object GaussianMixture {

  /** Limit number of features such that numFeatures^2^ < Int.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = math.sqrt(Int.MaxValue).toInt
```
The number is not equal to that used in computeCovariance() in mllib.linalg.distributed.RowMatrix.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L327
Do the limits in mllib.linalg.distributed.RowMatrix need to be updated to this one?
I believe the limiting factor here is that we can't have an array of elements somewhere that has more than 2^31 - 1 elements. For a dense representation of a normal n x n matrix, that limits n to 46340. Here, however, the matrix is a symmetric Gramian matrix that needs n(n+1)/2 elements of storage, so 65535 works.
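The two limits described above can be sketched side by side (a standalone illustration, not code from the PR): dense n x n storage caps n at 46340, while packed upper-triangular storage of a symmetric matrix needs only n(n+1)/2 elements, raising the cap to 65535.

```scala
object StorageLimits {
  def main(args: Array[String]): Unit = {
    val maxElems = Int.MaxValue.toLong // upper bound on a JVM array length

    // Dense n x n matrix: n * n elements
    val denseLimit = math.sqrt(maxElems.toDouble).toInt
    println(s"dense limit:  $denseLimit") // 46340

    // Packed upper-triangular symmetric matrix: n * (n + 1) / 2 elements
    def packedSize(n: Long): Long = n * (n + 1) / 2
    var n = denseLimit.toLong
    while (packedSize(n + 1) <= maxElems) n += 1
    println(s"packed limit: $n") // 65535
  }
}
```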
BTW, it may be nice to add a

Merged into master. Thanks for all.
…eatures
## What changes were proposed in this pull request?
The following test will fail on current master
````scala
test("gmm fails on high dimensional data") {
val ctx = spark.sqlContext
import ctx.implicits._
val df = Seq(
Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
.map(Tuple1.apply).toDF("features")
val gm = new GaussianMixture()
intercept[IllegalArgumentException] {
gm.fit(df)
}
}
````
Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for a high number of features, we should perform an appropriate check to communicate this limitation to users.
This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. For the feature limit, we can cap it at something like `math.sqrt(Integer.MaxValue).toInt` (about 46k) so that we do not get numerical overflow, which eliminates the cryptic error. However, in WLS for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit it to 4096 features. We may want to keep that convention here for consistency.
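A minimal sketch of the kind of guard this describes (the object name, constant name, and message wording here are illustrative, not the exact patch):

```scala
object GmmFeatureGuard {
  // Largest n such that a dense n x n covariance array fits under Int.MaxValue elements
  private val MaxNumFeatures = math.sqrt(Int.MaxValue).toInt // 46340

  def validateNumFeatures(numFeatures: Int): Unit = {
    require(numFeatures < MaxNumFeatures,
      s"GaussianMixture cannot handle more than $MaxNumFeatures features " +
        s"because the covariance matrix is quadratic in the number of features; " +
        s"got $numFeatures")
  }

  def main(args: Array[String]): Unit = {
    validateNumFeatures(100) // passes silently
    try {
      validateNumFeatures(MaxNumFeatures + 1)
    } catch {
      case _: IllegalArgumentException => println("rejected oversized input")
    }
  }
}
```

A failed `require` throws `IllegalArgumentException`, which is what the test in this PR intercepts.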
## How was this patch tested?
Unit tests in ML and MLlib.
Author: sethah <[email protected]>
Closes apache#16661 from sethah/gmm_high_dim.