
Conversation

@sethah (Contributor) commented Jan 20, 2017

What changes were proposed in this pull request?

The following test will fail on current master:

````scala
test("gmm fails on high dimensional data") {
    val ctx = spark.sqlContext
    import ctx.implicits._
    val df = Seq(
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
      .map(Tuple1.apply).toDF("features")
    val gm = new GaussianMixture()
    intercept[IllegalArgumentException] {
      gm.fit(df)
    }
  }
````

Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures` entries, and in this case the product overflows `Int`. While there is currently a warning that the algorithm does not perform well for a high number of features, we should perform an explicit check to communicate this limitation to users.
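The overflow described above is easy to reproduce in isolation (a standalone sketch, not code from the PR):

```scala
// The covariance buffer size numFeatures * numFeatures is computed in Int,
// so it wraps around once numFeatures exceeds 46340.
val numFeatures = 46341
val sizeAsInt  = numFeatures * numFeatures          // overflows: -2147479015
val sizeAsLong = numFeatures.toLong * numFeatures   // intended value: 2147488281
println(s"Int: $sizeAsInt, Long: $sizeAsLong")
```

Allocating an array with the negative `sizeAsInt` is what surfaces as the cryptic runtime exception.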

This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. For the feature limit, we can choose a value that avoids numerical overflow, such as `math.sqrt(Integer.MaxValue).toInt` (about 46k), which eliminates the cryptic error. However, in WLS for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit to 4096 features there. We may want to keep that convention here for consistency.
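The proposed guard might look like the following minimal sketch. The object and constant names come from the PR discussion, but `validateNumFeatures` and the message text are illustrative; in the actual patch the `require(...)` lives directly in the fit/run code paths:

```scala
// Sketch of the proposed feature-count guard; validateNumFeatures is hypothetical.
object GaussianMixtureLimit {
  // Limit number of features such that numFeatures^2 < Int.MaxValue.
  val MAX_NUM_FEATURES: Int = math.sqrt(Int.MaxValue).toInt  // 46340

  def validateNumFeatures(numFeatures: Int): Unit =
    require(numFeatures < MAX_NUM_FEATURES,
      s"GaussianMixture cannot handle more than $MAX_NUM_FEATURES features " +
        s"because the size of the covariance matrix would overflow; got $numFeatures")
}
```

A `require` failure throws `IllegalArgumentException`, which is exactly what the new unit test intercepts.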

How was this patch tested?

Unit tests in ML and MLlib.

@SparkQA commented Jan 20, 2017

Test build #71735 has finished for PR 16661 at commit 2ed94de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor, Author) commented Jan 22, 2017

ping @yanboliang

@yanboliang (Contributor)

Will review it tomorrow. Thanks.

````scala
object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = 46000
````
Contributor
Shouldn't this be in upper camel case according to Scala style?

http://docs.scala-lang.org/style/naming-conventions.html

Member
This is like a `private static final` field in Java, and when used for constants, CONSTANT_CASE is normal.

Contributor

In #15413, the symmetry of the covariance matrix is taken into account and only the upper triangular part is stored. So this number seems to be 65535? (`math.sqrt(Int.MaxValue.toDouble * 2)`)

Contributor (Author)

We have to unpack the covariance matrix to a full covariance matrix before returning the model.

Member

Is floor(sqrt(2^31 - 1)) = 46340 more accurate? Or is there overhead that prevents this from being achievable? I know it's a corner case, but if 46000 is a number that's just "about" the real max, let's just use the real max.
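For reference, both candidate limits discussed in this thread can be derived directly (a quick arithmetic check, not code from the PR):

```scala
// Dense n x n storage needs n * n <= Int.MaxValue array elements.
val denseLimit = math.sqrt(Int.MaxValue).toInt                // 46340
// Packed upper-triangular storage needs n * (n + 1) / 2 <= Int.MaxValue.
val packedLimit = math.sqrt(Int.MaxValue.toDouble * 2).toInt  // 65535
println(s"dense: $denseLimit, packed: $packedLimit")
```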

Contributor

+1 @srowen It's better to use the real max.

````scala
private[clustering] object GaussianMixture {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = 46000
````
@imatiach-msft (Contributor) commented Jan 23, 2017
It looks like the constant can be shared between the two GMM classes; I would recommend using the MLlib one for now.

Contributor (Author)

I can see benefits either way, but I think leaving ML GMM to be completely independent of MLlib is slightly preferable.

Contributor

ok

Contributor

Ultimately, the long-term plan is to deprecate MLlib, so keeping it separate is preferable.


````scala
test("gmm fails on high dimensional data") {
  val ctx = spark.sqlContext
  import ctx.implicits._
````
Contributor

Is there a way to remove this import? I'm not sure why you need it.

Contributor (Author)

Removed.

@imatiach-msft (Contributor)
I left a few minor comments; looks good to me!

@imatiach-msft (Contributor)
LGTM, nice work! Who has the permissions to push the changes?

@sethah (Contributor, Author) commented Jan 23, 2017

@imatiach-msft Spark committers must push the changes. As long as at least one committer is aware of the changes there is probably nothing left to do.

@sethah (Contributor, Author) commented Jan 23, 2017

Thanks for the review @srowen and @imatiach-msft!

@SparkQA commented Jan 23, 2017

Test build #71880 has finished for PR 16661 at commit 51a237b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment

LGTM except for very minor comments.

````scala
@Since("2.0.0")
object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
````
Contributor

Nit: `Integer.MaxValue` is not a standard convention; it should be `Int.MaxValue` in Scala or `Integer.MAX_VALUE` in Java.

@SparkQA commented Jan 24, 2017

Test build #71937 has finished for PR 16661 at commit 5672d13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

````scala
private[clustering] object GaussianMixture {

  /** Limit number of features such that numFeatures^2^ < Int.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = math.sqrt(Int.MaxValue).toInt
````
Contributor

This number is not equal to the one used in `computeCovariance()` in `mllib.linalg.distributed.RowMatrix`:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L327
Do the limits in `mllib.linalg.distributed.RowMatrix` need to be updated to match this one?

Member

I believe the limiting factor here is that we can't have an array anywhere with more than 2^31 - 1 elements. For a dense representation of a normal n x n matrix, that limits n to 46340. Here, however, the matrix is a symmetric Gramian matrix that needs only n(n+1)/2 elements of storage, so 65535 works.

@zhengruifeng (Contributor)
BTW, it may be nice to add a SymmetricMatrix class, since symmetric matrices are widely used in computations of covariance/co-occurrence/etc.
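A minimal sketch of what such a class might look like, assuming packed row-major upper-triangular storage (this is purely hypothetical; nothing like it was added in this PR):

```scala
// Hypothetical SymmetricMatrix sketch: only the upper triangle is stored,
// in a flat array of n * (n + 1) / 2 entries, so the 65535-feature bound
// discussed in this thread applies.
class SymmetricMatrix(val n: Int) {
  require(n > 0 && n.toLong * (n + 1) / 2 <= Int.MaxValue,
    s"packed storage for n = $n would overflow an array index")
  private val values = new Array[Double]((n.toLong * (n + 1) / 2).toInt)

  // Map (i, j) to packed row-major upper-triangular storage, normalizing i <= j.
  // Row r starts at offset r*n - r*(r-1)/2; column c sits at (c - r) within it.
  private def idx(i: Int, j: Int): Int = {
    val (r, c) = if (i <= j) (i, j) else (j, i)
    (r.toLong * n - r.toLong * (r - 1) / 2 + (c - r)).toInt
  }

  def apply(i: Int, j: Int): Double = values(idx(i, j))
  def update(i: Int, j: Int, v: Double): Unit = values(idx(i, j)) = v
}
```

Reading or writing (i, j) and (j, i) touches the same entry, which halves memory relative to a dense n x n array.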

@yanboliang (Contributor)
Merged into master. Thanks, all.

@asfgit asfgit closed this in 0e821ec Jan 25, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Closes apache#16661 from sethah/gmm_high_dim.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017