[SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features #16661
Conversation
Test build #71735 has finished for PR 16661 at commit

ping @yanboliang

Will review it tomorrow. Thanks.
```scala
object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = 46000
```
shouldn't this be in upper camel case according to scala style?
This is like a private static final field in Java, and when used for constants, CONSTANT_CASE is normal.
In #15413, the symmetry of the covariance matrix is taken into account and only the upper triangular part is stored. So this number seems to be 65535? (math.sqrt(Int.MaxValue.toDouble * 2))
We have to unpack the covariance matrix to a full covariance matrix before returning the model.
Is floor(sqrt(2^31-1)) = 46340 more accurate? or is there overhead that prevents this from being achievable? I know it's a corner case, but if 46000 is a number that's just "about" the real max, let's just use the real max.
+1 @srowen It's better to use the real max.
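As a quick check of the arithmetic (a standalone sketch, not code from the patch): 46340 is indeed the largest `Int` whose square still fits in an `Int`, so it is the "real max" being suggested here.

```scala
object SquareLimit {
  def main(args: Array[String]): Unit = {
    // Largest n such that n * n <= Int.MaxValue (2^31 - 1)
    val realMax = math.sqrt(Int.MaxValue).toInt
    println(realMax) // 46340

    // Verify in Long arithmetic, to avoid the very overflow being guarded against
    assert(realMax.toLong * realMax <= Int.MaxValue)
    assert((realMax + 1).toLong * (realMax + 1) > Int.MaxValue)
  }
}
```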
```scala
private[clustering] object GaussianMixture {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = 46000
```
it looks like the constant can be shared between the two GMM classes - I would recommend using the mllib one for now.
I can see benefits either way, but I think leaving ML GMM to be completely independent of MLlib is slightly preferable.
ok
ultimately long-term the plan is to deprecate mllib so keeping it separate is preferable
```scala
  test("gmm fails on high dimensional data") {
    val ctx = spark.sqlContext
    import ctx.implicits._
```
is there a way to remove this import? I'm not sure why you need it.
Removed.
I left a few minor comments, looks good to me!

LGTM, nice work! Who has the permissions to push the changes?

@imatiach-msft Spark committers must push the changes. As long as at least one committer is aware of the changes there is probably nothing left to do.

Thanks for the review @srowen and @imatiach-msft!

Test build #71880 has finished for PR 16661 at commit
yanboliang left a comment
LGTM except for very minor comments.
```scala
@Since("2.0.0")
object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {

  /** Limit number of features such that numFeatures^2^ < Integer.MaxValue */
```
Nit: Integer.MaxValue is not a standard convention, it should be Int.MaxValue in Scala or Integer.MAX_VALUE in Java.
Test build #71937 has finished for PR 16661 at commit
```scala
private[clustering] object GaussianMixture {

  /** Limit number of features such that numFeatures^2^ < Int.MaxValue */
  private[clustering] val MAX_NUM_FEATURES = math.sqrt(Int.MaxValue).toInt
```
The number is not equal to that used in computeCovariance() in mllib.linalg.distributed.RowMatrix.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L327
Do the limits in mllib.linalg.distributed.RowMatrix need to be updated to this one?
I believe the limiting factor here is that we can't have an array of elements somewhere that has more than 2^31 - 1 elements. For a dense representation of a normal n x n matrix, that limits n to 46340. Here, however, the matrix is a symmetric Gramian matrix that needs n(n+1)/2 elements of storage, so 65535 works.
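The two limits described above can be sketched side by side (a standalone illustration, not code from the PR): dense n x n storage caps n at 46340, while packed upper-triangular storage of a symmetric matrix needs only n(n+1)/2 elements, raising the cap to 65535.

```scala
object StorageLimits {
  def main(args: Array[String]): Unit = {
    val maxElems = Int.MaxValue.toLong // upper bound on a JVM array length

    // Dense n x n matrix: n * n elements
    val denseLimit = math.sqrt(maxElems.toDouble).toInt
    println(s"dense limit:  $denseLimit") // 46340

    // Packed upper-triangular symmetric matrix: n * (n + 1) / 2 elements
    def packedSize(n: Long): Long = n * (n + 1) / 2
    var n = denseLimit.toLong
    while (packedSize(n + 1) <= maxElems) n += 1
    println(s"packed limit: $n") // 65535
  }
}
```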
BTW, it may be nice to add a

Merged into master. Thanks for all.
…eatures
## What changes were proposed in this pull request?
The following test will fail on current master
````scala
test("gmm fails on high dimensional data") {
val ctx = spark.sqlContext
import ctx.implicits._
val df = Seq(
Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
.map(Tuple1.apply).toDF("features")
val gm = new GaussianMixture()
intercept[IllegalArgumentException] {
gm.fit(df)
}
}
````
Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for a high number of features, we should perform an appropriate check to communicate this limitation to users.
This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. For the feature limit, we can cap it at something like `math.sqrt(Integer.MaxValue).toInt` (about 46k) so that we do not get numerical overflow, which eliminates the cryptic error. However, in WLS for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit it to 4096 features. We may want to keep that convention here for consistency.
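A minimal sketch of the kind of guard this describes (the object name, constant name, and message wording here are illustrative, not the exact patch):

```scala
object GmmFeatureGuard {
  // Largest n such that a dense n x n covariance array fits under Int.MaxValue elements
  private val MaxNumFeatures = math.sqrt(Int.MaxValue).toInt // 46340

  def validateNumFeatures(numFeatures: Int): Unit = {
    require(numFeatures < MaxNumFeatures,
      s"GaussianMixture cannot handle more than $MaxNumFeatures features " +
        s"because the covariance matrix is quadratic in the number of features; " +
        s"got $numFeatures")
  }

  def main(args: Array[String]): Unit = {
    validateNumFeatures(100) // passes silently
    try {
      validateNumFeatures(MaxNumFeatures + 1)
    } catch {
      case _: IllegalArgumentException => println("rejected oversized input")
    }
  }
}
```

A failed `require` throws `IllegalArgumentException`, which is what the test in this PR intercepts.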
## How was this patch tested?
Unit tests in ML and MLlib.
Author: sethah <[email protected]>
Closes apache#16661 from sethah/gmm_high_dim.