
Commit 83014f2

sethahcmonkey authored and committed
[SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
## What changes were proposed in this pull request?

The following test will fail on current master:

````scala
test("gmm fails on high dimensional data") {
  val ctx = spark.sqlContext
  import ctx.implicits._
  val df = Seq(
    Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
    Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
    .map(Tuple1.apply).toDF("features")
  val gm = new GaussianMixture()
  intercept[IllegalArgumentException] {
    gm.fit(df)
  }
}
````

Instead, you'll get an `ArrayIndexOutOfBoundsException` (or something similar in MLlib). That's because the covariance matrix allocates an array of size `numFeatures * numFeatures`, and in this case that size computation overflows `Int`. While there is currently a warning that the algorithm performs poorly for a high number of features, we should perform an explicit check to communicate this limitation to users.

This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to both the ML and MLlib algorithms. The limit is chosen so that the size computation does not overflow: `math.sqrt(Integer.MaxValue).toInt` (about 46k), which eliminates the cryptic error. Note, however, that in WLS, for example, an array on the order of `numFeatures * numFeatures` must be collected to the driver, so the limit there is 4096 features; we may want to keep that convention here for consistency.

## How was this patch tested?

Unit tests in ML and MLlib.

Author: sethah <[email protected]>

Closes apache#16661 from sethah/gmm_high_dim.
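The overflow the patch guards against can be reproduced with a few lines of JVM arithmetic. Below is a minimal sketch (in Java, whose 32-bit `int` multiplication behaves the same as Scala's `Int` on the JVM; the class name is mine, not from the patch) showing why `MAX_NUM_FEATURES = math.sqrt(Int.MaxValue).toInt` is the largest safe bound:

```java
// Sketch (not part of the patch): why MAX_NUM_FEATURES is floor(sqrt(Int.MaxValue)).
// The covariance matrix needs an array of numFeatures * numFeatures entries,
// and that product is computed in 32-bit arithmetic, so it silently overflows.
public class CovarianceSizeDemo {
    public static void main(String[] args) {
        int maxNumFeatures = (int) Math.sqrt(Integer.MAX_VALUE); // 46340
        // 46340 * 46340 = 2147395600 still fits in an int, so it is a valid array size:
        System.out.println(maxNumFeatures * maxNumFeatures);

        // One more feature and the product wraps around to a negative value;
        // new double[d * d] would then fail with a cryptic runtime exception,
        // which is exactly what the new require(...) check prevents.
        int d = maxNumFeatures + 1;
        System.out.println(d * d < 0); // true: the size computation overflowed
    }
}
```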
1 parent 143ec54 commit 83014f2

File tree

4 files changed: +51 −6 lines


mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala

Lines changed: 11 additions & 3 deletions
````diff
@@ -278,7 +278,9 @@ object GaussianMixtureModel extends MLReadable[GaussianMixtureModel] {
  * While this process is generally guaranteed to converge, it is not guaranteed
  * to find a global optimum.
  *
- * @note For high-dimensional data (with many features), this algorithm may perform poorly.
+ * @note This algorithm is limited in its number of features since it requires storing a covariance
+ * matrix which has size quadratic in the number of features. Even when the number of features does
+ * not exceed this limit, this algorithm may perform poorly on high-dimensional data.
  * This is due to high-dimensional data (a) making it difficult to cluster at all (based
  * on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
  */
@@ -344,6 +346,9 @@ class GaussianMixture @Since("2.0.0") (

     // Extract the number of features.
     val numFeatures = instances.first().size
+    require(numFeatures < GaussianMixture.MAX_NUM_FEATURES, s"GaussianMixture cannot handle more " +
+      s"than ${GaussianMixture.MAX_NUM_FEATURES} features because the size of the covariance" +
+      s" matrix is quadratic in the number of features.")

     val instr = Instrumentation.create(this, instances)
     instr.logParams(featuresCol, predictionCol, probabilityCol, k, maxIter, seed, tol)
@@ -391,8 +396,8 @@ class GaussianMixture @Since("2.0.0") (
       val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, cov, weight) =>
         GaussianMixture.updateWeightsAndGaussians(mean, cov, weight, sumWeights)
       }.collect().unzip
-      Array.copy(ws.toArray, 0, weights, 0, ws.length)
-      Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
+      Array.copy(ws, 0, weights, 0, ws.length)
+      Array.copy(gs, 0, gaussians, 0, gs.length)
     } else {
       var i = 0
       while (i < numClusters) {
@@ -486,6 +491,9 @@ class GaussianMixture @Since("2.0.0") (
 @Since("2.0.0")
 object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {

+  /** Limit number of features such that numFeatures^2^ < Int.MaxValue */
+  private[clustering] val MAX_NUM_FEATURES = math.sqrt(Int.MaxValue).toInt
+
   @Since("2.0.0")
   override def load(path: String): GaussianMixture = super.load(path)

````

mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala

Lines changed: 12 additions & 3 deletions
````diff
@@ -46,7 +46,9 @@ import org.apache.spark.util.Utils
  * is considered to have occurred.
  * @param maxIterations Maximum number of iterations allowed.
  *
- * @note For high-dimensional data (with many features), this algorithm may perform poorly.
+ * @note This algorithm is limited in its number of features since it requires storing a covariance
+ * matrix which has size quadratic in the number of features. Even when the number of features does
+ * not exceed this limit, this algorithm may perform poorly on high-dimensional data.
  * This is due to high-dimensional data (a) making it difficult to cluster at all (based
  * on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
  */
@@ -170,6 +172,9 @@ class GaussianMixture private (

     // Get length of the input vectors
     val d = breezeData.first().length
+    require(d < GaussianMixture.MAX_NUM_FEATURES, s"GaussianMixture cannot handle more " +
+      s"than ${GaussianMixture.MAX_NUM_FEATURES} features because the size of the covariance" +
+      s" matrix is quadratic in the number of features.")

     val shouldDistributeGaussians = GaussianMixture.shouldDistributeGaussians(k, d)

@@ -211,8 +216,8 @@ class GaussianMixture private (
       val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
         updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
       }.collect().unzip
-      Array.copy(ws.toArray, 0, weights, 0, ws.length)
-      Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
+      Array.copy(ws, 0, weights, 0, ws.length)
+      Array.copy(gs, 0, gaussians, 0, gs.length)
     } else {
       var i = 0
       while (i < k) {
@@ -272,6 +277,10 @@ class GaussianMixture private (
 }

 private[clustering] object GaussianMixture {
+
+  /** Limit number of features such that numFeatures^2^ < Int.MaxValue */
+  private[clustering] val MAX_NUM_FEATURES = math.sqrt(Int.MaxValue).toInt
+
   /**
    * Heuristic to distribute the computation of the `MultivariateGaussian`s, approximately when
    * d is greater than 25 except for when k is very small.
````

mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala

Lines changed: 14 additions & 0 deletions
````diff
@@ -53,6 +53,20 @@ class GaussianMixtureSuite extends SparkFunSuite with MLlibTestSparkContext
     rDataset = rData.map(FeatureData).toDF()
   }

+  test("gmm fails on high dimensional data") {
+    val df = Seq(
+      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
+      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
+      .map(Tuple1.apply).toDF("features")
+    val gm = new GaussianMixture()
+    withClue(s"GMM should restrict the maximum number of features to be < " +
+      s"${GaussianMixture.MAX_NUM_FEATURES}") {
+      intercept[IllegalArgumentException] {
+        gm.fit(df)
+      }
+    }
+  }
+
   test("default parameters") {
     val gm = new GaussianMixture()
````

mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala

Lines changed: 14 additions & 0 deletions
````diff
@@ -25,6 +25,20 @@ import org.apache.spark.mllib.util.TestingUtils._
 import org.apache.spark.util.Utils

 class GaussianMixtureSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  test("gmm fails on high dimensional data") {
+    val rdd = sc.parallelize(Seq(
+      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
+      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0))))
+    val gm = new GaussianMixture()
+    withClue(s"GMM should restrict the maximum number of features to be < " +
+      s"${GaussianMixture.MAX_NUM_FEATURES}") {
+      intercept[IllegalArgumentException] {
+        gm.run(rdd)
+      }
+    }
+  }
+
   test("single cluster") {
     val data = sc.parallelize(Array(
       Vectors.dense(6.0, 9.0),
````
