Commit 83014f2
[SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
## What changes were proposed in this pull request?
The following test will fail on current master
````scala
test("gmm fails on high dimensional data") {
val ctx = spark.sqlContext
import ctx.implicits._
val df = Seq(
Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
.map(Tuple1.apply).toDF("features")
val gm = new GaussianMixture()
intercept[IllegalArgumentException] {
gm.fit(df)
}
}
````
Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for high number of features, we should perform an appropriate check to communicate this limitation to users.
This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to ML and MLlib algorithms. For the feature limitation, we can limit it such that we do not get numerical overflow to something like `math.sqrt(Integer.MaxValue).toInt` (about 46k) which eliminates the cryptic error. However in, for example WLS, we need to collect an array on the order of `numFeatures * numFeatures` to the driver and we therefore limit to 4096 features. We may want to keep that convention here for consistency.
## How was this patch tested?
Unit tests in ML and MLlib.
Author: sethah <[email protected]>
Closes apache#16661 from sethah/gmm_high_dim.1 parent 143ec54 commit 83014f2
File tree
4 files changed
+51
-6
lines changed- mllib/src
- main/scala/org/apache/spark
- mllib/clustering
- ml/clustering
- test/scala/org/apache/spark
- mllib/clustering
- ml/clustering
4 files changed
+51
-6
lines changedLines changed: 11 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
278 | 278 | | |
279 | 279 | | |
280 | 280 | | |
281 | | - | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
282 | 284 | | |
283 | 285 | | |
284 | 286 | | |
| |||
344 | 346 | | |
345 | 347 | | |
346 | 348 | | |
| 349 | + | |
| 350 | + | |
| 351 | + | |
347 | 352 | | |
348 | 353 | | |
349 | 354 | | |
| |||
391 | 396 | | |
392 | 397 | | |
393 | 398 | | |
394 | | - | |
395 | | - | |
| 399 | + | |
| 400 | + | |
396 | 401 | | |
397 | 402 | | |
398 | 403 | | |
| |||
486 | 491 | | |
487 | 492 | | |
488 | 493 | | |
| 494 | + | |
| 495 | + | |
| 496 | + | |
489 | 497 | | |
490 | 498 | | |
491 | 499 | | |
| |||
Lines changed: 12 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
46 | 46 | | |
47 | 47 | | |
48 | 48 | | |
49 | | - | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
50 | 52 | | |
51 | 53 | | |
52 | 54 | | |
| |||
170 | 172 | | |
171 | 173 | | |
172 | 174 | | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
173 | 178 | | |
174 | 179 | | |
175 | 180 | | |
| |||
211 | 216 | | |
212 | 217 | | |
213 | 218 | | |
214 | | - | |
215 | | - | |
| 219 | + | |
| 220 | + | |
216 | 221 | | |
217 | 222 | | |
218 | 223 | | |
| |||
272 | 277 | | |
273 | 278 | | |
274 | 279 | | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
275 | 284 | | |
276 | 285 | | |
277 | 286 | | |
| |||
Lines changed: 14 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
53 | 53 | | |
54 | 54 | | |
55 | 55 | | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
56 | 70 | | |
57 | 71 | | |
58 | 72 | | |
| |||
Lines changed: 14 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
25 | 25 | | |
26 | 26 | | |
27 | 27 | | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
28 | 42 | | |
29 | 43 | | |
30 | 44 | | |
| |||
0 commit comments