
Commit db34690

[SPARK-5599] Check MLlib public APIs for 1.3
There are no breaking changes (against 1.2) in this PR. I hid `PythonMLLibAPI`, which is only called by Py4J, and renamed `SparseMatrix.diag` to `SparseMatrix.spdiag`. All other changes are documentation and annotations. The `Experimental` tag is removed from `ALS.setAlpha` and `Rating`. One issue not addressed in this PR is `setCheckpointDir` in `LDA` (https://issues.apache.org/jira/browse/SPARK-5604).

CC: srowen jkbradley

Author: Xiangrui Meng <[email protected]>

Closes apache#4377 from mengxr/SPARK-5599 and squashes the following commits:

17975dc [Xiangrui Meng] fix tests
4487f20 [Xiangrui Meng] remove experimental tag from each stat method because Statistics is experimental already
3cd969a [Xiangrui Meng] remove freeman (sorry~) from StreamLA public doc
55900f5 [Xiangrui Meng] make IR experimental and update its doc
9b8eed3 [Xiangrui Meng] graduate Rating and setAlpha in ALS
b854d28 [Xiangrui Meng] correct iid doc in RandomRDDs
27f5bdd [Xiangrui Meng] update linalg docs and some new method signatures
371721b [Xiangrui Meng] mark fpg as experimental and update its doc
8aca7ee [Xiangrui Meng] change SLR to experimental and update the doc
ebbb2e9 [Xiangrui Meng] mark PIC experimental and update the doc
7830d3b [Xiangrui Meng] mark GMM experimental
a378496 [Xiangrui Meng] use the correct subscript syntax in PIC
c65c424 [Xiangrui Meng] update LDAModel doc
a213b0c [Xiangrui Meng] update GMM constructor
3993054 [Xiangrui Meng] hide algorithm in SLR
ad6b9ce [Xiangrui Meng] Revert "make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD"
0054684 [Xiangrui Meng] add doc to LRModel's constructor
a89763b [Xiangrui Meng] make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD
7c0946c [Xiangrui Meng] hide PythonMLLibAPI
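For readers tracking the one rename mentioned above, here is a minimal sketch (not part of this commit's diff) of how the renamed factory would be called; it assumes `spdiag` keeps the old `diag(vector: Vector)` signature on the `SparseMatrix` companion object:

import org.apache.spark.mllib.linalg.{SparseMatrix, Vectors}

// Before this PR: SparseMatrix.diag(v); after this PR: SparseMatrix.spdiag(v).
// Builds a square sparse matrix with v on its diagonal.
val v = Vectors.dense(1.0, 0.0, 3.0)
val m = SparseMatrix.spdiag(v)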
1 parent 975bcef commit db34690

File tree: 19 files changed, +160 additions, −119 deletions

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

Lines changed: 2 additions & 4 deletions
@@ -54,11 +54,9 @@ import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.Utils
 
 /**
- * :: DeveloperApi ::
- * The Java stubs necessary for the Python mllib bindings.
+ * The Java stubs necessary for the Python mllib bindings. It is called by Py4J on the Python side.
  */
-@DeveloperApi
-class PythonMLLibAPI extends Serializable {
+private[python] class PythonMLLibAPI extends Serializable {
 
 
   /**
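A small illustrative sketch (not from this commit) of what the `private[python]` qualifier above does: the class stays visible inside the `org.apache.spark.mllib.api.python` package, where Py4J instantiates it, but drops out of the public Scala/Java API.

package org.apache.spark.mllib.api.python

// Accessible anywhere under the `python` package, invisible to user code outside it.
private[python] class PythonMLLibAPI extends Serializable {
  // Py4J calls into this class from the Python side, so hiding it from
  // Scala/Java callers does not affect the PySpark MLlib bindings.
}

// From user code outside org.apache.spark.mllib.api.python:
// new PythonMLLibAPI()   // no longer compiles after this change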

mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

Lines changed: 3 additions & 0 deletions
@@ -62,6 +62,9 @@ class LogisticRegressionModel (
       s" but was given weights of length ${weights.size}")
   }
 
+  /**
+   * Constructs a [[LogisticRegressionModel]] with weights and intercept for binary classification.
+   */
   def this(weights: Vector, intercept: Double) = this(weights, intercept, weights.size, 2)
 
   private var threshold: Option[Double] = Some(0.5)
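A hedged sketch of the auxiliary constructor documented above (the weights and inputs are made up; `predict` on a single `Vector` is assumed to be available on the model):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Binary classification: numFeatures comes from weights.size, numClasses defaults to 2.
val model = new LogisticRegressionModel(Vectors.dense(0.5, -0.3), 0.1)
val label = model.predict(Vectors.dense(1.0, 2.0)) // 0.0 or 1.0 under the default 0.5 threshold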

mllib/src/main/scala/org/apache/spark/mllib/classification/StreamingLogisticRegressionWithSGD.scala

Lines changed: 3 additions & 2 deletions
@@ -35,12 +35,13 @@ import org.apache.spark.mllib.regression.StreamingLinearAlgorithm
  * Use a builder pattern to construct a streaming logistic regression
  * analysis in an application, like:
  *
+ * {{{
  * val model = new StreamingLogisticRegressionWithSGD()
  *   .setStepSize(0.5)
  *   .setNumIterations(10)
  *   .setInitialWeights(Vectors.dense(...))
  *   .trainOn(DStream)
- *
+ * }}}
  */
 @Experimental
 class StreamingLogisticRegressionWithSGD private[mllib] (
@@ -59,7 +60,7 @@ class StreamingLogisticRegressionWithSGD private[mllib] (
    */
   def this() = this(0.1, 50, 1.0, 0.0)
 
-  val algorithm = new LogisticRegressionWithSGD(
+  protected val algorithm = new LogisticRegressionWithSGD(
     stepSize, numIterations, regParam, miniBatchFraction)
 
   /** Set the step size for gradient descent. Default: 0.1. */
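The Scaladoc example added above stops at `trainOn`; a hedged sketch of the full train/predict loop, assuming `trainingStream: DStream[LabeledPoint]` and `testStream: DStream[Vector]` already exist:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

val model = new StreamingLogisticRegressionWithSGD()
  .setStepSize(0.5)
  .setNumIterations(10)
  .setInitialWeights(Vectors.dense(0.0, 0.0)) // dimension must match the feature vectors
model.trainOn(trainingStream)                 // updates the model on every mini-batch
model.predictOn(testStream).print()           // emits a Double label per test vector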

mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala

Lines changed: 10 additions & 3 deletions
@@ -19,15 +19,18 @@ package org.apache.spark.mllib.clustering
 
 import scala.collection.mutable.IndexedSeq
 
-import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix, diag, Transpose}
+import breeze.linalg.{DenseMatrix => BreezeMatrix, DenseVector => BreezeVector, Transpose, diag}
 
-import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors, DenseVector, DenseMatrix, BLAS}
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, Matrices, Vector, Vectors}
 import org.apache.spark.mllib.stat.distribution.MultivariateGaussian
 import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
 import org.apache.spark.util.Utils
 
 /**
+ * :: Experimental ::
+ *
  * This class performs expectation maximization for multivariate Gaussian
  * Mixture Models (GMMs). A GMM represents a composite distribution of
  * independent Gaussian distributions with associated "mixing" weights
@@ -44,13 +47,17 @@ import org.apache.spark.util.Utils
  * is considered to have occurred.
  * @param maxIterations The maximum number of iterations to perform
  */
+@Experimental
 class GaussianMixture private (
     private var k: Int,
     private var convergenceTol: Double,
     private var maxIterations: Int,
     private var seed: Long) extends Serializable {
 
-  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
+  /**
+   * Constructs a default instance. The default parameters are {k: 2, convergenceTol: 0.01,
+   * maxIterations: 100, seed: random}.
+   */
   def this() = this(2, 0.01, 100, Utils.random.nextLong())
 
   // number of samples per cluster to use when initializing Gaussians
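A hedged training sketch for the now-`@Experimental` estimator, using the default constructor documented above (the data points are made up and `sc` is assumed to be an existing SparkContext):

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Defaults: k = 2, convergenceTol = 0.01, maxIterations = 100, random seed.
val points = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
  Vectors.dense(9.0, 8.5), Vectors.dense(8.7, 9.1)))
val gmm = new GaussianMixture().setK(2).run(points)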

mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala

Lines changed: 5 additions & 1 deletion
@@ -19,12 +19,15 @@ package org.apache.spark.mllib.clustering
 
 import breeze.linalg.{DenseVector => BreezeVector}
 
-import org.apache.spark.rdd.RDD
+import org.apache.spark.annotation.Experimental
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.stat.distribution.MultivariateGaussian
 import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
 
 /**
+ * :: Experimental ::
+ *
  * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points
  * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are
  * the respective mean and covariance for each Gaussian distribution i=1..k.
@@ -35,6 +38,7 @@ import org.apache.spark.mllib.util.MLUtils
  * @param sigma Covariance maxtrix for each Gaussian in the mixture, where sigma(i) is the
  *              covariance matrix for Gaussian i
  */
+@Experimental
 class GaussianMixtureModel(
   val weights: Array[Double],
   val gaussians: Array[MultivariateGaussian]) extends Serializable {
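Continuing the sketch above, the resulting model exposes the `weights` and `gaussians` shown in this constructor, plus prediction methods; `predict` and `predictSoft` over an RDD are assumed here:

// `gmm` and `points` come from the GaussianMixture sketch above.
val hard = gmm.predict(points)      // RDD[Int]: index of the most likely Gaussian per point
val soft = gmm.predictSoft(points)  // RDD[Array[Double]]: per-Gaussian membership probabilities
gmm.gaussians.zip(gmm.weights).foreach { case (g, w) =>
  println(s"weight=$w, mu=${g.mu}, sigma=${g.sigma}")
}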

mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala

Lines changed: 1 addition & 1 deletion
@@ -335,7 +335,7 @@ class DistributedLDAModel private (
 
   /**
    * For each document in the training set, return the distribution over topics for that document
-   * (i.e., "theta_doc").
+   * ("theta_doc").
    *
    * @return RDD of (document ID, topic distribution) pairs
    */
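For context on the doc line tweaked above, a minimal consumption sketch, assuming a trained `DistributedLDAModel` named `ldaModel`:

// Each element pairs a document ID with its topic mixture ("theta_doc").
val topicDist = ldaModel.topicDistributions // RDD[(Long, Vector)]
topicDist.take(3).foreach { case (docId, theta) =>
  println(s"doc $docId -> $theta")
}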

mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala

Lines changed: 18 additions & 8 deletions
@@ -18,6 +18,7 @@
 package org.apache.spark.mllib.clustering
 
 import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Experimental
 import org.apache.spark.graphx._
 import org.apache.spark.graphx.impl.GraphImpl
 import org.apache.spark.mllib.linalg.Vectors
@@ -26,25 +27,33 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.util.random.XORShiftRandom
 
 /**
+ * :: Experimental ::
+ *
  * Model produced by [[PowerIterationClustering]].
  *
  * @param k number of clusters
  * @param assignments an RDD of (vertexID, clusterID) pairs
  */
+@Experimental
 class PowerIterationClusteringModel(
     val k: Int,
     val assignments: RDD[(Long, Int)]) extends Serializable
 
 /**
- * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and
- * Cohen (see http://www.icml2010.org/papers/387.pdf). From the abstract: PIC finds a very
+ * :: Experimental ::
+ *
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+ * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the abstract: PIC finds a very
  * low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise
  * similarity matrix of the data.
  *
  * @param k Number of clusters.
 * @param maxIterations Maximum number of iterations of the PIC algorithm.
 * @param initMode Initialization mode.
+ *
+ * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral clustering (Wikipedia)]]
  */
+@Experimental
 class PowerIterationClustering private[clustering] (
     private var k: Int,
     private var maxIterations: Int,
@@ -88,11 +97,12 @@ class PowerIterationClustering private[clustering] (
   /**
    * Run the PIC algorithm.
    *
-   * @param similarities an RDD of (i, j, s_ij_) tuples representing the affinity matrix, which is
-   *                     the matrix A in the PIC paper. The similarity s_ij_ must be nonnegative.
-   *                     This is a symmetric matrix and hence s_ij_ = s_ji_. For any (i, j) with
-   *                     nonzero similarity, there should be either (i, j, s_ij_) or (j, i, s_ji_)
-   *                     in the input. Tuples with i = j are ignored, because we assume s_ij_ = 0.0.
+   * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the affinity matrix, which is
+   *                     the matrix A in the PIC paper. The similarity s,,ij,, must be nonnegative.
+   *                     This is a symmetric matrix and hence s,,ij,, = s,,ji,,. For any (i, j) with
+   *                     nonzero similarity, there should be either (i, j, s,,ij,,) or
+   *                     (j, i, s,,ji,,) in the input. Tuples with i = j are ignored, because we
+   *                     assume s,,ij,, = 0.0.
    *
    * @return a [[PowerIterationClusteringModel]] that contains the clustering result
    */
@@ -109,7 +119,7 @@ class PowerIterationClustering private[clustering] (
    * Runs the PIC algorithm.
    *
    * @param w The normalized affinity matrix, which is the matrix W in the PIC paper with
-   *          w_ij_ = a_ij_ / d_ii_ as its edge properties and the initial vector of the power
+   *          w,,ij,, = a,,ij,, / d,,ii,, as its edge properties and the initial vector of the power
   *          iteration as its vertex properties.
    */
   private def pic(w: Graph[Double, Double]): PowerIterationClusteringModel = {
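A hedged end-to-end sketch of the public `run` documented above (the vertex IDs and similarities are made up; `sc` is an existing SparkContext):

import org.apache.spark.mllib.clustering.PowerIterationClustering

// Symmetric affinities: provide either (i, j, s_ij) or (j, i, s_ji); i = j tuples are ignored.
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)))
val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(20)
  .run(similarities)
model.assignments.collect().foreach { case (vertexId, clusterId) =>
  println(s"$vertexId -> $clusterId")
}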

mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala

Lines changed: 11 additions & 5 deletions
@@ -21,15 +21,16 @@ import scala.reflect.ClassTag
 
 import org.apache.spark.Logging
 import org.apache.spark.SparkContext._
-import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.annotation.{Experimental, DeveloperApi}
 import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.streaming.dstream.DStream
 import org.apache.spark.util.Utils
 import org.apache.spark.util.random.XORShiftRandom
 
 /**
- * :: DeveloperApi ::
+ * :: Experimental ::
+ *
  * StreamingKMeansModel extends MLlib's KMeansModel for streaming
  * algorithms, so it can keep track of a continuously updated weight
  * associated with each cluster, and also update the model by
@@ -39,8 +40,10 @@ import org.apache.spark.util.random.XORShiftRandom
  * generalized to incorporate forgetfullness (i.e. decay).
  * The update rule (for each cluster) is:
  *
+ * {{{
  * c_t+1 = [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t]
  * n_t+t = n_t * a + m_t
+ * }}}
  *
  * Where c_t is the previously estimated centroid for that cluster,
 * n_t is the number of points assigned to it thus far, x_t is the centroid
@@ -61,7 +64,7 @@ import org.apache.spark.util.random.XORShiftRandom
  * as batches or points.
  *
  */
-@DeveloperApi
+@Experimental
 class StreamingKMeansModel(
     override val clusterCenters: Array[Vector],
     val clusterWeights: Array[Double]) extends KMeansModel(clusterCenters) with Logging {
@@ -140,7 +143,8 @@ class StreamingKMeansModel(
 }
 
 /**
- * :: DeveloperApi ::
+ * :: Experimental ::
+ *
  * StreamingKMeans provides methods for configuring a
  * streaming k-means analysis, training the model on streaming,
  * and using the model to make predictions on streaming data.
@@ -149,13 +153,15 @@ class StreamingKMeansModel(
  * Use a builder pattern to construct a streaming k-means analysis
  * in an application, like:
  *
+ * {{{
  * val model = new StreamingKMeans()
  *   .setDecayFactor(0.5)
  *   .setK(3)
  *   .setRandomCenters(5, 100.0)
  *   .trainOn(DStream)
+ * }}}
  */
-@DeveloperApi
+@Experimental
 class StreamingKMeans(
     var k: Int,
     var decayFactor: Double,
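The update rule wrapped in `{{{...}}}` above, transcribed literally into a tiny standalone helper (plain Scala, no Spark) to make the arithmetic concrete; this mirrors the doc's formula rather than the internal implementation:

// c: previous centroid, n: its weight, x: centroid of the new batch of points,
// m: number of points in the batch, a: decay factor.
def updateCluster(c: Array[Double], n: Double, x: Array[Double], m: Double, a: Double)
  : (Array[Double], Double) = {
  // c_t+1 = [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t]
  val center = c.zip(x).map { case (ci, xi) => (ci * n * a + xi * m) / (n + m) }
  // n_t+1 = n_t * a + m_t
  val weight = n * a + m
  (center, weight)
}

// With a = 1.0 (no decay), updateCluster(Array(1.0, 1.0), 10.0, Array(2.0, 2.0), 5.0, 1.0)
// moves the centroid a third of the way toward the batch centroid: (1.333..., 1.333...).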

mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala

Lines changed: 19 additions & 8 deletions
@@ -25,16 +25,20 @@ import scala.collection.JavaConverters._
 import scala.reflect.ClassTag
 
 import org.apache.spark.{HashPartitioner, Logging, Partitioner, SparkException}
+import org.apache.spark.annotation.Experimental
 import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
 import org.apache.spark.api.java.JavaSparkContext.fakeClassTag
 import org.apache.spark.rdd.RDD
 import org.apache.spark.storage.StorageLevel
 
 /**
+ * :: Experimental ::
+ *
  * Model trained by [[FPGrowth]], which holds frequent itemsets.
  * @param freqItemsets frequent itemset, which is an RDD of (itemset, frequency) pairs
  * @tparam Item item type
  */
+@Experimental
 class FPGrowthModel[Item: ClassTag](
     val freqItemsets: RDD[(Array[Item], Long)]) extends Serializable {
 
@@ -45,28 +49,35 @@ class FPGrowthModel[Item: ClassTag](
 }
 
 /**
- * This class implements Parallel FP-growth algorithm to do frequent pattern matching on input data.
- * Parallel FPGrowth (PFP) partitions computation in such a way that each machine executes an
- * independent group of mining tasks. More detail of this algorithm can be found at
- * [[http://dx.doi.org/10.1145/1454008.1454027, PFP]], and the original FP-growth paper can be
- * found at [[http://dx.doi.org/10.1145/335191.335372, FP-growth]]
+ * :: Experimental ::
+ *
+ * A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in
+ * [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query
+ * Recommendation]]. PFP distributes computation in such a way that each worker executes an
+ * independent group of mining tasks. The FP-Growth algorithm is described in
+ * [[http://dx.doi.org/10.1145/335191.335372 Han et al., Mining frequent patterns without candidate
+ * generation]].
 *
 * @param minSupport the minimal support level of the frequent pattern, any pattern appears
 *                   more than (minSupport * size-of-the-dataset) times will be output
 * @param numPartitions number of partitions used by parallel FP-growth
+ *
+ * @see [[http://en.wikipedia.org/wiki/Association_rule_learning Association rule learning
+ *      (Wikipedia)]]
 */
+@Experimental
 class FPGrowth private (
     private var minSupport: Double,
     private var numPartitions: Int) extends Logging with Serializable {
 
   /**
-   * Constructs a FPGrowth instance with default parameters:
-   * {minSupport: 0.3, numPartitions: auto}
+   * Constructs a default instance with default parameters {minSupport: `0.3`, numPartitions: same
+   * as the input data}.
    */
   def this() = this(0.3, -1)
 
   /**
-   * Sets the minimal support level (default: 0.3).
+   * Sets the minimal support level (default: `0.3`).
    */
   def setMinSupport(minSupport: Double): this.type = {
     this.minSupport = minSupport
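A hedged usage sketch for the newly documented API above (the transactions are made up; `sc` is an existing SparkContext):

import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "c"),
  Array("a", "c")))

// minSupport 0.5 keeps itemsets that appear in at least half of the transactions;
// numPartitions is left at its default (same as the input data).
val model = new FPGrowth().setMinSupport(0.5).run(transactions)
model.freqItemsets.collect().foreach { case (itemset, freq) =>
  println(itemset.mkString("{", ",", "}") + s": $freq")
}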
