---
layout: global
title: Feature Extraction and Transformation - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction and Transformation
---

* Table of contents
{:toc}
for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
{% endhighlight %}
</div>
</div>

## StandardScaler

Standardizes features by scaling to unit variance and/or removing the mean using column summary
statistics on the samples in the training set. This is a very common pre-processing step.
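
Concretely, when both centering and scaling are enabled, each value $x$ of a feature column is
mapped to $x' = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation
of that column over the training set; with scaling alone, the mean is not subtracted.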

For example, the RBF kernel of Support Vector Machines and L1- and L2-regularized linear models
typically work better when all features have unit variance and/or zero mean.

Standardization can improve the convergence rate during the optimization process, and it also
prevents features with very large variances from exerting an overly large influence during model
training.

### Model Fitting

[`StandardScaler`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) has the
following parameters in the constructor:

* `withMean` False by default. Centers the data by subtracting the mean before scaling. It builds a
dense output, so this does not work on sparse input and will raise an exception.
* `withStd` True by default. Scales the data to unit variance.

We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) method in
`StandardScaler` which takes an input of `RDD[Vector]`, learns the summary statistics, and then
returns a model which can transform the input dataset into unit-variance and/or zero-mean features,
depending on how we configured the `StandardScaler`.

This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer),
which can apply the standardization to a `Vector` to produce a transformed `Vector`, or to
an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

Note that if the variance of a feature is zero, the transformation returns the default value of
`0.0` in the `Vector` for that feature.
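
As a minimal sketch of this behavior (the tiny in-memory training set below is made up for
illustration, and `sc` is assumed to be an existing `SparkContext`):

{% highlight scala %}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// Fit summary statistics on a small, made-up training set.
val training = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.5, -1.0),
  Vectors.dense(2.0, 1.0, 1.0),
  Vectors.dense(4.0, 10.0, 2.0)))
val scalerModel = new StandardScaler(withMean = true, withStd = true).fit(training)

// The fitted model transforms a single Vector...
val scaledVector = scalerModel.transform(Vectors.dense(2.0, 1.0, 1.0))
// ...or a whole RDD[Vector].
val scaledData = scalerModel.transform(training)
{% endhighlight %}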

### Example

The example below demonstrates how to load a dataset in libsvm format and standardize the features
so that the new features have unit variance and/or zero mean.

<div class="codetabs">
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val scaler1 = new StandardScaler().fit(data.map(x => x.features))
val scaler2 = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => x.features))

// data1 will have unit variance.
val data1 = data.map(x => (x.label, scaler1.transform(x.features)))

// Without converting the features into dense vectors, transformation with zero mean will raise an
// exception on sparse vectors.
// data2 will have unit variance and zero mean.
val data2 = data.map(x => (x.label, scaler2.transform(Vectors.dense(x.features.toArray))))
{% endhighlight %}
</div>
</div>
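
Note that `scaler1` uses the default `withMean = false`, so `data1` is scaled to unit variance but
is not centered; only `data2` also has zero-mean features.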

## Normalizer

Normalizer scales individual samples to have unit $L^p$ norm. This is a common operation for text
classification or clustering. For example, the dot product of two $L^2$-normalized TF-IDF vectors
is the cosine similarity of the vectors.

[`Normalizer`](api/scala/index.html#org.apache.spark.mllib.feature.Normalizer) has the following
parameter in the constructor:

* `p` Normalization in $L^p$ space, $p = 2$ by default.
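
Here the $L^p$ norm of a vector $x$ is $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$, and the
$L^\infty$ norm is $\|x\|_\infty = \max_i |x_i|$.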

`Normalizer` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer),
which can apply the normalization to a `Vector` to produce a transformed `Vector`, or to
an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

Note that if the norm of the input is zero, the input vector is returned unchanged.
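
As a small sketch of the cosine-similarity property mentioned above (the two dense vectors are
made up for illustration):

{% highlight scala %}
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

val normalizer = new Normalizer()  // p = 2 by default

// Normalize two made-up vectors to unit L^2 norm.
val v1 = normalizer.transform(Vectors.dense(1.0, 2.0, 3.0))
val v2 = normalizer.transform(Vectors.dense(4.0, 5.0, 6.0))

// The dot product of the normalized vectors equals the cosine similarity of the originals.
val cosineSimilarity = v1.toArray.zip(v2.toArray).map { case (a, b) => a * b }.sum
{% endhighlight %}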

### Example

The example below demonstrates how to load a dataset in libsvm format and then normalize the
features with the $L^2$ norm and the $L^\infty$ norm.

<div class="codetabs">
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val normalizer1 = new Normalizer()
val normalizer2 = new Normalizer(p = Double.PositiveInfinity)

// Each sample in data1 will be normalized using $L^2$ norm.
val data1 = data.map(x => (x.label, normalizer1.transform(x.features)))

// Each sample in data2 will be normalized using $L^\infty$ norm.
val data2 = data.map(x => (x.label, normalizer2.transform(x.features)))
{% endhighlight %}
</div>
</div>
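
For intuition: the vector $(1, 2, 2)$ has $L^2$ norm $\sqrt{1 + 4 + 4} = 3$, so `normalizer1`
maps it to $(1/3, 2/3, 2/3)$, while its $L^\infty$ norm is $2$, so `normalizer2` maps it to
$(0.5, 1, 1)$.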