
Commit 572952a

DB Tsai authored and mengxr committed
[SPARK-2841][MLlib] Documentation for feature transformations
Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer

Author: DB Tsai <[email protected]>

Closes apache#2068 from dbtsai/transformer-documentation and squashes the following commits:

109f324 [DB Tsai] address feedback
1 parent ded6796 commit 572952a

1 file changed: +107, -2 lines changed


docs/mllib-feature-extraction.md

Lines changed: 107 additions & 2 deletions
@@ -1,7 +1,7 @@
---
layout: global
-title: Feature Extraction - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction
+title: Feature Extraction and Transformation - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction and Transformation
---

* Table of contents
@@ -148,3 +148,108 @@ for((synonym, cosineSimilarity) <- synonyms) {
{% endhighlight %}
</div>
</div>

## StandardScaler

Standardizes features by scaling to unit variance and/or removing the mean using column summary
statistics on the samples in the training set. This is a very common pre-processing step.

For example, the RBF kernel of Support Vector Machines or L1 and L2 regularized linear models
typically work better when all features have unit variance and/or zero mean.

Standardization can improve the convergence rate during the optimization process, and also prevents
features with very large variances from exerting an overly large influence during model training.
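
As a concrete sketch (using the standard definition of standardization, which is not spelled out in
this guide), when both centering and scaling are enabled each feature value $x_j$ is transformed
using the column mean $\mu_j$ and standard deviation $\sigma_j$ estimated from the training set:

\[
x_j' = \frac{x_j - \mu_j}{\sigma_j}
\]

Only the centering or only the scaling is applied when the corresponding option (described below)
is disabled.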

### Model Fitting

[`StandardScaler`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) has the
following parameters in the constructor:

* `withMean` False by default. Centers the data with the mean before scaling. It builds a dense
output, so this does not work on sparse input and will raise an exception.
* `withStd` True by default. Scales the data to unit variance.

We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) method in
`StandardScaler` which can take an input of `RDD[Vector]`, learn the summary statistics, and then
return a model which can transform the input dataset into unit variance and/or zero mean features
depending on how we configure the `StandardScaler`.

This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
which can apply the standardization on a `Vector` to produce a transformed `Vector` or on
an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
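
As a minimal sketch of that interface (the names `scalerModel` and `vectors` are placeholders
rather than part of the example below; `scalerModel` is assumed to be the result of a `fit` call):

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Apply the standardization to a single Vector...
val scaled: Vector = scalerModel.transform(Vectors.dense(1.0, 2.0, 3.0))
// ...or to a whole RDD[Vector] at once.
val scaledAll: RDD[Vector] = scalerModel.transform(vectors)
{% endhighlight %}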

Note that if the variance of a feature is zero, the transformation will return the default `0.0`
value in the `Vector` for that feature.

### Example

The example below demonstrates how to load a dataset in libsvm format and standardize the features
so that the new features have unit variance and/or zero mean.

<div class="codetabs">
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// scaler1 uses the defaults (withMean = false, withStd = true); scaler2 also centers the data.
val scaler1 = new StandardScaler().fit(data.map(x => x.features))
val scaler2 = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => x.features))

// data1 will be unit variance.
val data1 = data.map(x => (x.label, scaler1.transform(x.features)))

// Without converting the features into dense vectors, transformation with zero mean will raise an
// exception on sparse vectors.
// data2 will be unit variance and zero mean.
val data2 = data.map(x => (x.label, scaler2.transform(Vectors.dense(x.features.toArray))))
{% endhighlight %}
</div>
</div>

## Normalizer

Normalizer scales individual samples to have unit $L^p$ norm. This is a common operation for text
classification or clustering. For example, the dot product of two $L^2$ normalized TF-IDF vectors
is the cosine similarity of the vectors.
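
To see why (using the standard definition of cosine similarity), for two vectors $x$ and $y$

\[
\cos(x, y) = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2},
\]

so once both vectors have unit $L^2$ norm the cosine similarity reduces to the plain dot product
$x \cdot y$.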

[`Normalizer`](api/scala/index.html#org.apache.spark.mllib.feature.Normalizer) has the following
parameter in the constructor:

* `p` Normalization in $L^p$ space, $p = 2$ by default.
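
In other words (a sketch using the usual norm definitions, which are not stated explicitly in this
guide), each sample $x$ is rescaled as

\[
x' = \frac{x}{\|x\|_p}, \qquad
\|x\|_p = \Big(\sum_i |x_i|^p\Big)^{1/p}, \qquad
\|x\|_\infty = \max_i |x_i|.
\]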

`Normalizer` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
which can apply the normalization on a `Vector` to produce a transformed `Vector` or on
an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

Note that if the norm of the input is zero, it will return the input vector.

### Example

The example below demonstrates how to load a dataset in libsvm format and normalize the features
with the $L^2$ norm and the $L^\infty$ norm.
<div class="codetabs">
236+
<div data-lang="scala">
237+
{% highlight scala %}
238+
import org.apache.spark.SparkContext._
239+
import org.apache.spark.mllib.feature.Normalizer
240+
import org.apache.spark.mllib.linalg.Vectors
241+
import org.apache.spark.mllib.util.MLUtils
242+
243+
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
244+
245+
val normalizer1 = new Normalizer()
246+
val normalizer2 = new Normalizer(p = Double.PositiveInfinity)
247+
248+
// Each sample in data1 will be normalized using $L^2$ norm.
249+
val data1 = data.map(x => (x.label, normalizer1.transform(x.features)))
250+
251+
// Each sample in data2 will be normalized using $L^\infty$ norm.
252+
val data2 = data.map(x => (x.label, normalizer2.transform(x.features)))
253+
{% endhighlight %}
254+
</div>
255+
</div>
