
Commit a0725a5

doc for MinMaxScaler
1 parent 2848f4d commit a0725a5

File tree

1 file changed: +69 −0 lines changed


docs/ml-features.md

Lines changed: 69 additions & 0 deletions
@@ -865,6 +865,7 @@ val scaledData = scalerModel.transform(dataFrame)
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.StandardScaler;
+import org.apache.spark.ml.feature.StandardScalerModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.DataFrame;
@@ -905,6 +906,74 @@ scaledData = scalerModel.transform(dataFrame)
</div>
</div>

## MinMaxScaler

`MinMaxScaler` transforms a dataset of `Vector` rows, rescaling each feature to a specific range (often [0, 1]). It takes the following parameters:

* `min`: 0.0 by default. Lower bound after transformation, shared by all features.
* `max`: 1.0 by default. Upper bound after transformation, shared by all features.

`MinMaxScaler` computes summary statistics on a dataset and produces a `MinMaxScalerModel`. The model can then transform each feature individually such that it is in the given range.
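As an illustrative aside, the fit/transform split described above can be sketched in plain Python (hypothetical helper names, not the Spark API): fitting only gathers each feature's minimum and maximum, and the transform then applies the rescaling column-wise.

```python
def fit_min_max(rows):
    """Summary statistics: per-feature minimum and maximum."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def transform_min_max(rows, mins, maxs, lo=0.0, hi=1.0):
    """Rescale each feature individually into [lo, hi]."""
    out = []
    for row in rows:
        scaled = []
        for e, e_min, e_max in zip(row, mins, maxs):
            if e_max == e_min:  # degenerate (constant) feature
                scaled.append(0.5 * (hi + lo))
            else:
                scaled.append((e - e_min) / (e_max - e_min) * (hi - lo) + lo)
        out.append(scaled)
    return out

rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
mins, maxs = fit_min_max(rows)           # mins=[1.0, 10.0], maxs=[3.0, 30.0]
scaled = transform_min_max(rows, mins, maxs)
# rows map to [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```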

The rescaled value for a feature E is calculated as,
`\begin{equation}
  Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
\end{equation}`

For the case `$E_{max} == E_{min}$`, `$Rescaled(e_i) = 0.5 * (max + min)$`.

Note that since zero values will probably be transformed to non-zero values, the output of the transformer will be `DenseVector` even for sparse input.
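A quick sanity check of the formula in plain Python (illustrative only, not part of the Spark API), covering the degenerate `$E_{max} == E_{min}$` case and the reason sparse inputs densify:

```python
def rescaled(e, e_min, e_max, lo=0.0, hi=1.0):
    """Min-max rescale a single value, per the formula above."""
    if e_max == e_min:               # constant feature: map to the midpoint
        return 0.5 * (hi + lo)
    return (e - e_min) / (e_max - e_min) * (hi - lo) + lo

# Feature observed over a column with min 2.0 and max 6.0, rescaled to [0, 1]:
print(rescaled(2.0, 2.0, 6.0))   # lower bound -> 0.0
print(rescaled(4.0, 2.0, 6.0))   # midpoint    -> 0.5
print(rescaled(6.0, 2.0, 6.0))   # upper bound -> 1.0
print(rescaled(5.0, 5.0, 5.0))   # constant feature -> 0.5

# The sparse-input caveat: a zero entry maps to (0 - e_min) / (e_max - e_min),
# which is non-zero whenever e_min != 0.
print(rescaled(0.0, -2.0, 2.0))  # 0.5, not 0.0
```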

More details can be found in the API docs for
[MinMaxScaler](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and
[MinMaxScalerModel](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel).

The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].

<div class="codetabs">
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val dataFrame = sqlContext.createDataFrame(data)
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Compute summary statistics and generate MinMaxScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to range [min, max].
val scaledData = scalerModel.transform(dataFrame)
{% endhighlight %}
</div>

<div data-lang="java">
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.MinMaxScaler;
import org.apache.spark.ml.feature.MinMaxScalerModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.DataFrame;

JavaRDD<LabeledPoint> data =
  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
DataFrame dataFrame = jsql.createDataFrame(data, LabeledPoint.class);
MinMaxScaler scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures");

// Compute summary statistics and generate MinMaxScalerModel
MinMaxScalerModel scalerModel = scaler.fit(dataFrame);

// rescale each feature to range [min, max].
DataFrame scaledData = scalerModel.transform(dataFrame);
{% endhighlight %}
</div>
</div>

## Bucketizer

`Bucketizer` transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a parameter:
