Commit e350e49

add all demos

1 parent 006ddf1 commit e350e49

File tree

1 file changed: +73, -2 lines changed

docs/ml-features.md

Lines changed: 73 additions & 2 deletions
@@ -795,13 +795,84 @@ scaledData = scalerModel.transform(dataFrame)

* `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x, y), except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf and inf must be explicitly provided to cover all Double values; otherwise, values outside the specified splits will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.

-Note that if the standard deviation of a feature is zero, it will return default `0.0` value in the `Vector` for that feature.
+Note that if you do not know the upper and lower bounds of the target column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out-of-bounds exception from `Bucketizer`.
+
+Note also that the splits you provide must be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.

More details can be found in the API docs for [Bucketizer](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer).

-The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.
+The following example demonstrates how to bucketize a column of `Double`s into a column of bucket indices.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.ml.feature.Bucketizer
+import org.apache.spark.sql.DataFrame
+
+// Since we know the bounds of the data, there is no need to add -inf and inf.
+val splits = Array(-0.5, 0.0, 0.5)
+
+val data = Array(-0.5, -0.3, 0.0, 0.2)
+val dataFrame = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("feature")
+
+val bucketizer = new Bucketizer().setInputCol("feature").setOutputCol("result").setSplits(splits)
+
+// Transform the original data into its bucket indices.
+val bucketizedData = bucketizer.transform(dataFrame)
+{% endhighlight %}
+</div>

+<div data-lang="java">
+{% highlight java %}
+import com.google.common.collect.Lists;

+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.Bucketizer;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+double[] splits = {-0.5, 0.0, 0.5};
+
+JavaRDD<Row> data = jsc.parallelize(Lists.newArrayList(
+  RowFactory.create(-0.5),
+  RowFactory.create(-0.3),
+  RowFactory.create(0.0),
+  RowFactory.create(0.2)
+));
+StructType schema = new StructType(new StructField[] {
+  new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())
+});
+DataFrame dataFrame = jsql.createDataFrame(data, schema);
+
+Bucketizer bucketizer = new Bucketizer()
+  .setInputCol("feature")
+  .setOutputCol("result")
+  .setSplits(splits);
+
+// Transform the original data into its bucket indices.
+DataFrame bucketizedData = bucketizer.transform(dataFrame);
+{% endhighlight %}
+</div>
+
+<div data-lang="python">
+{% highlight python %}
+from pyspark.ml.feature import Bucketizer
+
+splits = [-0.5, 0.0, 0.5]
+
+data = [(-0.5,), (-0.3,), (0.0,), (0.2,)]
+dataFrame = sqlContext.createDataFrame(data, ["feature"])
+
+bucketizer = Bucketizer(splits=splits, inputCol="feature", outputCol="result")
+
+# Transform the original data into its bucket indices.
+bucketizedData = bucketizer.transform(dataFrame)
+{% endhighlight %}
+</div>
+</div>

# Feature Selectors

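For a quick sanity check of the `splits` semantics described above, here is a minimal plain-Scala sketch. It is not part of the commit and does not use Spark; it only assumes the documented rule that a value `v` lands in bucket `i` when `splits(i) <= v < splits(i + 1)`, with the last bucket also including the final split value:

{% highlight scala %}
// Sketch (not from the commit): predicts the bucket index that
// Bucketizer's documented splits rule should assign to each demo value.
def bucketIndex(v: Double, splits: Array[Double]): Int = {
  require(splits.sliding(2).forall(pair => pair(0) < pair(1)),
    "splits must be strictly increasing")
  require(v >= splits.head && v <= splits.last,
    s"$v is outside the provided splits")
  if (v == splits.last) splits.length - 2  // last bucket includes its upper bound
  else splits.indexWhere(v < _) - 1        // first split strictly greater than v, minus one
}

val splits = Array(-0.5, 0.0, 0.5)
val data = Array(-0.5, -0.3, 0.0, 0.2)

data.map(bucketIndex(_, splits))           // Array(0, 0, 1, 1)
{% endhighlight %}

If that reading is correct, the `result` column produced by the demos above should contain `0.0, 0.0, 1.0, 1.0` for the four sample values, since `Bucketizer` emits bucket indices as doubles.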