* `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the specified splits will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)` (see the sketch below).
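
To make the bucket boundaries concrete, here is a small plain-Scala sketch of the assignment rule described above. It is an illustration only, not Spark's implementation, and it assumes the splits already cover every input value:

{% highlight scala %}
// Bucket i covers [splits(i), splits(i + 1)); the last bucket also includes
// the final split value.
def bucketIndex(value: Double, splits: Array[Double]): Int = {
  require(value >= splits.head && value <= splits.last, "value is outside the splits")
  if (value == splits.last) splits.length - 2   // the last bucket is right-inclusive
  else splits.indexWhere(s => value < s) - 1    // first split greater than value, minus one
}

bucketIndex(0.5, Array(0.0, 1.0, 2.0))  // 0, since 0.5 lies in [0.0, 1.0)
bucketIndex(2.0, Array(0.0, 1.0, 2.0))  // 1, since the last bucket also includes 2.0
{% endhighlight %}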

Note that if you do not know the upper and lower bounds of the targeted column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out-of-bounds exception from the `Bucketizer`.

Note also that the splits that you provide have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
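
For instance, splits with infinite end points, like the first `splits` example above, satisfy both requirements and keep every `Double` value inside some bucket:

{% highlight scala %}
// Covers all Double values: [-inf, 0.0) -> bucket 0, [0.0, 1.0) -> bucket 1,
// [1.0, +inf] -> bucket 2. The splits are strictly increasing, as required.
val splits = Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)
{% endhighlight %}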

More details can be found in the API docs for [Bucketizer](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer).

The following example demonstrates how to bucketize a column of `Double`s into a column of bucket indices.

<div class="codetabs">
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.DataFrame

// Since we know the bounds of the data, there is no need to add -inf and inf.
val splits = Array(-0.5, 0.0, 0.5)

val data = Array(-0.5, -0.3, 0.0, 0.2)
val dataFrame = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("feature")

val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)

// Transform original data into its bucket index.
val bucketizedData = bucketizer.transform(dataFrame)
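// With these splits, -0.5 and -0.3 fall into bucket 0 and 0.0 and 0.2 fall into
// bucket 1, so the "result" column holds the bucket indices 0.0, 0.0, 1.0, 1.0.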
{% endhighlight %}
</div>

<div data-lang="java">
{% highlight java %}
import com.google.common.collect.Lists;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
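
// Since we know the bounds of the data, there is no need to add -inf and inf.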
double[] splits = {-0.5, 0.0, 0.5};

JavaRDD<Row> data = jsc.parallelize(Lists.newArrayList(
  RowFactory.create(-0.5),
  RowFactory.create(-0.3),
  RowFactory.create(0.0),
  RowFactory.create(0.2)
));
StructType schema = new StructType(new StructField[] {
  new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())