
Commit c5e46fb

Author: VinceShieh (committed)
[SPARK-19590][pyspark][ML] update the document for QuantileDiscretizer in pyspark
This PR documents the change to QuantileDiscretizer in pyspark made in PR #15428.

Signed-off-by: VinceShieh <[email protected]>
1 parent: 1ab9731

File tree

1 file changed: +11 -1 lines


python/pyspark/ml/feature.py

Lines changed: 11 additions & 1 deletion
@@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab

     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
-    The bin ranges are chosen using an approximate algorithm (see the documentation for
+    It is possible that the number of buckets used will be less than this value, for example, if
+    there are too few distinct values of the input to create enough distinct quantiles.
+
+    NaN handling: Note also that
+    QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user
+    can also choose to either keep or remove NaN values within the dataset by setting
+    `handleInvalid`. If the user chooses to keep NaN values, they will be handled specially and
+    placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be
+    put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+
+    Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
     :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
     The precision of the approximation can be controlled with the
     :py:attr:`relativeError` parameter.
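For reference, here is a minimal PySpark sketch of the behavior described in the updated docstring. It is not part of this commit: the SparkSession setup, the column names "values"/"buckets", and the sample data are illustrative, and it assumes the `handleInvalid` parameter is exposed on the Python `QuantileDiscretizer` (the Scala-side behavior is what this docstring documents; the Python parameter may only be available in later releases).

# Minimal usage sketch (illustrative, not part of this commit).
# Assumes a local SparkSession; column names and data are arbitrary, and
# handleInvalid is assumed to be available on the Python QuantileDiscretizer.
from pyspark.sql import SparkSession
from pyspark.ml.feature import QuantileDiscretizer

spark = SparkSession.builder.master("local[2]").appName("qd-doc-example").getOrCreate()

# One NaN row to exercise the handleInvalid="keep" path described above.
df = spark.createDataFrame(
    [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),)], ["values"])

discretizer = QuantileDiscretizer(numBuckets=4, inputCol="values", outputCol="buckets",
                                  relativeError=0.01, handleInvalid="keep")
bucketed = discretizer.fit(df).transform(df)
bucketed.show()
# With handleInvalid="keep", non-NaN rows land in buckets 0-3 and the NaN row is
# placed in an extra bucket (index 4). If the input has too few distinct values,
# fewer than numBuckets buckets may actually be produced.

spark.stop()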
