Commit 0499ed9

hhbyyh authored and mengxr committed
[SPARK-16045][ML][DOC] Spark 2.0 ML.feature: doc update for stopwords and binarizer
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.

## How was this patch tested?

manual review for doc

Author: Yuhao Yang <[email protected]>
Author: Yuhao Yang <[email protected]>

Closes #13375 from hhbyyh/stopdoc.

(cherry picked from commit a58f402)
Signed-off-by: Xiangrui Meng <[email protected]>
1 parent 14e5dec commit 0499ed9

1 file changed: +10 -6 lines changed
docs/ml-features.md

Lines changed: 10 additions & 6 deletions
@@ -251,11 +251,12 @@ frequently and don't carry as much meaning.
 `StopWordsRemover` takes as input a sequence of strings (e.g. the output
 of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
 words from the input sequences. The list of stopwords is specified by
-the `stopWords` parameter. We provide [a list of stop
-words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
-default, accessible by calling `getStopWords` on a newly instantiated
-`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
-if the matches should be case sensitive (false by default).
+the `stopWords` parameter. Default stop words for some languages are accessible
+by calling `StopWordsRemover.loadDefaultStopWords(language)`, for which available
+options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
+"italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish".
+A boolean parameter `caseSensitive` indicates if the matches should be case sensitive
+(false by default).
 
 **Examples**
 
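Reviewer note: the `caseSensitive` behaviour described in this hunk can be illustrated with a minimal plain-Python sketch. It is not Spark's implementation and does not call the Spark API; the function name `remove_stop_words` and the tiny stop-word list are hypothetical, chosen only to show how the flag changes matching.

```python
from typing import Iterable, List


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: Iterable[str],
    case_sensitive: bool = False,
) -> List[str]:
    """Drop every token that matches a stop word (hypothetical helper)."""
    if case_sensitive:
        # Exact string comparison: "I" does not match the stop word "i".
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive comparison (the documented default): lowercase both sides.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


# Hypothetical stop-word list for illustration only, not Spark's default list.
HYPOTHETICAL_STOP_WORDS = ["i", "the", "a"]
tokens = ["I", "saw", "the", "red", "balloon"]

filtered_default = remove_stop_words(tokens, HYPOTHETICAL_STOP_WORDS)
filtered_case_sensitive = remove_stop_words(
    tokens, HYPOTHETICAL_STOP_WORDS, case_sensitive=True
)
```

With the default setting both "I" and "the" are dropped; with case-sensitive matching "I" survives because only the lowercase "i" is in the list.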

@@ -346,7 +347,10 @@ for more details on the API.
 
 Binarization is the process of thresholding numerical features to binary (0/1) features.
 
-`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold` for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
+`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold`
+for binarization. Feature values greater than the threshold are binarized to 1.0; values equal
+to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
+for `inputCol`.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
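Reviewer note: the thresholding rule in the updated `Binarizer` paragraph (strictly greater than the threshold maps to 1.0, everything else to 0.0, for either a single double or a vector) can be sketched in plain Python. This is an illustration under stated assumptions, not Spark code: the function name `binarize` is hypothetical and a Python list stands in for a Vector.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def binarize(
    value: Union[Number, Sequence[Number]], threshold: float
) -> Union[float, List[float]]:
    """Apply the documented rule: x > threshold -> 1.0, otherwise 0.0."""
    if isinstance(value, (int, float)):
        # Scalar input, standing in for a Double column value.
        return 1.0 if value > threshold else 0.0
    # Sequence input, standing in for a Vector column value (element-wise).
    return [1.0 if v > threshold else 0.0 for v in value]


# A value equal to the threshold is binarized to 0.0, not 1.0.
scalar_result = binarize(0.5, threshold=0.5)
vector_result = binarize([0.1, 0.5, 0.8], threshold=0.5)
```

Here `scalar_result` is 0.0 and `vector_result` is `[0.0, 0.0, 1.0]`, since only 0.8 is strictly greater than the threshold.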
