From 625b34ccebab982cd3e5f186dbf7abd6e2b55559 Mon Sep 17 00:00:00 2001
From: Yuhao Yang
Date: Sat, 28 May 2016 19:07:11 +0800
Subject: [PATCH 1/2] doc for stopwords and binarizer

---
 docs/ml-features.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index 3db24a384059..b16cc142811c 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -251,11 +251,12 @@ frequently and don't carry as much meaning.
 `StopWordsRemover` takes as input a sequence of strings (e.g. the output
 of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
 words from the input sequences. The list of stopwords is specified by
-the `stopWords` parameter. We provide [a list of stop
-words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
-default, accessible by calling `getStopWords` on a newly instantiated
-`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
-if the matches should be case sensitive (false by default).
+the `stopWords` parameter. Default stop words for some languages are provided
+("danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian",
+"norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish"),
+which are accessible by calling `StopWordsRemover.loadDefaultStopWords(language)`.
+A boolean parameter `caseSensitive` indicates if the matches should be case
+sensitive (false by default).

 **Examples**

@@ -346,7 +347,10 @@ for more details on the API.

 Binarization is the process of thresholding numerical features to binary (0/1) features.

-`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold` for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
+`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold`
+for binarization. Feature values greater than the threshold are binarized to 1.0; values equal
+to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
+for `inputCol`.
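
The thresholding rule documented in the Binarizer hunk can be illustrated with a small plain-Python sketch. This is not Spark's `Binarizer` and does not call any Spark API; the helper name `binarize` is hypothetical, and a Python list stands in for a Vector column value. It only shows the strict greater-than comparison applied to either a single double or a vector of doubles.

```python
from typing import List, Sequence, Union


def binarize(value: Union[float, Sequence[float]],
             threshold: float) -> Union[float, List[float]]:
    """Hypothetical helper mirroring the documented rule, not Spark code.

    Values strictly greater than `threshold` map to 1.0;
    values equal to or less than `threshold` map to 0.0.
    """
    if isinstance(value, (list, tuple)):
        # Vector-like input: apply the rule element by element.
        return [1.0 if v > threshold else 0.0 for v in value]
    # Scalar (Double-like) input.
    return 1.0 if value > threshold else 0.0


# A value equal to the threshold falls on the 0.0 side.
single = binarize(0.5, threshold=0.5)
vector = binarize([0.1, 0.5, 0.8], threshold=0.5)
```

The boundary case is the one worth checking when choosing a threshold: an input exactly equal to the threshold is binarized to 0.0, not 1.0.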
From e4e56b3e073f5817620765c2d8abaaa8163f8689 Mon Sep 17 00:00:00 2001
From: Yuhao Yang
Date: Sun, 29 May 2016 11:15:28 -0400
Subject: [PATCH 2/2] clarify

---
 docs/ml-features.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index b16cc142811c..3cb26443b951 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -251,12 +251,12 @@ frequently and don't carry as much meaning.
 `StopWordsRemover` takes as input a sequence of strings (e.g. the output
 of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
 words from the input sequences. The list of stopwords is specified by
-the `stopWords` parameter. Default stop words for some languages are provided
-("danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian",
-"norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish"),
-which are accessible by calling `StopWordsRemover.loadDefaultStopWords(language)`.
-A boolean parameter `caseSensitive` indicates if the matches should be case
-sensitive (false by default).
+the `stopWords` parameter. Default stop words for some languages are accessible
+by calling `StopWordsRemover.loadDefaultStopWords(language)`, for which available
+options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
+"italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish".
+A boolean parameter `caseSensitive` indicates if the matches should be case sensitive
+(false by default).

 **Examples**
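
To make the effect of `caseSensitive` concrete, here is a minimal plain-Python sketch of stop word filtering. It is an illustration under stated assumptions, not Spark's implementation: the helper `remove_stop_words` and the list `SAMPLE_STOP_WORDS` are hypothetical, the list is a tiny stand-in rather than anything returned by `StopWordsRemover.loadDefaultStopWords(language)`, and case-insensitive matching is assumed to mean comparing lowercased forms.

```python
from typing import Iterable, List


def remove_stop_words(tokens: Iterable[str],
                      stop_words: Iterable[str],
                      case_sensitive: bool = False) -> List[str]:
    """Hypothetical helper: drop every token that matches a stop word."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Assumed case-insensitive rule: compare lowercased forms,
    # but return the surviving tokens unchanged.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


# Tiny illustrative list, NOT one of Spark's default stop word lists.
SAMPLE_STOP_WORDS = ["i", "the", "a"]
tokens = ["I", "saw", "the", "red", "balloon"]

# Default (case-insensitive) matching removes both "I" and "the".
filtered_default = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
# Case-sensitive matching keeps "I" because only lowercase "i" is listed.
filtered_strict = remove_stop_words(tokens, SAMPLE_STOP_WORDS, case_sensitive=True)
```

With the default setting, capitalized tokens at the start of a sentence are still filtered; switching to case-sensitive matching requires the stop word list itself to contain every casing that should be dropped.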