docs/ml-features.md
10 additions & 6 deletions
@@ -251,11 +251,12 @@ frequently and don't carry as much meaning.
 `StopWordsRemover` takes as input a sequence of strings (e.g. the output
 of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
 words from the input sequences. The list of stopwords is specified by
-the `stopWords` parameter. We provide [a list of stop
-words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
-default, accessible by calling `getStopWords` on a newly instantiated
-`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
-if the matches should be case sensitive (false by default).
+the `stopWords` parameter. Default stop words for some languages are accessible
+by calling `StopWordsRemover.loadDefaultStopWords(language)`, for which available
+options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
+"italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish".
+A boolean parameter `caseSensitive` indicates if the matches should be case sensitive
+(false by default).
 
 **Examples**
 
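Note on the `StopWordsRemover` hunk above: the filtering behaviour and the `caseSensitive` flag it describes can be illustrated with a minimal plain-Python sketch. This is not Spark code and does not call `loadDefaultStopWords`. The helper `remove_stop_words` and the four-word `SAMPLE_STOP_WORDS` list are hypothetical names invented for illustration. The sample list is not Spark's default English list.

```python
# Plain-Python sketch of the semantics described in the hunk; not Spark code.
# SAMPLE_STOP_WORDS is a made-up sample, not Spark's default English list.
SAMPLE_STOP_WORDS = ["i", "the", "a", "had"]


def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that do not appear in stop_words."""
    if case_sensitive:
        # Exact matching: "I" is not the same token as "i".
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive matching: compare lowercased forms on both sides.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


tokens = ["I", "saw", "the", "red", "balloon"]
insensitive = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
sensitive = remove_stop_words(tokens, SAMPLE_STOP_WORDS, case_sensitive=True)
```

With the default (case-insensitive) setting, the capitalised "I" is dropped along with "the". With `case_sensitive=True`, "I" survives because only the lowercase "i" is in the sample list.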
@@ -346,7 +347,10 @@ for more details on the API.
 
 Binarization is the process of thresholding numerical features to binary (0/1) features.
 
-`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold` for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
+`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold`
+for binarization. Feature values greater than the threshold are binarized to 1.0; values equal
+to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
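Note on the `Binarizer` hunk: the thresholding rule it documents (strictly greater than the threshold maps to 1.0, equal or less maps to 0.0) can be checked with a small plain-Python sketch. This is not Spark code. The helpers `binarize_value` and `binarize_vector` are hypothetical names, and a Python list stands in for a Spark Vector as a simplifying assumption.

```python
# Plain-Python sketch of the thresholding rule in the hunk; not Spark code.
# A list of floats stands in for a Spark Vector (an assumption of this sketch).


def binarize_value(value, threshold):
    """Map a single double to 1.0 if it is strictly above threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


def binarize_vector(values, threshold):
    """Apply the same rule to every entry of a vector-like sequence."""
    return [binarize_value(v, threshold) for v in values]


single = binarize_value(0.8, threshold=0.5)
vector = binarize_vector([0.1, 0.5, 0.8], threshold=0.5)
```

The middle entry, which equals the threshold exactly, is binarized to 0.0, matching the "equal to or less than" wording in the changed lines.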