docs/ml-features.md
[Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words.
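A minimal sketch of `Tokenizer` usage, assuming a `spark-shell` session where a `SQLContext` named `sqlContext` is in scope; the column names `label`, `sentence`, and `words` are illustrative:

```scala
import org.apache.spark.ml.feature.Tokenizer

// Toy input: one sentence per row.
val sentenceDataFrame = sqlContext.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Tokenizer lowercases the text and splits it on whitespace.
val tokenizer = new Tokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")

val tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("words", "label").take(2).foreach(println)
```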
[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter "pattern" (regex, default: `\\s+`) is used as the delimiter to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes "tokens" rather than splitting gaps; all matching occurrences then become the tokenization result.