Commit 48a2f5d

hhbyyh authored and CodingCat committed

[SPARK-7583] [MLLIB] User guide update for RegexTokenizer

jira: https://issues.apache.org/jira/browse/SPARK-7583

User guide update for RegexTokenizer

Author: Yuhao Yang <[email protected]>

Closes apache#7828 from hhbyyh/regexTokenizerDoc.
1 parent ea969aa commit 48a2f5d

1 file changed: +30 -11 lines changed


docs/ml-features.md

Lines changed: 30 additions & 11 deletions
@@ -217,21 +217,32 @@ for feature in result.select("result").take(3):
 
 [Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words.
 
-Note: A more advanced tokenizer is provided via [RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer).
+[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more
+advanced tokenization based on regular expression (regex) matching.
+By default, the parameter "pattern" (regex, default: \\s+) is used as the delimiter to split the input text.
+Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes
+"tokens" rather than splitting gaps; the tokenization result is then all matching occurrences.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 {% highlight scala %}
-import org.apache.spark.ml.feature.Tokenizer
+import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}
 
 val sentenceDataFrame = sqlContext.createDataFrame(Seq(
   (0, "Hi I heard about Spark"),
-  (0, "I wish Java could use case classes"),
-  (1, "Logistic regression models are neat")
+  (1, "I wish Java could use case classes"),
+  (2, "Logistic,regression,models,are,neat")
 )).toDF("label", "sentence")
 val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
-val wordsDataFrame = tokenizer.transform(sentenceDataFrame)
-wordsDataFrame.select("words", "label").take(3).foreach(println)
+val regexTokenizer = new RegexTokenizer()
+  .setInputCol("sentence")
+  .setOutputCol("words")
+  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)
+
+val tokenized = tokenizer.transform(sentenceDataFrame)
+tokenized.select("words", "label").take(3).foreach(println)
+val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
+regexTokenized.select("words", "label").take(3).foreach(println)
 {% endhighlight %}
 </div>
 
@@ -240,6 +251,7 @@ wordsDataFrame.select("words", "label").take(3).foreach(println)
 import com.google.common.collect.Lists;
 
 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.RegexTokenizer;
 import org.apache.spark.ml.feature.Tokenizer;
 import org.apache.spark.mllib.linalg.Vector;
 import org.apache.spark.sql.DataFrame;
@@ -252,8 +264,8 @@ import org.apache.spark.sql.types.StructType;
 
 JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
   RowFactory.create(0, "Hi I heard about Spark"),
-  RowFactory.create(0, "I wish Java could use case classes"),
-  RowFactory.create(1, "Logistic regression models are neat")
+  RowFactory.create(1, "I wish Java could use case classes"),
+  RowFactory.create(2, "Logistic,regression,models,are,neat")
 ));
 StructType schema = new StructType(new StructField[]{
   new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
@@ -267,22 +279,29 @@ for (Row r : wordsDataFrame.select("words", "label").take(3)) {
   for (String word : words) System.out.print(word + " ");
   System.out.println();
 }
+
+RegexTokenizer regexTokenizer = new RegexTokenizer()
+  .setInputCol("sentence")
+  .setOutputCol("words")
+  .setPattern("\\W");  // alternatively .setPattern("\\w+").setGaps(false)
 {% endhighlight %}
 </div>
 
 <div data-lang="python" markdown="1">
 {% highlight python %}
-from pyspark.ml.feature import Tokenizer
+from pyspark.ml.feature import Tokenizer, RegexTokenizer
 
 sentenceDataFrame = sqlContext.createDataFrame([
   (0, "Hi I heard about Spark"),
-  (0, "I wish Java could use case classes"),
-  (1, "Logistic regression models are neat")
+  (1, "I wish Java could use case classes"),
+  (2, "Logistic,regression,models,are,neat")
 ], ["label", "sentence"])
 tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
 wordsDataFrame = tokenizer.transform(sentenceDataFrame)
 for words_label in wordsDataFrame.select("words", "label").take(3):
   print(words_label)
+regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
+# alternatively, pattern="\\w+", gaps=False
 {% endhighlight %}
 </div>
 </div>
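
The prose added in the first hunk describes two RegexTokenizer modes: by default the "pattern" regex matches the gaps between tokens, while with gaps=false it matches the tokens themselves. The diff's snippets only exercise the default mode, leaving the alternative in a comment. Below is a minimal, standalone Scala sketch (not part of the commit) that contrasts the two settings on the same data; the object name, the local-mode SparkContext/SQLContext setup, and the collect-and-print calls are illustrative assumptions, while the RegexTokenizer calls mirror the diff above.

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.RegexTokenizer

// Hypothetical driver object; the guide's snippets assume a spark-shell with
// `sqlContext` already in scope, so the setup here is only for running standalone.
object RegexTokenizerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RegexTokenizerSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    val sentenceDataFrame = sqlContext.createDataFrame(Seq(
      (0, "Hi I heard about Spark"),
      (1, "I wish Java could use case classes"),
      (2, "Logistic,regression,models,are,neat")
    )).toDF("label", "sentence")

    // gaps = true (the default): "pattern" describes the gaps between tokens,
    // so "\\W" splits on every non-word character.
    val splitOnGaps = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\W")

    // gaps = false: "pattern" describes the tokens themselves,
    // so "\\w+" keeps every maximal run of word characters.
    val matchTokens = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\w+")
      .setGaps(false)

    // On this data both settings yield the same token sequences, e.g. the
    // comma-separated row is split into five word tokens either way.
    splitOnGaps.transform(sentenceDataFrame).select("words", "label").collect().foreach(println)
    matchTokens.transform(sentenceDataFrame).select("words", "label").collect().foreach(println)

    sc.stop()
  }
}
{% endhighlight %}

With the default whitespace pattern (\\s+) the comma-separated row would come back as a single token, which is presumably why the updated example switches the pattern to \\W.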
