
Commit 702d85a

zero323 authored and Felix Cheung committed
[SPARK-20208][R][DOCS] Document R fpGrowth support
## What changes were proposed in this pull request?

Document fpGrowth in:
- vignettes
- programming guide
- code example

## How was this patch tested?

Manual tests.

Author: zero323 <[email protected]>

Closes #17557 from zero323/SPARK-20208.
1 parent e468a96 commit 702d85a

File tree

2 files changed: +86 -1 lines changed


R/pkg/vignettes/sparkr-vignettes.Rmd

Lines changed: 36 additions & 1 deletion

@@ -505,6 +505,10 @@ SparkR supports the following machine learning models and algorithms.
 
 * Alternating Least Squares (ALS)
 
+#### Frequent Pattern Mining
+
+* FP-growth
+
 #### Statistics
 
 * Kolmogorov-Smirnov Test

@@ -707,7 +711,7 @@ summary(tweedieGLM1)
 ```
 We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
 ```{r}
-tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie", 
+tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
                          var.power = 1.2, link.power = 0.0)
 summary(tweedieGLM2)
 ```

@@ -906,6 +910,37 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+#### FP-growth
+
+`spark.fpGrowth` executes the FP-growth algorithm to mine frequent itemsets on a `SparkDataFrame`. `itemsCol` should be an array of values.
+
+```{r}
+df <- selectExpr(createDataFrame(data.frame(rawItems = c(
+  "T,R,U", "T,S", "V,R", "R,U,T,V", "R,S", "V,S,U", "U,R", "S,T", "V,R", "V,U,S",
+  "T,V,U", "R,V", "T,S", "T,S", "S,T", "S,U", "T,R", "V,R", "S,V", "T,S,U"
+))), "split(rawItems, ',') AS items")
+
+fpm <- spark.fpGrowth(df, minSupport = 0.2, minConfidence = 0.5)
+```
+
+The `spark.freqItemsets` method can be used to retrieve a `SparkDataFrame` with the frequent itemsets.
+
+```{r}
+head(spark.freqItemsets(fpm))
+```
+
+`spark.associationRules` returns a `SparkDataFrame` with the association rules.
+
+```{r}
+head(spark.associationRules(fpm))
+```
+
+We can make predictions based on the `antecedent`.
+
+```{r}
+head(predict(fpm, df))
+```
+
 #### Kolmogorov-Smirnov Test
 
 `spark.kstest` runs a two-sided, one-sample [Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).

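The vignette section above shows the SparkR API surface but not what FP-growth actually computes. As a language-agnostic illustration, here is a minimal Python sketch that brute-forces frequent itemsets and association rules over the same 20 sample transactions. It counts every candidate itemset directly rather than building the FP-tree that the real algorithm uses, and the helper names `frequent_itemsets` and `association_rules` are invented for this sketch, not part of any Spark API.

```python
from itertools import combinations

# The 20 sample transactions from the vignette example.
transactions = [set(t.split(",")) for t in [
    "T,R,U", "T,S", "V,R", "R,U,T,V", "R,S", "V,S,U", "U,R", "S,T", "V,R", "V,U,S",
    "T,V,U", "R,V", "T,S", "T,S", "S,T", "S,U", "T,R", "V,R", "S,V", "T,S,U",
]]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support.

    Naive enumeration of all candidate itemsets; real FP-growth avoids
    this by mining a compressed FP-tree instead.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    freq = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= min_support:
                freq[frozenset(candidate)] = support
    return freq

def association_rules(freq, min_confidence):
    """Derive rules (antecedent, consequent, confidence) from frequent itemsets."""
    rules = []
    for itemset, support in freq.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), k)):
                # Subsets of a frequent itemset are frequent, so freq[antecedent] exists.
                confidence = support / freq[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

freq = frequent_itemsets(transactions, min_support=0.2)
rules = association_rules(freq, min_confidence=0.5)
```

With these thresholds, for instance, `S` appears in 11 of 20 baskets (support 0.55) and the pair `{S, T}` in 6 (support 0.3), yielding the rule `T -> S` with confidence 0.6.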
examples/src/main/r/ml/fpm.R

Lines changed: 50 additions & 0 deletions

+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/fpm.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fpm-example")
+
+# $example on$
+# Load training data
+df <- selectExpr(createDataFrame(data.frame(rawItems = c(
+  "1,2,5", "1,2,3,5", "1,2"
+))), "split(rawItems, ',') AS items")
+
+fpm <- spark.fpGrowth(df, itemsCol = "items", minSupport = 0.5, minConfidence = 0.6)
+
+# Extract frequent itemsets
+spark.freqItemsets(fpm)
+
+# Extract association rules
+spark.associationRules(fpm)
+
+# Predict uses association rules and combines possible consequents
+predict(fpm, df)
+# $example off$
+
+sparkR.session.stop()

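The comment on the example's last step ("Predict uses association rules and combines possible consequents") is worth unpacking: prediction for a basket is the union of the consequents of every mined rule whose antecedent is contained in that basket, minus the items the basket already holds. Here is a small Python sketch of that semantics over the example's three transactions and thresholds. It is a brute-force cross-check of the behavior, not SparkR's implementation, and the `predict` helper is a hypothetical name for this sketch.

```python
from itertools import combinations

transactions = [{"1", "2", "5"}, {"1", "2", "3", "5"}, {"1", "2"}]
min_support, min_confidence = 0.5, 0.6

n = len(transactions)
items = sorted(set().union(*transactions))

# Support of every itemset that clears min_support (brute-force counting).
support = {}
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        s = sum(1 for t in transactions if set(cand) <= t) / n
        if s >= min_support:
            support[frozenset(cand)] = s

# Rules antecedent -> consequent with confidence >= min_confidence.
rules = []
for itemset in support:
    for k in range(1, len(itemset)):
        for ante in map(frozenset, combinations(sorted(itemset), k)):
            if support[itemset] / support[ante] >= min_confidence:
                rules.append((ante, itemset - ante))

def predict(basket):
    """Union of consequents of all rules whose antecedent fits the basket,
    minus items the basket already contains."""
    out = set()
    for ante, cons in rules:
        if ante <= basket:
            out |= cons
    return out - basket

predictions = [predict(t) for t in transactions]
```

On this data only the third basket, `{1, 2}`, gets a non-empty prediction: rules such as `1 -> 5` and `2 -> 5` fire, so `5` is suggested; the other two baskets already contain every frequent consequent.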