
Commit add9d1b

Yuhao Yang authored and Felix Cheung committed
[SPARK-19791][ML] Add doc and example for fpgrowth
## What changes were proposed in this pull request?

Add a new section for fpm.
Add examples for FPGrowth in Scala and Java.
Updated: rewrite transform to be more compact.

## How was this patch tested?

Local doc generation.

Author: Yuhao Yang <[email protected]>

Closes #17130 from hhbyyh/fpmdoc.
1 parent b28c3bc commit add9d1b

File tree

8 files changed: +310 -18 lines changed

docs/_data/menu-ml.yaml

Lines changed: 2 additions & 0 deletions

@@ -8,6 +8,8 @@
   url: ml-clustering.html
 - text: Collaborative filtering
   url: ml-collaborative-filtering.html
+- text: Frequent Pattern Mining
+  url: ml-frequent-pattern-mining.html
 - text: Model selection and tuning
   url: ml-tuning.html
 - text: Advanced topics

docs/ml-frequent-pattern-mining.md

Lines changed: 87 additions & 0 deletions (new file)

---
layout: global
title: Frequent Pattern Mining
displayTitle: Frequent Pattern Mining
---

Mining frequent items, itemsets, subsequences, or other substructures is usually among the
first steps in analyzing a large-scale dataset, and has been an active research topic in
data mining for years.
We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
for more information.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}
## FP-Growth

The FP-growth algorithm is described in the paper
[Han et al., Mining frequent patterns without candidate generation](http://dx.doi.org/10.1145/335191.335372),
where "FP" stands for frequent pattern.
Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items.
Different from [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) algorithms designed for the same purpose,
the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets
explicitly, which are usually expensive to generate.
After the second step, the frequent itemsets can be extracted from the FP-tree.
In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
PFP distributes the work of growing FP-trees based on the suffixes of transactions,
and hence is more scalable than a single-machine implementation.
We refer users to the papers for more details.
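As a minimal illustration of the first step described above (an assumed, Spark-free sketch, not part of this commit or the actual implementation), the item frequencies for the toy transactions used in the examples below can be counted directly:

```scala
// Sketch only: count how many transactions contain each item and keep the items
// whose count meets the minimum support threshold (minSupport = 0.5 here).
val transactions = Seq(Seq("1", "2", "5"), Seq("1", "2", "3", "5"), Seq("1", "2"))
val minSupport = 0.5
val minCount = math.ceil(minSupport * transactions.size)  // 2 of the 3 transactions

val itemCounts: Map[String, Int] =
  transactions.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap

val frequentItems = itemCounts.filter { case (_, count) => count >= minCount }
// frequentItems contains 1 -> 3, 2 -> 3 and 5 -> 2; item "3" appears only once and is dropped.
```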
`spark.ml`'s FP-growth implementation takes the following (hyper-)parameters:

* `minSupport`: the minimum support for an itemset to be identified as frequent.
  For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
* `minConfidence`: the minimum confidence for generating an association rule. Confidence is an indication of how often an
  association rule has been found to be true. For example, if the itemset `X` appears in 4 transactions and `X`
  and `Y` co-occur in only 2 of them, the confidence for the rule `X => Y` is 2/4 = 0.5. The parameter does not
  affect the mining of frequent itemsets, but specifies the minimum confidence for generating association rules
  from frequent itemsets (a worked confidence calculation on toy data is sketched after this list).
* `numPartitions`: the number of partitions used to distribute the work. By default the param is not set, and
  the number of partitions of the input dataset is used.
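As referenced in the `minConfidence` item above, here is a small worked calculation, a hand sketch over the toy transactions from the examples below rather than a call into the Spark API:

```scala
// Sketch only: support and confidence computed by hand on the toy transactions.
val transactions = Seq(Set("1", "2", "5"), Set("1", "2", "3", "5"), Set("1", "2"))

// Number of transactions containing a given itemset.
def count(itemset: Set[String]): Int = transactions.count(t => itemset.subsetOf(t))

val support12 = count(Set("1", "2")).toDouble / transactions.size          // 3/3 = 1.0
// Confidence of the rule {1, 2} => {5}: of the transactions containing the antecedent,
// the fraction that also contain the consequent.
val confidence = count(Set("1", "2", "5")).toDouble / count(Set("1", "2")) // 2/3 ≈ 0.67
```

With `minConfidence = 0.6`, as used in the examples below, a rule with confidence 2/3 would be kept.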
The `FPGrowthModel` provides:

* `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long]).
* `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
  DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
* `transform`: for each transaction in `itemsCol`, the `transform` method compares its items against the antecedents
  of each association rule. If the record contains all the antecedents of a specific association rule, the rule
  is considered applicable and its consequents are added to the prediction result. The transform
  method summarizes the consequents from all the applicable rules as the prediction. The prediction column has
  the same data type as `itemsCol` and does not contain items already present in `itemsCol`. A minimal sketch of
  this prediction logic is given after this list.
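The prediction logic described for `transform` can be sketched outside Spark as follows; the rules here are hypothetical stand-ins (not output of `FPGrowthModel`, and not part of the committed doc):

```scala
// Sketch only: a rule applies when all of its antecedents appear in the transaction;
// the prediction is the union of the consequents of all applicable rules, minus the
// items the transaction already contains.
case class Rule(antecedent: Set[String], consequent: Set[String])

def predict(items: Set[String], rules: Seq[Rule]): Set[String] =
  rules.filter(r => r.antecedent.subsetOf(items))
    .flatMap(_.consequent)
    .toSet
    .diff(items)

// Hypothetical rules, e.g. {1, 2} => {5} and {5} => {3}.
val rules = Seq(Rule(Set("1", "2"), Set("5")), Rule(Set("5"), Set("3")))
predict(Set("1", "2"), rules)      // Set("5"): only the first rule applies
predict(Set("1", "2", "5"), rules) // Set("3"): both rules apply, but "5" is already present
```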
**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.fpm.FPGrowth) for more details.

{% include_example scala/org/apache/spark/examples/ml/FPGrowthExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/fpm/FPGrowth.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaFPGrowthExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.fpm.FPGrowth) for more details.

{% include_example python/ml/fpgrowth_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.fpGrowth.html) for more details.

{% include_example r/ml/fpm.R %}
</div>

</div>

docs/mllib-frequent-pattern-mining.md

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ explicitly, which are usually expensive to generate.
 After the second step, the frequent itemsets can be extracted from the FP-tree.
 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
-PFP distributes the work of growing FP-trees based on the suffices of transactions,
+PFP distributes the work of growing FP-trees based on the suffixes of transactions,
 and hence more scalable than a single-machine implementation.
 We refer users to the papers for more details.

examples/src/main/java/org/apache/spark/examples/ml/JavaFPGrowthExample.java

Lines changed: 77 additions & 0 deletions (new file)

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.ml;

// $example on$
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.fpm.FPGrowth;
import org.apache.spark.ml.fpm.FPGrowthModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
// $example off$

/**
 * An example demonstrating FPGrowth.
 * Run with
 * <pre>
 * bin/run-example ml.JavaFPGrowthExample
 * </pre>
 */
public class JavaFPGrowthExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaFPGrowthExample")
      .getOrCreate();

    // $example on$
    List<Row> data = Arrays.asList(
      RowFactory.create(Arrays.asList("1 2 5".split(" "))),
      RowFactory.create(Arrays.asList("1 2 3 5".split(" "))),
      RowFactory.create(Arrays.asList("1 2".split(" ")))
    );
    StructType schema = new StructType(new StructField[]{ new StructField(
      "items", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
    });
    Dataset<Row> itemsDF = spark.createDataFrame(data, schema);

    FPGrowthModel model = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.5)
      .setMinConfidence(0.6)
      .fit(itemsDF);

    // Display frequent itemsets.
    model.freqItemsets().show();

    // Display generated association rules.
    model.associationRules().show();

    // transform examines the input items against all the association rules and summarizes the
    // consequents as the prediction.
    model.transform(itemsDF).show();
    // $example off$

    spark.stop();
  }
}
examples/src/main/python/ml/fpgrowth_example.py

Lines changed: 56 additions & 0 deletions (new file)

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# $example on$
from pyspark.ml.fpm import FPGrowth
# $example off$
from pyspark.sql import SparkSession

"""
An example demonstrating FPGrowth.
Run with:
  bin/spark-submit examples/src/main/python/ml/fpgrowth_example.py
"""

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("FPGrowthExample")\
        .getOrCreate()

    # $example on$
    df = spark.createDataFrame([
        (0, [1, 2, 5]),
        (1, [1, 2, 3, 5]),
        (2, [1, 2])
    ], ["id", "items"])

    fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)

    # Display frequent itemsets.
    model.freqItemsets.show()

    # Display generated association rules.
    model.associationRules.show()

    # transform examines the input items against all the association rules and summarizes the
    # consequents as the prediction.
    model.transform(df).show()
    # $example off$

    spark.stop()
examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala

Lines changed: 67 additions & 0 deletions (new file)

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.ml

// scalastyle:off println

// $example on$
import org.apache.spark.ml.fpm.FPGrowth
// $example off$
import org.apache.spark.sql.SparkSession

/**
 * An example demonstrating FP-Growth.
 * Run with
 * {{{
 * bin/run-example ml.FPGrowthExample
 * }}}
 */
object FPGrowthExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    import spark.implicits._

    // $example on$
    val dataset = spark.createDataset(Seq(
      "1 2 5",
      "1 2 3 5",
      "1 2")
    ).map(t => t.split(" ")).toDF("items")

    val fpgrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
    val model = fpgrowth.fit(dataset)

    // Display frequent itemsets.
    model.freqItemsets.show()

    // Display generated association rules.
    model.associationRules.show()

    // transform examines the input items against all the association rules and summarizes the
    // consequents as the prediction.
    model.transform(dataset).show()
    // $example off$

    spark.stop()
  }
}
// scalastyle:on println
