
Commit d8831a2

Marcelo Vanzin committed
Merge branch 'master' into SPARK-7736
2 parents cfef35d + ee093c8 commit d8831a2

16 files changed (+490, -29 lines)

docs/ml-features.md

Lines changed: 19 additions & 4 deletions
@@ -1212,7 +1212,7 @@ v_N
The example below demonstrates how to transform vectors using a transforming vector value.

<div class="codetabs">
-<div data-lang="scala">
+<div data-lang="scala" markdown="1">
{% highlight scala %}
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
@@ -1229,12 +1229,12 @@ val transformer = new ElementwiseProduct()
  .setOutputCol("transformedVector")

// Batch transform the vectors to create new column:
-val transformedData = transformer.transform(dataFrame)
+transformer.transform(dataFrame).show()

{% endhighlight %}
</div>

-<div data-lang="java">
+<div data-lang="java" markdown="1">
{% highlight java %}
import com.google.common.collect.Lists;

@@ -1267,10 +1267,25 @@ ElementwiseProduct transformer = new ElementwiseProduct()
  .setInputCol("vector")
  .setOutputCol("transformedVector");
// Batch transform the vectors to create new column:
-DataFrame transformedData = transformer.transform(dataFrame);
+transformer.transform(dataFrame).show();

{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
{% highlight python %}
from pyspark.ml.feature import ElementwiseProduct
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([1.0, 2.0, 3.0]),), (Vectors.dense([4.0, 5.0, 6.0]),)]
df = sqlContext.createDataFrame(data, ["vector"])
transformer = ElementwiseProduct(scalingVec=Vectors.dense([0.0, 1.0, 2.0]),
                                 inputCol="vector", outputCol="transformedVector")
transformer.transform(df).show()

{% endhighlight %}
</div>

</div>

## VectorAssembler

docs/mllib-frequent-pattern-mining.md

Lines changed: 96 additions & 0 deletions
@@ -96,3 +96,99 @@ for (FPGrowth.FreqItemset<String> itemset: model.freqItemsets().toJavaRDD().coll

</div>
</div>

## PrefixSpan

PrefixSpan is a sequential pattern mining algorithm described in
[Pei et al., Mining Sequential Patterns by Pattern-Growth: The
PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer
the reader to the referenced paper for a formal definition of the sequential
pattern mining problem.

MLlib's PrefixSpan implementation takes the following parameters:

* `minSupport`: the minimum support required to be considered a frequent
  sequential pattern.
* `maxPatternLength`: the maximum length of a frequent sequential
  pattern. Any frequent pattern exceeding this length will not be
  included in the results.
* `maxLocalProjDBSize`: the maximum number of items allowed in a
  prefix-projected database before local iterative processing of the
  projected database begins. This parameter should be tuned with respect
  to the size of your executors (see the sketch after this list).
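
As a rough sketch of how these parameters map onto the API, the snippet below sets all three
on the estimator. It assumes the `setMaxLocalProjDBSize` setter corresponding to the
`maxLocalProjDBSize` parameter above; the variable name and the values shown are only
placeholders to be tuned for your workload.

{% highlight scala %}
import org.apache.spark.mllib.fpm.PrefixSpan

// Configure the three parameters described above; the projected-database cap is an
// arbitrary placeholder, not a recommended setting.
val configuredPrefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setMaxLocalProjDBSize(32000000L)
{% endhighlight %}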

**Examples**

The following example illustrates PrefixSpan running on the sequences
(using the same notation as Pei et al.):

~~~
  <(12)3>
  <1(32)(12)>
  <(12)5>
  <6>
~~~

<div class="codetabs">
<div data-lang="scala" markdown="1">

[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the
PrefixSpan algorithm.
Calling `PrefixSpan.run` returns a
[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel)
that stores the frequent sequences with their frequencies.

{% highlight scala %}
import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { freqSequence =>
  println(
    freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + freqSequence.freq)
}
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

[`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
PrefixSpan algorithm.
Calling `PrefixSpan.run` returns a
[`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
that stores the frequent sequences with their frequencies.

{% highlight java %}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.mllib.fpm.PrefixSpan;
import org.apache.spark.mllib.fpm.PrefixSpanModel;

JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList(
  Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)),
  Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)),
  Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)),
  Arrays.asList(Arrays.asList(6))
), 2);
PrefixSpan prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5);
PrefixSpanModel<Integer> model = prefixSpan.run(sequences);
for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) {
  System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq());
}
{% endhighlight %}

</div>
</div>

docs/mllib-guide.md

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ This lists functionality included in `spark.mllib`, the main MLlib API.
* [Feature extraction and transformation](mllib-feature-extraction.html)
* [Frequent pattern mining](mllib-frequent-pattern-mining.html)
  * [FP-growth](mllib-frequent-pattern-mining.html#fp-growth)
+  * [PrefixSpan](mllib-frequent-pattern-mining.html#prefix-span)
* [Evaluation Metrics](mllib-evaluation-metrics.html)
* [Optimization (developer)](mllib-optimization.html)
  * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd)

docs/mllib-statistics.md

Lines changed: 124 additions & 4 deletions
@@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra
and interpret the hypothesis tests.

{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.stat.Statistics._
+import org.apache.spark.mllib.stat.Statistics

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult) // summary of the test including the p-value, test statistic,
-// and null hypothesis
-// if our p-value indicates significance, we can reject the null hypothesis
+                    // and null hypothesis
+                    // if our p-value indicates significance, we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
{% endhighlight %}
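
For illustration, one way to fill in the `myCDF` placeholder above is with an explicit
function, e.g. the CDF of a Uniform(0, 1) distribution. This is only a sketch added for
clarity; the names `uniformCDF` and `uniformTestResult` are not part of the original example.

{% highlight scala %}
// Illustrative only: test the sample against Uniform(0, 1) by supplying its CDF directly.
val uniformCDF: Double => Double = (x: Double) => math.min(1.0, math.max(0.0, x))
val uniformTestResult = Statistics.kolmogorovSmirnovTest(data, uniformCDF)
println(uniformTestResult)
{% endhighlight %}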
</div>

<div data-lang="java" markdown="1">
[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

{% highlight java %}
import java.util.Arrays;

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;

JavaSparkContext jsc = ...
JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...));
KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);
// summary of the test including the p-value, test statistic,
// and null hypothesis
// if our p-value indicates significance, we can reject the null hypothesis
System.out.println(testResult);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

{% highlight python %}
from pyspark.mllib.stat import Statistics

parallelData = sc.parallelize([1.0, 2.0, ... ])

# run a KS test for the sample versus a standard normal distribution
testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
print(testResult)  # summary of the test including the p-value, test statistic,
                   # and null hypothesis
# if our p-value indicates significance, we can reject the null hypothesis
# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with
# a lambda to calculate the CDF is not made available in the Python API
{% endhighlight %}
</div>
</div>

@@ -528,5 +571,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>
</div>

## Kernel density estimation

[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
useful for visualizing empirical probability distributions without requiring assumptions about the
particular distribution that the observed samples are drawn from. It computes an estimate of the
probability density function of a random variable, evaluated at a given set of points. It achieves
this estimate by expressing the PDF of the empirical distribution at a particular point as the
mean of PDFs of normal distributions centered around each of the samples.
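
In symbols, this is the standard Gaussian kernel density estimate implied by the description
above (the formula is added here only for clarity): for samples x_1, ..., x_n and bandwidth h
(the standard deviation of each Gaussian kernel), the estimated density at a point x is

\[
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h \sqrt{2 \pi}} \exp\left(-\frac{(x - x_i)^2}{2 h^2}\right)
\]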

<div class="codetabs">

<div data-lang="scala" markdown="1">
[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight scala %}
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight java %}
import org.apache.spark.mllib.stat.KernelDensity;
import org.apache.spark.rdd.RDD;

RDD<Double> data = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
KernelDensity kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0);

// Find density estimates for the given values
double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight python %}
from pyspark.mllib.stat import KernelDensity

data = ... # an RDD of sample data

# Construct the density estimator with the sample data and a standard deviation for the Gaussian
# kernels
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3.0)

# Find density estimates for the given values
densities = kd.estimate([-1.0, 2.0, 5.0])
{% endhighlight %}
</div>

</div>
