Commit f7f2ac6

sryza authored and mengxr committed
[SPARK-7707] User guide and example code for KernelDensity
Author: Sandy Ryza <[email protected]>
Closes #8230 from sryza/sandy-spark-7707.
(cherry picked from commit f9d1a92)
Signed-off-by: Xiangrui Meng <[email protected]>
1 parent 4fc3b8c commit f7f2ac6

1 file changed: 77 additions, 0 deletions

docs/mllib-statistics.md

@@ -493,5 +493,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>
</div>

## Kernel density estimation

[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
useful for visualizing empirical probability distributions without requiring assumptions about the
particular distribution that the observed samples are drawn from. It computes an estimate of the
probability density function of a random variable, evaluated at a given set of points. It achieves
this estimate by expressing the PDF of the empirical distribution at a particular point as the
mean of PDFs of normal distributions centered around each of the samples.
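The averaging described above can be sketched in plain Python. The `gaussian_kde` helper below is purely illustrative (it is not part of MLlib); it evaluates, at each query point, the mean of Gaussian PDFs with standard deviation `bandwidth` centered at each sample:

```python
import math

def gaussian_kde(samples, bandwidth, points):
    # Normalizing constant of a Gaussian PDF with std dev = bandwidth
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    # For each query point, average the Gaussian PDFs centered at the samples
    return [
        sum(norm * math.exp(-((x - s) ** 2) / (2.0 * bandwidth ** 2))
            for s in samples) / len(samples)
        for x in points
    ]

samples = [1.0, 1.5, 2.0, 4.0]
densities = gaussian_kde(samples, bandwidth=3.0, points=[-1.0, 2.0, 5.0])
```

Points close to the bulk of the samples receive higher density estimates than points far from them; a larger bandwidth spreads each kernel wider and smooths the estimate.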
<div class="codetabs">

<div data-lang="scala" markdown="1">
[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight scala %}
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight java %}
import org.apache.spark.mllib.stat.KernelDensity;
import org.apache.spark.rdd.RDD;

RDD<Double> data = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
KernelDensity kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0);

// Find density estimates for the given values
double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight python %}
from pyspark.mllib.stat import KernelDensity

data = ... # an RDD of sample data

# Construct the density estimator with the sample data and a standard deviation for the Gaussian
# kernels
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3.0)

# Find density estimates for the given values
densities = kd.estimate([-1.0, 2.0, 5.0])
{% endhighlight %}
</div>

</div>

