u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>
</div>

## Kernel density estimation

[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
useful for visualizing empirical probability distributions without requiring assumptions about the
particular distribution that the observed samples are drawn from. It computes an estimate of the
probability density function of a random variable, evaluated at a given set of points. It achieves
this estimate by expressing the PDF of the empirical distribution at a particular point as the
mean of the PDFs of normal distributions centered around each of the samples.
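
Concretely, using standard kernel density estimation notation (the symbols below are not from the API, just the usual textbook form): given samples `\(x_1, \ldots, x_n\)` and a bandwidth `\(h\)` (the standard deviation of each Gaussian kernel), the estimate at a point `\(x\)` is

`\[
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h\sqrt{2\pi}}
\exp\left( -\frac{(x - x_i)^2}{2h^2} \right)
\]`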

<div class="codetabs">

<div data-lang="scala" markdown="1">
[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight scala %}
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight java %}
import org.apache.spark.mllib.stat.KernelDensity;
import org.apache.spark.rdd.RDD;

RDD<Double> data = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
KernelDensity kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0);

// Find density estimates for the given values
double[] densities = kd.estimate(new double[]{-1.0, 2.0, 5.0});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

{% highlight python %}
from pyspark.mllib.stat import KernelDensity

data = ... # an RDD of sample data

# Construct the density estimator with the sample data and a standard deviation for the Gaussian
# kernels
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3.0)

# Find density estimates for the given values
densities = kd.estimate([-1.0, 2.0, 5.0])
{% endhighlight %}
</div>

</div>
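
To make the computation concrete, here is a minimal pure-Python sketch of the same Gaussian-kernel estimate (an illustration only, with made-up sample values; MLlib's `KernelDensity` performs this computation distributed over an RDD):

```python
import math

def gaussian_kde(sample, bandwidth, points):
    """Mean of Gaussian PDFs (std = bandwidth) centered at each sample value."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2)) / norm for s in sample)
        / len(sample)
        for x in points
    ]

# Hypothetical sample data, same bandwidth and evaluation points as the examples above
densities = gaussian_kde([1.0, 2.0, 4.0], 3.0, [-1.0, 2.0, 5.0])
print(densities)  # highest near the bulk of the samples, approx [0.073, 0.122, 0.087]
```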