@@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra
 and interpret the hypothesis tests.
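+
+For reference, the statistic computed by the 1-sample, 2-sided test is the standard
+Kolmogorov-Smirnov statistic, the largest absolute gap between the empirical CDF `$F_n$` of the
+sample and the CDF `$F$` of the reference distribution:
+
+`\[ D_n = \sup_x | F_n(x) - F(x) | \]`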
 
 {% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.stat.Statistics._
+import org.apache.spark.mllib.stat.Statistics
 
 val data: RDD[Double] = ... // an RDD of sample data
 
 // run a KS test for the sample versus a standard normal distribution
 val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
 println(testResult) // summary of the test including the p-value, test statistic,
-  // and null hypothesis
-// if our p-value indicates significance, we can reject the null hypothesis
+                    // and null hypothesis
+// if our p-value indicates significance, we can reject the null hypothesis
 
 // perform a KS test using a cumulative distribution function of our making
 val myCDF: Double => Double = ...
 val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
 {% endhighlight %}
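+
+A minimal sketch of acting on the result, assuming an illustrative significance level of 0.05
+(`pValue` is a field on the returned test result):
+
+{% highlight scala %}
+// 0.05 is an illustrative significance level, not a recommendation
+if (testResult.pValue < 0.05) {
+  // reject the null hypothesis that the sample follows the theoretical distribution
+}
+{% endhighlight %}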
 </div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
+and interpret the hypothesis tests.
+
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import org.apache.spark.mllib.stat.Statistics;
+import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;
+
+JavaSparkContext jsc = ...
+JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...));
+KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);
+// summary of the test including the p-value, test statistic,
+// and null hypothesis
+// if our p-value indicates significance, we can reject the null hypothesis
+System.out.println(testResult);
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
+run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
+and interpret the hypothesis tests.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+parallelData = sc.parallelize([1.0, 2.0, ...])
+
+# run a KS test for the sample versus a standard normal distribution
+testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
+print(testResult) # summary of the test including the p-value, test statistic,
+                  # and null hypothesis
+# if our p-value indicates significance, we can reject the null hypothesis
+# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with
+# a lambda to calculate the CDF is not made available in the Python API
+{% endhighlight %}
+</div>
 </div>
 
 
@@ -528,5 +571,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
 v = u.map(lambda x: 1.0 + 2.0 * x)
 {% endhighlight %}
 </div>
+</div>
+
+## Kernel density estimation
+
+[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
+useful for visualizing empirical probability distributions without requiring assumptions about the
+particular distribution that the observed samples are drawn from. It computes an estimate of the
+probability density function of a random variable, evaluated at a given set of points. It achieves
+this estimate by expressing the PDF of the empirical distribution at a particular point as the
+mean of PDFs of normal distributions centered around each of the samples.
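+
+Concretely, for `$n$` samples `$x_1, \ldots, x_n$` and bandwidth `$h$` (the standard deviation of
+the Gaussian kernels), the estimate at a point `$x$` is the standard Gaussian kernel density
+estimate:
+
+`\[ \hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2 \pi} h} e^{-\frac{(x - x_i)^2}{2 h^2}} \]`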
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight scala %}
+import org.apache.spark.mllib.stat.KernelDensity
+import org.apache.spark.rdd.RDD
+
+val data: RDD[Double] = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+val kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0)
+
+// Find density estimates for the given values
+val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight java %}
+import org.apache.spark.mllib.stat.KernelDensity;
+import org.apache.spark.rdd.RDD;
+
+RDD<Double> data = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+KernelDensity kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0);
+
+// Find density estimates for the given values
+double[] densities = kd.estimate(new double[]{-1.0, 2.0, 5.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight python %}
+from pyspark.mllib.stat import KernelDensity
+
+data = ... # an RDD of sample data
+
+# Construct the density estimator with the sample data and a standard deviation for the Gaussian
+# kernels
+kd = KernelDensity()
+kd.setSample(data)
+kd.setBandwidth(3.0)
+
+# Find density estimates for the given values
+densities = kd.estimate([-1.0, 2.0, 5.0])
+{% endhighlight %}
+</div>
 
 </div>