Commit c90c605

jose.cambronero authored and mengxr committed
[SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test
Added doc examples for Python.

Author: jose.cambronero

Closes #8154 from josepablocam/spark_9902.
1 parent f9d1a92 commit c90c605

1 file changed, +47 -4 lines changed
docs/mllib-statistics.md

Lines changed: 47 additions & 4 deletions
@@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra
 and interpret the hypothesis tests.

 {% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.stat.Statistics._
+import org.apache.spark.mllib.stat.Statistics

 val data: RDD[Double] = ... // an RDD of sample data

 // run a KS test for the sample versus a standard normal distribution
 val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
 println(testResult) // summary of the test including the p-value, test statistic,
-// and null hypothesis
-// if our p-value indicates significance, we can reject the null hypothesis
+// and null hypothesis
+// if our p-value indicates significance, we can reject the null hypothesis

 // perform a KS test using a cumulative distribution function of our making
 val myCDF: Double => Double = ...
 val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
 {% endhighlight %}
 </div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
+and interpret the hypothesis tests.
+
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import org.apache.spark.mllib.stat.Statistics;
+import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;
+
+JavaSparkContext jsc = ...
+JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...));
+KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);
+// summary of the test including the p-value, test statistic,
+// and null hypothesis
+// if our p-value indicates significance, we can reject the null hypothesis
+System.out.println(testResult);
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
+run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
+and interpret the hypothesis tests.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+parallelData = sc.parallelize([1.0, 2.0, ... ])
+
+# run a KS test for the sample versus a standard normal distribution
+testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
+print(testResult) # summary of the test including the p-value, test statistic,
+# and null hypothesis
+# if our p-value indicates significance, we can reject the null hypothesis
+# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with
+# a lambda to calculate the CDF is not made available in the Python API
+{% endhighlight %}
+</div>
 </div>
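
Note on what the test computes: the examples in this diff print a result containing a test statistic and a p-value, and the Python comment says a user-supplied CDF is not available in that API. The sketch below illustrates, in plain Python without Spark, what a 1-sample, 2-sided KS statistic is and how an approximate p-value can be derived from it. It is not Spark's implementation. The names `standard_normal_cdf`, `ks_statistic`, and `asymptotic_p_value` are hypothetical, the p-value uses the asymptotic Kolmogorov series (an assumption that is only reasonable for larger samples), and everything runs locally on an in-memory list rather than on an RDD.

```python
import math


def standard_normal_cdf(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf):
    """1-sample, 2-sided KS statistic: the largest gap between the
    empirical CDF of `sample` and the theoretical `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # so the gap is checked on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def asymptotic_p_value(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution:
    p = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * n * d^2)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series converges slowly here and the true value is ~1.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    # Hypothetical sample data, tested against a standard normal.
    sample = [0.2, 1.0, -0.5, 0.3, -1.2, 0.8]
    d = ks_statistic(sample, standard_normal_cdf)
    print("statistic:", d, "p-value (asymptotic):", asymptotic_p_value(d, len(sample)))

    # A custom CDF (uniform on [0, 1]) passed as a plain function.
    uniform_cdf = lambda x: min(1.0, max(0.0, x))
    print(ks_statistic([0.25, 0.75], uniform_cdf))  # 0.25
```

Because `ks_statistic` accepts any callable, it also shows one way a custom CDF could be checked locally from Python on data small enough to collect to the driver. For distributed data, the `Statistics.kolmogorovSmirnovTest` call from the diff remains the relevant API.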