[SPARK-18325][SparkR][ML] SparkR ML wrappers example code and user guide #16148

yanboliang · 2016-12-05T15:08:52Z

What changes were proposed in this pull request?

Add all R examples for the ML wrappers that were added during the 2.1 release cycle.
Split the single ml.R example file into individual examples, one per algorithm, so that users can rerun each of them conveniently.
Add the corresponding examples to the ML user guide.
Update the ML section of the SparkR user guide.

Note: The MLlib Scala/Java/Python examples will be consistent with one another; however, the SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R formula to specify featuresCol and labelCol.
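
To illustrate the formula point in the note, here is a minimal sketch. It assumes a local Spark 2.1+ installation with the SparkR package on the library path; the `local[1]` master and the choice of columns are illustrative assumptions, not taken from this PR's diff. The label is named on the left of `~` and the features on the right, so no `featuresCol` or `labelCol` argument appears.

```r
library(SparkR)

# Assumption: a local Spark installation is available to SparkR.
sparkR.session(master = "local[1]")

# createDataFrame replaces "." in column names with "_"
# (Sepal.Length becomes Sepal_Length) and warns about it.
irisDF <- suppressWarnings(createDataFrame(iris))

# Left of "~" is the label; right of "~" are the features.
model <- spark.glm(irisDF, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")

# predict() appends a "prediction" column to the input columns.
predictions <- predict(model, irisDF)
showDF(select(predictions, "Sepal_Length", "prediction"), 5)
```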

How was this patch tested?

Run all examples manually.

SparkQA · 2016-12-05T15:36:38Z

Test build #69673 has finished for PR 16148 at commit 11a280e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2016-12-05T19:05:30Z

docs/ml-classification-regression.md


+<div data-lang="r" markdown="1">
+
+More details on parameters can be found in the [R API documentation](api/R/spark.logit.html).


Change to "Refer to the [R API docs]... for more details"? For consistency.

Actually this is consistent with L59 and L66.

wangmiao1981 · 2016-12-05T19:06:51Z

docs/ml-classification-regression.md

+
+<div data-lang="r" markdown="1">
+
+Refer to the [`IsotonicRegression` R API docs](api/R/spark.isoreg.html) for more details on the API.


Delete "on the API" for consistency?

This is consistent with L834, L840 and L846

wangmiao1981 · 2016-12-05T19:07:59Z

docs/sparkr.md

+SparkR supports the following machine learning algorithms currently:
+
+* `spark.glm` or `glm`: `Generalized Linear Model`
+* `spark.survreg`: `Accelerated Failure Time (AFT) Survival Regressio Model`


Typo: Regressio -> Regression

wangmiao1981 · 2016-12-05T19:11:55Z

examples/src/main/r/ml/gaussianMixture.R

+# Prediction
+predictions <- predict(model, test)
+showDF(predictions)
+# $example off$


wangmiao1981 · 2016-12-05T19:13:22Z

examples/src/main/r/ml/isoreg.R

+# Prediction
+predictions <- predict(model, test)
+showDF(predictions)
+# $example off$


wangmiao1981 · 2016-12-05T19:13:44Z

examples/src/main/r/ml/lda.R

+# The log perplexity of the LDA model
+logPerplexity <- spark.perplexity(model, test)
+print(paste0("The upper bound bound on perplexity: ", logPerplexity))
+# $example off$


wangmiao1981 · 2016-12-05T19:14:29Z

examples/src/main/r/ml/ml.R

+# Print the summary of each model
+print(model.summaries)
+
+


Extra blank line

wangmiao1981

Minor comments on formatting, typos, etc.

felixcheung · 2016-12-06T04:47:47Z

docs/sparkr.md

-### Generalized Linear Model
-
-[spark.glm()](api/R/spark.glm.html) or [glm()](api/R/glm.html) fits generalized linear model against a Spark DataFrame.
-Currently "gaussian", "binomial", "poisson" and "gamma" families are supported.


looks like we would be missing out some R specific things from this delete?

These descriptions can be found in the SparkR API doc. I'd prefer to link the algorithms listed here to the corresponding R API docs and MLlib user guide sections rather than duplicate the descriptions here.

Ok, generally I'd agree. I think we should have more information on this though, since the SparkR API doc is still kind of thin. Perhaps this should be part of the R content for the ML programming guide instead?

felixcheung · 2016-12-06T04:48:07Z

docs/sparkr.md

-### Accelerated Failure Time (AFT) Survival Regression Model
-
-[spark.survreg()](api/R/spark.survreg.html) fits an accelerated failure time (AFT) survival regression model on a SparkDataFrame.
-Note that the formula of [spark.survreg()](api/R/spark.survreg.html) does not support operator '.' currently.


another piece of R-specific info that would be deleted?

Ditto, this can be found in the R API doc.

felixcheung · 2016-12-06T04:50:50Z

examples/src/main/r/ml/randomForest.R

+test <- df
+
+# Fit a random forest classification model with spark.randomForest
+model <- spark.randomForest(training, label ~ features, "classification", numTrees=10)


nit: I would put spaces around `=`, i.e. numTrees = 10 instead

ditto below

felixcheung · 2016-12-06T04:52:21Z

examples/src/main/r/ml/lda.R

+test <- df
+
+# Fit a latent dirichlet allocation model with spark.lda
+model <- spark.lda(training, k=10, maxIter=10)


nit: please put spaces around `=`, i.e. k = 10, maxIter = 10

felixcheung · 2016-12-06T04:53:11Z

this is great, thanks! btw, how are these examples getting run? is there a way to know if the examples are broken because of API changes?

yanboliang

@wangmiao1981 @felixcheung I addressed your comments. Thanks for reviewing.

yanboliang · 2016-12-06T08:15:27Z

@felixcheung That's a good question. We don't run the MLlib examples as part of testing, and that won't happen until SPARK-12347 is resolved.

felixcheung · 2016-12-06T08:23:51Z

hmm, yea, I suspect these examples can easily be forgotten while changes are made - they could end up getting out of sync easily

yanboliang · 2016-12-06T08:30:08Z

Yeah, I have the same concerns, so we should push forward SPARK-12347 and it should not be very hard to do that.

SparkQA · 2016-12-06T08:30:38Z

Test build #69717 has finished for PR 16148 at commit b8a42f2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-06T09:50:56Z

Test build #69721 has finished for PR 16148 at commit 747ca68.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-12-07T00:10:20Z

docs/sparkr.md

+* [`spark.glm`](api/R/spark.glm.html) or [`glm`](api/R/glm.html): [`Generalized Linear Model`](ml-classification-regression.html#generalized-linear-regression)
+* [`spark.survreg`](api/R/spark.survreg.html): [`Accelerated Failure Time (AFT) Survival Regression Model`](ml-classification-regression.html#survival-regression)
+* [`spark.naiveBayes`](api/R/spark.naiveBayes.html): [`Naive Bayes Model`](ml-classification-regression.html#naive-bayes)
+* [`spark.kmeans`](api/R/spark.kmeans.html): [`KMeans Model`](ml-clustering.html#k-means)


nit: k-Means? I think we have the - in general

felixcheung · 2016-12-07T00:12:59Z

examples/src/main/r/ml/ml.R

+train <- function(family) {
+  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
+  summary(model)
+}


might want to use this for example instead
https://github.com/apache/spark/blame/master/R/pkg/vignettes/sparkr-vignettes.Rmd#L394

You mean to make them consistent?

I mean running 2 families on glm in parallel is not a very useful or practical example. The link above, running a (potentially longer) list of costs might be a better example to include here.

Good suggestion, I'll update it. Thanks.

felixcheung · 2016-12-07T00:33:52Z

examples/src/main/r/ml/glm.R

+summary(gaussianGLM2)
+
+# Fit a generalized linear model of family "binomial" with spark.glm
+binomialDF <- filter(irisDF, irisDF$Species != "setosa")


perhaps to add a comment on why we are filtering out setosa?
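
A hedged sketch of what such a comment could say. The rationale given here is an assumption rather than the author's confirmed reasoning: the "binomial" family expects a two-class label, while iris has three species, so one class is dropped before fitting. The sketch assumes a local Spark 2.1+ installation with SparkR available; the `local[1]` master and the feature columns are illustrative.

```r
library(SparkR)

# Assumption: a local Spark installation is available to SparkR.
sparkR.session(master = "local[1]")

# Column names containing "." are rewritten with "_" (with a warning).
irisDF <- suppressWarnings(createDataFrame(iris))

# The "binomial" family needs a binary label, but iris has three species,
# so "setosa" is filtered out to leave two classes.
binomialDF <- filter(irisDF, irisDF$Species != "setosa")

binomialGLM <- spark.glm(binomialDF, Species ~ Sepal_Length + Sepal_Width,
                         family = "binomial")
summary(binomialGLM)
```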

felixcheung · 2016-12-07T00:34:25Z

A couple more comments, LGTM otherwise. Thanks

SparkQA · 2016-12-07T15:47:12Z

Test build #69799 has finished for PR 16148 at commit 02d5ac9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-08T04:40:36Z

Test build #69842 has finished for PR 16148 at commit ac89b1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-12-08T07:47:46Z

LGTM. Looks like we are locked down for 2.1. Good to have with all the new examples but seems like a lot of code (example) changes?

SparkQA · 2016-12-08T13:49:55Z

Test build #69862 has finished for PR 16148 at commit 65bd462.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-12-08T14:18:20Z

Yeah, 2.1 RC2 has been cut, but I think we still need to merge this into branch-2.1, since it's possible we will have a new RC. Even if RC2 passes the vote, we will have 2.1.1, 2.1.2, ... which need these examples and docs. Merged into master and branch-2.1. Thanks, everyone, for reviewing.

## What changes were proposed in this pull request?

* Add all R examples for the ML wrappers that were added during the 2.1 release cycle.
* Split the single `ml.R` example file into individual examples, one per algorithm, so that users can rerun each of them conveniently.
* Add the corresponding examples to the ML user guide.
* Update the ML section of the SparkR user guide.

Note: The MLlib Scala/Java/Python examples will be consistent with one another; however, the SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R `formula` to specify `featuresCol` and `labelCol`.

## How was this patch tested?

Run all examples manually.

Author: Yanbo Liang <[email protected]>

Closes #16148 from yanboliang/spark-18325.

(cherry picked from commit 9bf8f3c)
Signed-off-by: Yanbo Liang <[email protected]>

jkbradley · 2016-12-08T19:51:49Z

@felixcheung This is fine to merge since it is for docs/examples only. But in general, we should be better about adding these right away after adding the R APIs, rather than doing them all at the end of the release cycle. If a contributor adds a new API, they should follow up immediately with the docs.

felixcheung · 2016-12-08T20:46:30Z

Agreed. We will definitely keep that in mind, but there is certainly a lot going on in working on the API already.

yanboliang added 2 commits December 5, 2016 06:18

Add SparkR ML wrappers example code

c0592b2

Update SparkR ML wrapper user guide

11a280e

Address comments.

b8a42f2

Update spark.gaussianMixture example.

747ca68

Update docs.

02d5ac9

Update R ML example to make it more practical

ac89b1c

spark.als reg -> regParam.

65bd462

asfgit closed this in 9bf8f3c Dec 8, 2016

yanboliang deleted the spark-18325 branch December 8, 2016 14:23

