[SPARK-18325][SparkR][ML] SparkR ML wrappers example code and user guide #16148
Test build #69673 has finished for PR 16148 at commit
<div data-lang="r" markdown="1">

More details on parameters can be found in the [R API documentation](api/R/spark.logit.html).
Change to "Refer to the [R API docs]... for more details"? For consistency.
Actually this is consistent with L59 and L66.
<div data-lang="r" markdown="1">

Refer to the [`IsotonicRegression` R API docs](api/R/spark.isoreg.html) for more details on the API.
Delete "on the API" for consistency?
This is consistent with L834, L840 and L846
docs/sparkr.md
Outdated
SparkR supports the following machine learning algorithms currently:

* `spark.glm` or `glm`: `Generalized Linear Model`
* `spark.survreg`: `Accelerated Failure Time (AFT) Survival Regressio Model`
Typo: Regressio -> Regression
Done.
# Prediction
predictions <- predict(model, test)
showDF(predictions)
# $example off$
No newline at end of file
newline
Done.
examples/src/main/r/ml/isoreg.R
Outdated
# Prediction
predictions <- predict(model, test)
showDF(predictions)
# $example off$
No newline at end of file
newline
Done.
examples/src/main/r/ml/lda.R
Outdated
# The log perplexity of the LDA model
logPerplexity <- spark.perplexity(model, test)
print(paste0("The upper bound bound on perplexity: ", logPerplexity))
# $example off$
No newline at end of file
newline
Done.
examples/src/main/r/ml/ml.R
Outdated
# Print the summary of each model
print(model.summaries)
Extra blank line
Done.
wangmiao1981 left a comment
Minor comments on format and typo etc.
### Generalized Linear Model

[spark.glm()](api/R/spark.glm.html) or [glm()](api/R/glm.html) fits generalized linear model against a Spark DataFrame.
Currently "gaussian", "binomial", "poisson" and "gamma" families are supported.
Looks like we would be losing some R-specific details with this deletion?
These descriptions can be found in the SparkR API doc. I'd prefer to link the algorithms listed here to the corresponding R API docs and MLlib user guide sections rather than duplicating them here.
Ok, generally I'd agree. I think we should have more information on this though, since the SparkR API doc is still kind of thin. Perhaps this should be part of the R content in the ML programming guide instead?
### Accelerated Failure Time (AFT) Survival Regression Model

[spark.survreg()](api/R/spark.survreg.html) fits an accelerated failure time (AFT) survival regression model on a SparkDataFrame.
Note that the formula of [spark.survreg()](api/R/spark.survreg.html) does not support operator '.' currently.
Another piece of R-specific info that would be deleted?
Ditto, this can be found in the R API doc.
test <- df

# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees=10)
nit: I would put spaces around `=`, i.e. `numTrees = 10` instead
ditto below
Done.
examples/src/main/r/ml/lda.R
Outdated
test <- df

# Fit a latent dirichlet allocation model with spark.lda
model <- spark.lda(training, k=10, maxIter=10)
nit: please put space, ie. k = 10, maxIter = 10
Done.
this is great, thanks! btw, how are these examples getting run? is there a way to know if the examples are broken because of API changes?
yanboliang left a comment
@wangmiao1981 @felixcheung I addressed your comments. Thanks for reviewing.
@felixcheung That's a good question. We won't run all MLlib examples for testing until SPARK-12347 is resolved.
hmm, yea, I suspect these examples can be easily forgotten while changes are made - they could end up getting out of sync easily
Yeah, I have the same concerns, so we should push forward SPARK-12347 and it should not be very hard to do that.
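Until something like SPARK-12347 lands, the concern above (examples silently breaking after API changes) could be approximated by a small smoke test that runs every example script and records which ones fail. The sketch below is only an illustration: `run_examples` and `runner` are hypothetical names, not part of Spark, and the default runner simply calls base R `source()`, so it is demonstrated on two throwaway scripts rather than on the real SparkR examples.

```r
# Hypothetical smoke test: run every .R script in a directory and record
# which ones fail. `run_examples` and `runner` are illustrative names only.
run_examples <- function(dir,
                         runner = function(path) source(path, local = new.env())) {
  scripts <- sort(list.files(dir, pattern = "\\.R$", full.names = TRUE))
  results <- vapply(scripts, function(path) {
    tryCatch({
      runner(path)
      TRUE
    }, error = function(e) {
      # Report the failing script but keep going with the remaining ones
      message("FAILED: ", basename(path), ": ", conditionMessage(e))
      FALSE
    })
  }, logical(1))
  names(results) <- basename(scripts)
  results
}

# Demo with two throwaway scripts in a temporary directory
demo_dir <- tempfile("examples")
dir.create(demo_dir)
writeLines("x <- 1 + 1", file.path(demo_dir, "ok.R"))
writeLines("stop('API changed')", file.path(demo_dir, "broken.R"))

status <- run_examples(demo_dir)
print(status)
```

For the actual files under `examples/src/main/r/ml`, the runner would presumably have to launch each script through `spark-submit` and treat a non-zero exit status as a failure; that part is an assumption and is not shown here.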
Test build #69717 has finished for PR 16148 at commit
Test build #69721 has finished for PR 16148 at commit
docs/sparkr.md
Outdated
* [`spark.glm`](api/R/spark.glm.html) or [`glm`](api/R/glm.html): [`Generalized Linear Model`](ml-classification-regression.html#generalized-linear-regression)
* [`spark.survreg`](api/R/spark.survreg.html): [`Accelerated Failure Time (AFT) Survival Regression Model`](ml-classification-regression.html#survival-regression)
* [`spark.naiveBayes`](api/R/spark.naiveBayes.html): [`Naive Bayes Model`](ml-classification-regression.html#naive-bayes)
* [`spark.kmeans`](api/R/spark.kmeans.html): [`KMeans Model`](ml-clustering.html#k-means)
nit: k-Means? I think we have the - in general
Done.
train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}
might want to use this for example instead
https://github.com/apache/spark/blame/master/R/pkg/vignettes/sparkr-vignettes.Rmd#L394
You mean to make them consistent?
I mean running 2 families on glm in parallel is not a very useful or practical example. The link above, running a (potentially longer) list of costs might be a better example to include here.
Good suggestion, I'll update it. Thanks.
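The suggestion in this thread, mapping a training function over a list of parameter values instead of over two GLM families, can be sketched as follows. This is a local illustration only: base R `lapply()` stands in for SparkR's `spark.lapply()` so that the code runs without a Spark session, and the `degrees` grid and the polynomial `lm` fit are assumptions chosen to keep the example dependency-free. They are not the costs example from the linked vignette.

```r
# Local sketch: base lapply() stands in for SparkR's spark.lapply(),
# so this runs without a Spark session.
degrees <- 1:3  # illustrative parameter grid (an assumption, not from the PR)

# Fit one model per parameter value and return a small summary statistic
train <- function(degree) {
  model <- lm(Sepal.Length ~ poly(Sepal.Width, degree), data = iris)
  summary(model)$r.squared
}

r_squared <- lapply(degrees, train)
print(unlist(r_squared))

# With an active SparkR session, the same call shape would be:
#   r_squared <- spark.lapply(degrees, train)
```

The point of the pattern is that each element of the list is an independent unit of work, so a longer grid gives the cluster something worth distributing, which two families do not.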
summary(gaussianGLM2)

# Fit a generalized linear model of family "binomial" with spark.glm
binomialDF <- filter(irisDF, irisDF$Species != "setosa")
Perhaps add a comment on why we are filtering out setosa?
Done.
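A likely reason for the filter is that the binomial family models a two-class response, while `Species` in `iris` has three classes. The sketch below shows the same idea in base R rather than SparkR (an assumption made so that it runs without a Spark session): `subset()` and `glm()` stand in for SparkR's `filter()` and `spark.glm()`, and the choice of predictors is illustrative.

```r
# Base R stand-in for the SparkR snippet; no Spark session is needed.
# iris$Species has three classes, but a binomial family models two outcomes.
stopifnot(nlevels(iris$Species) == 3)

# Drop "setosa" so that only "versicolor" and "virginica" remain
binomialDF <- droplevels(subset(iris, Species != "setosa"))

# Fit a logistic regression on the remaining two classes
binomialGLM <- glm(Species ~ Sepal.Length + Sepal.Width,
                   data = binomialDF, family = binomial)
print(coef(binomialGLM))
```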
A couple more comments, LGTM otherwise. Thanks
Test build #69799 has finished for PR 16148 at commit
Test build #69842 has finished for PR 16148 at commit
LGTM. Looks like we are locked down for 2.1. Good to have with all the new examples but seems like a lot of code (example) changes?
Test build #69862 has finished for PR 16148 at commit
Yeah, 2.1 RC2 has been cut, but I think we still need to merge it into branch-2.1, since it's possible we will have a new RC. Even if RC2 passes the vote, we will have 2.1.1, 2.1.2, ... which need these examples and docs. Merged into master and branch-2.1. Thanks to all for reviewing.
## What changes were proposed in this pull request?

* Add all R examples for ML wrappers which were added during 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual examples for each algorithm, which will be convenient for users to rerun them.
* Add corresponding examples to ML user guide.
* Update ML section of SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent, however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```.

## How was this patch tested?

Run all examples manually.

Author: Yanbo Liang <[email protected]>

Closes #16148 from yanboliang/spark-18325.

(cherry picked from commit 9bf8f3c)
Signed-off-by: Yanbo Liang <[email protected]>
@felixcheung This is fine to merge since it is for docs/examples only. But in general, we should be better about adding these right away after adding the R APIs, rather than doing them all at the end of the release cycle. If a contributor adds a new API, they should follow up immediately with the docs.
Agreed. We will definitely keep that in mind, but there is certainly a lot going on in working on the API already.