
Conversation

@yanboliang (Contributor) commented Dec 5, 2016

What changes were proposed in this pull request?

  • Add all R examples for ML wrappers which were added during 2.1 release cycle.
  • Split the whole ml.R example file into individual examples, one per algorithm, which will make it convenient for users to rerun them.
  • Add corresponding examples to ML user guide.
  • Update ML section of SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent with each other; however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R formula to specify featuresCol and labelCol.
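To illustrate that note, here is a minimal sketch (not code from this PR) of how SparkR uses an R formula in place of explicit featuresCol/labelCol arguments. It assumes a configured Spark installation; the app name is arbitrary, and column names are SparkR's underscore-renamed iris columns:

```r
library(SparkR)
# Assumes SPARK_HOME is configured; starts a local Spark session.
sparkR.session(appName = "RFormulaSketch")

# createDataFrame renames iris columns, replacing "." with "_".
df <- createDataFrame(iris)

# The formula's left-hand side is the label and the right-hand side
# lists the features -- no explicit featuresCol/labelCol is needed.
model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")
summary(model)

sparkR.session.stop()
```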

How was this patch tested?

Run all examples manually.

@SparkQA commented Dec 5, 2016

Test build #69673 has finished for PR 16148 at commit 11a280e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


<div data-lang="r" markdown="1">

More details on parameters can be found in the [R API documentation](api/R/spark.logit.html).
Contributor:

Change to "Refer to the [R API docs]... for more details"? For consistency.

Contributor Author:

Actually this is consistent with L59 and L66.


<div data-lang="r" markdown="1">

Refer to the [`IsotonicRegression` R API docs](api/R/spark.isoreg.html) for more details on the API.
Contributor:

Delete "on the API" for consistency?

Contributor Author:

This is consistent with L834, L840 and L846

docs/sparkr.md Outdated
SparkR supports the following machine learning algorithms currently:

* `spark.glm` or `glm`: `Generalized Linear Model`
* `spark.survreg`: `Accelerated Failure Time (AFT) Survival Regressio Model`
Contributor:

Typo: Regressio -> Regression

Contributor Author:

Done.

# Prediction
predictions <- predict(model, test)
showDF(predictions)
# $example off$ No newline at end of file
Contributor:

newline

Contributor Author:

Done.

# Prediction
predictions <- predict(model, test)
showDF(predictions)
# $example off$ No newline at end of file
Contributor:

newline

Contributor Author:

Done.

# The log perplexity of the LDA model
logPerplexity <- spark.perplexity(model, test)
print(paste0("The upper bound on perplexity: ", logPerplexity))
# $example off$ No newline at end of file
Contributor:

newline

Contributor Author:

Done.

# Print the summary of each model
print(model.summaries)


Contributor:

Extra blank line

Contributor Author:

Done.

@wangmiao1981 (Contributor) left a comment

Minor comments on formatting, typos, etc.

### Generalized Linear Model

[spark.glm()](api/R/spark.glm.html) or [glm()](api/R/glm.html) fits generalized linear model against a Spark DataFrame.
Currently "gaussian", "binomial", "poisson" and "gamma" families are supported.
Member:

Looks like we would be missing out on some R-specific things with this delete?

Contributor Author:

These descriptions can be found in the SparkR API doc. I'd prefer to link the algorithms listed here to the corresponding R API docs and MLlib user guide sections rather than duplicating the descriptions here.

@felixcheung (Member) Dec 7, 2016

OK, generally I'd agree. I think we should have more information on this, though, since the SparkR API doc is still kind of thin; perhaps this should be part of the R content for the ML programming guide instead?

### Accelerated Failure Time (AFT) Survival Regression Model

[spark.survreg()](api/R/spark.survreg.html) fits an accelerated failure time (AFT) survival regression model on a SparkDataFrame.
Note that the formula of [spark.survreg()](api/R/spark.survreg.html) does not support operator '.' currently.
Member:

Another piece of R-specific info that would be deleted?

Contributor Author:

Ditto, it can be found in the R API doc.

test <- df

# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees=10)
Member:

nit: I would put spaces around it, i.e. numTrees = 10 instead

Member:

ditto below

Contributor Author:

Done.

test <- df

# Fit a latent dirichlet allocation model with spark.lda
model <- spark.lda(training, k=10, maxIter=10)
Member:

nit: please put spaces, i.e. k = 10, maxIter = 10

Contributor Author:

Done.

@felixcheung (Member)

This is great, thanks! BTW, how are these examples getting run? Is there a way to know if the examples are broken because of API changes?

@yanboliang (Contributor, Author) left a comment

@wangmiao1981 @felixcheung I addressed your comments. Thanks for your reviews.


@yanboliang (Contributor, Author)

@felixcheung That's a good question. We won't run all MLlib examples for testing until SPARK-12347 is resolved.

@felixcheung (Member)

Hmm, yeah, I suspect these examples can be easily forgotten while changes are made; they could end up getting out of sync easily.

@yanboliang (Contributor, Author)

Yeah, I have the same concern, so we should push SPARK-12347 forward; it should not be very hard to do.

@SparkQA commented Dec 6, 2016

Test build #69717 has finished for PR 16148 at commit b8a42f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 6, 2016

Test build #69721 has finished for PR 16148 at commit 747ca68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/sparkr.md Outdated
* [`spark.glm`](api/R/spark.glm.html) or [`glm`](api/R/glm.html): [`Generalized Linear Model`](ml-classification-regression.html#generalized-linear-regression)
* [`spark.survreg`](api/R/spark.survreg.html): [`Accelerated Failure Time (AFT) Survival Regression Model`](ml-classification-regression.html#survival-regression)
* [`spark.naiveBayes`](api/R/spark.naiveBayes.html): [`Naive Bayes Model`](ml-classification-regression.html#naive-bayes)
* [`spark.kmeans`](api/R/spark.kmeans.html): [`KMeans Model`](ml-clustering.html#k-means)
Member:

nit: k-means? I think we use the hyphen in general.

Contributor Author:

Done.

train <- function(family) {
model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
summary(model)
}
Member:

Contributor Author:

You mean to make them consistent?

Member:

I mean that running 2 families on glm in parallel is not a very useful or practical example. As in the link above, running a (potentially longer) list of costs might be a better example to include here.

Contributor Author:

Good suggestion, I'll update it. Thanks.
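The suggested replacement might look like the following sketch: distribute training of an ordinary local-R model over a list of SVM cost values with spark.lapply. This is an illustration, not the exact code that landed; it assumes the e1071 package is installed on every worker, and the cost grid is arbitrary:

```r
library(SparkR)
sparkR.session(appName = "SparkLapplySketch")

# An illustrative grid of cost values, spaced evenly on the log scale.
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))

# Each call runs ordinary local R on a worker; e1071 must be installed there.
train <- function(cost) {
  model <- e1071::svm(Species ~ ., iris, cost = cost)
  summary(model)
}

# Distribute one training job per cost value and collect the summaries.
model.summaries <- spark.lapply(costs, train)
print(model.summaries)

sparkR.session.stop()
```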

summary(gaussianGLM2)

# Fit a generalized linear model of family "binomial" with spark.glm
binomialDF <- filter(irisDF, irisDF$Species != "setosa")
Member:

Perhaps add a comment on why we are filtering out setosa?

Contributor Author:

Done.
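The requested comment could go along these lines (a sketch, not the exact wording that landed): the binomial family models a binary response, while iris has three species, so one class is filtered out first.

```r
library(SparkR)
sparkR.session(appName = "BinomialGLMSketch")

irisDF <- createDataFrame(iris)

# The binomial family needs a binary label; iris has three species,
# so drop "setosa" to leave only "versicolor" and "virginica".
binomialDF <- filter(irisDF, irisDF$Species != "setosa")

binomialGLM <- spark.glm(binomialDF, Species ~ Sepal_Length + Sepal_Width,
                         family = "binomial")
summary(binomialGLM)

sparkR.session.stop()
```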

@felixcheung (Member)

A couple more comments; LGTM otherwise. Thanks.

@SparkQA commented Dec 7, 2016

Test build #69799 has finished for PR 16148 at commit 02d5ac9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 8, 2016

Test build #69842 has finished for PR 16148 at commit ac89b1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

LGTM. Looks like we are locked down for 2.1. Good to have all the new examples, but this seems like a lot of code (example) changes?

@SparkQA commented Dec 8, 2016

Test build #69862 has finished for PR 16148 at commit 65bd462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor, Author)

Yeah, 2.1 RC2 has been cut, but I think we still need to merge this into branch-2.1, since it's possible we'll have a new RC. Even if RC2 passes the vote, we will have 2.1.1, 2.1.2, ..., which need these examples and docs. Merged into master and branch-2.1. Thanks for all the reviews.

asfgit pushed a commit that referenced this pull request Dec 8, 2016
## What changes were proposed in this pull request?
* Add all R examples for ML wrappers which were added during 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual example for each algorithm, which will be convenient for users to rerun them.
* Add corresponding examples to ML user guide.
* Update ML section of SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent, however, SparkR examples may different from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```.

## How was this patch tested?
Run all examples manually.

Author: Yanbo Liang <[email protected]>

Closes #16148 from yanboliang/spark-18325.

(cherry picked from commit 9bf8f3c)
Signed-off-by: Yanbo Liang <[email protected]>
@asfgit asfgit closed this in 9bf8f3c Dec 8, 2016
@yanboliang yanboliang deleted the spark-18325 branch December 8, 2016 14:23
@jkbradley (Member)

@felixcheung This is fine to merge since it is for docs/examples only. But in general, we should be better about adding these right away after adding the R APIs, rather than doing them all at the end of the release cycle. If a contributor adds a new API, they should follow up immediately with the docs.

@felixcheung (Member)

felixcheung commented Dec 8, 2016 via email

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017