Commit f0de45c

GayathriMurali authored and mengxr committed
[SPARK-15129][R][DOC] R API changes in ML
## What changes were proposed in this pull request?

Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs

Author: GayathriMurali <[email protected]>

Closes #13285 from GayathriMurali/SPARK-15129.

(cherry picked from commit af2a4b0)
Signed-off-by: Xiangrui Meng <[email protected]>
1 parent 57feaa5 commit f0de45c

2 files changed, +21 -60 lines changed


docs/sparkr.md

Lines changed: 19 additions & 58 deletions
@@ -285,71 +285,32 @@ head(teenagers)

 # Machine Learning

-SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.
+SparkR supports the following Machine Learning algorithms.

-The [summary()](api/R/summary.html) function gives the summary of a model produced by [glm()](api/R/glm.html).
+* Generalized Linear Regression Model [spark.glm()](api/R/spark.glm.html)
+* Naive Bayes [spark.naiveBayes()](api/R/spark.naiveBayes.html)
+* KMeans [spark.kmeans()](api/R/spark.kmeans.html)
+* AFT Survival Regression [spark.survreg()](api/R/spark.survreg.html)

-* For gaussian GLM model, it returns a list with 'devianceResiduals' and 'coefficients' components. The 'devianceResiduals' gives the min/max deviance residuals of the estimation; the 'coefficients' gives the estimated coefficients and their estimated standard errors, t values and p-values. (It only available when model fitted by normal solver.)
-* For binomial GLM model, it returns a list with 'coefficients' component which gives the estimated coefficients.
+[Generalized Linear Regression](api/R/spark.glm.html) can be used to train a model from a specified family. Currently the Gaussian, Binomial, Poisson and Gamma families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.

-The examples below show the use of building gaussian GLM model and binomial GLM model using SparkR.
+The [summary()](api/R/summary.html) function gives the summary of a model produced by different algorithms listed above.
+It produces the similar result compared with R summary function.

-## Gaussian GLM model
+## Model persistence

-<div data-lang="r" markdown="1">
-{% highlight r %}
-# Create the DataFrame
-df <- createDataFrame(sqlContext, iris)
-
-# Fit a gaussian GLM model over the dataset.
-model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
-
-# Model summary are returned in a similar format to R's native glm().
-summary(model)
-##$devianceResiduals
-## Min Max
-## -1.307112 1.412532
-##
-##$coefficients
-## Estimate Std. Error t value Pr(>|t|)
-##(Intercept) 2.251393 0.3697543 6.08889 9.568102e-09
-##Sepal_Width 0.8035609 0.106339 7.556598 4.187317e-12
-##Species_versicolor 1.458743 0.1121079 13.01195 0
-##Species_virginica 1.946817 0.100015 19.46525 0
-
-# Make predictions based on the model.
-predictions <- predict(model, newData = df)
-head(select(predictions, "Sepal_Length", "prediction"))
-## Sepal_Length prediction
-##1 5.1 5.063856
-##2 4.9 4.662076
-##3 4.7 4.822788
-##4 4.6 4.742432
-##5 5.0 5.144212
-##6 5.4 5.385281
-{% endhighlight %}
-</div>
+* [write.ml](api/R/write.ml.html) allows users to save a fitted model in a given input path
+* [read.ml](api/R/read.ml.html) allows users to read/load the model which was saved using write.ml in a given path

-## Binomial GLM model
+Model persistence is supported for all Machine Learning algorithms for all families.

-<div data-lang="r" markdown="1">
-{% highlight r %}
-# Create the DataFrame
-df <- createDataFrame(sqlContext, iris)
-training <- filter(df, df$Species != "setosa")
-
-# Fit a binomial GLM model over the dataset.
-model <- glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")
-
-# Model coefficients are returned in a similar format to R's native glm().
-summary(model)
-##$coefficients
-## Estimate
-##(Intercept) -13.046005
-##Sepal_Length 1.902373
-##Sepal_Width 0.404655
-{% endhighlight %}
-</div>
+The examples below show how to build several models:
+* GLM using the Gaussian and Binomial model families
+* AFT survival regression model
+* Naive Bayes model
+* K-Means model
+
+{% include_example r/ml.R %}

 # R Function Name Conflicts
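The added documentation text lists the formula operators SparkR accepts ('~', '.', ':', '+', and '-'). As a rough illustration of what those operators mean, the sketch below uses only base R's `terms()` on the built-in `iris` data; no Spark session is involved. It assumes SparkR follows R's formula semantics for this subset. Note that SparkR renames columns such as `Sepal.Length` to `Sepal_Length` when it creates a DataFrame (as the examples in this commit show), whereas the base-R names below keep their dots. The variable names are illustrative.

```r
# Base R only: inspect how each formula operator expands. No Spark required.
data(iris)

# '~' separates the response from the predictors; '+' adds a predictor term.
plus_terms <- attr(terms(Sepal.Length ~ Sepal.Width + Petal.Length), "term.labels")

# '.' expands to every column except the response; '-' then drops a term.
dot_terms <- attr(terms(Sepal.Length ~ . - Species, data = iris), "term.labels")

# ':' adds an interaction between two predictors (without their main effects).
int_terms <- attr(terms(Sepal.Length ~ Sepal.Width:Petal.Length), "term.labels")

print(plus_terms)
print(dot_terms)
print(int_terms)
```

Printing the term labels rather than fitting a model keeps the sketch independent of any particular family or solver.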

examples/src/main/r/ml.R

Lines changed: 2 additions & 2 deletions
@@ -25,6 +25,7 @@ library(SparkR)
 sc <- sparkR.init(appName="SparkR-ML-example")
 sqlContext <- sparkRSQL.init(sc)

+# $example on$
 ############################ spark.glm and glm ##############################################

 irisDF <- suppressWarnings(createDataFrame(sqlContext, iris))
@@ -57,7 +58,6 @@ binomialPredictions <- predict(binomialGLM, binomialTestDF)
 showDF(binomialPredictions)

 ############################ spark.survreg ##############################################
-
 # Use the ovarian dataset available in R survival package
 library(survival)

@@ -121,7 +121,7 @@ gaussianGLM <- spark.glm(gaussianDF, Sepal_Length ~ Sepal_Width + Species, famil
 modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
 write.ml(gaussianGLM, modelPath)
 gaussianGLM2 <- read.ml(modelPath)
-
+# $example off$
 # Check model summary
 summary(gaussianGLM2)
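The last hunk places the `write.ml` / `read.ml` calls inside the documented example region. A self-contained sketch of that save-and-reload round trip follows. It uses the calls that appear in this commit (`sparkR.init`, `sparkRSQL.init`, `createDataFrame`, `spark.glm`, `write.ml`, `read.ml`, `summary`) plus `sparkR.stop()` for cleanup. It assumes a local Spark installation with SparkR on the library path and the API as of this commit; the application name and variable names are illustrative.

```r
library(SparkR)

# Entry points as used in examples/src/main/r/ml.R at this commit.
sc <- sparkR.init(appName = "SparkR-ML-persistence-sketch")  # illustrative app name
sqlContext <- sparkRSQL.init(sc)

# createDataFrame warns when it renames columns such as Sepal.Length to Sepal_Length.
irisDF <- suppressWarnings(createDataFrame(sqlContext, iris))

# Fit a Gaussian GLM, then save it to a temporary path.
model <- spark.glm(irisDF, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")
modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
write.ml(model, modelPath)

# Load the model back and compare the coefficient tables of both copies.
restored <- read.ml(modelPath)
origCoefs <- summary(model)$coefficients
restoredCoefs <- summary(restored)$coefficients
print(all.equal(origCoefs, restoredCoefs))

# Clean up the saved model directory and stop the Spark context.
unlink(modelPath, recursive = TRUE)
sparkR.stop()
```

Comparing the coefficient tables is one simple way to confirm that the reloaded model matches the fitted one; comparing predictions on the same DataFrame would be another.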
