[SPARK-18325][SparkR][ML] SparkR ML wrappers example code and user guide #16148
Changes from all commits
c0592b2
11a280e
b8a42f2
747ca68
02d5ac9
ac89b1c
65bd462
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API documentation](api/py | |
| {% include_example python/ml/logistic_regression_with_elastic_net.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [R API documentation](api/R/spark.logit.html). | ||
|
|
||
| {% include_example binomial r/ml/logit.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
| The `spark.ml` implementation of logistic regression also supports | ||
|
|
@@ -165,6 +172,13 @@ model with elastic net regularization. | |
| {% include_example python/ml/multiclass_logistic_regression_with_elastic_net.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [R API documentation](api/R/spark.logit.html). | ||
|
|
||
| {% include_example multinomial r/ml/logit.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
|
|
@@ -236,6 +250,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat | |
|
|
||
| {% include_example python/ml/random_forest_classifier_example.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| Refer to the [R API docs](api/R/spark.randomForest.html) for more details. | ||
|
|
||
| {% include_example classification r/ml/randomForest.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
| ## Gradient-boosted tree classifier | ||
|
|
@@ -269,6 +291,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat | |
|
|
||
| {% include_example python/ml/gradient_boosted_tree_classifier_example.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| Refer to the [R API docs](api/R/spark.gbt.html) for more details. | ||
|
|
||
| {% include_example classification r/ml/gbt.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
| ## Multilayer perceptron classifier | ||
|
|
@@ -318,6 +348,13 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat | |
| {% include_example python/ml/multilayer_perceptron_classification.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| Refer to the [R API docs](api/R/spark.mlp.html) for more details. | ||
|
|
||
| {% include_example r/ml/mlp.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
|
|
@@ -394,7 +431,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat | |
|
|
||
| Refer to the [R API docs](api/R/spark.naiveBayes.html) for more details. | ||
|
|
||
| {% include_example naiveBayes r/ml.R %} | ||
| {% include_example r/ml/naiveBayes.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
@@ -578,7 +615,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression. | |
|
|
||
| Refer to the [R API docs](api/R/spark.glm.html) for more details. | ||
|
|
||
| {% include_example glm r/ml.R %} | ||
| {% include_example r/ml/glm.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
@@ -650,6 +687,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression. | |
|
|
||
| {% include_example python/ml/random_forest_regressor_example.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| Refer to the [R API docs](api/R/spark.randomForest.html) for more details. | ||
|
|
||
| {% include_example regression r/ml/randomForest.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
| ## Gradient-boosted tree regression | ||
|
|
@@ -683,6 +728,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression. | |
|
|
||
| {% include_example python/ml/gradient_boosted_tree_regressor_example.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| Refer to the [R API docs](api/R/spark.gbt.html) for more details. | ||
|
|
||
| {% include_example regression r/ml/gbt.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
|
|
@@ -774,7 +827,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression. | |
|
|
||
| Refer to the [R API docs](api/R/spark.survreg.html) for more details. | ||
|
|
||
| {% include_example survreg r/ml.R %} | ||
| {% include_example r/ml/survreg.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
@@ -847,6 +900,14 @@ Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.ml.html#pyspa | |
|
|
||
| {% include_example python/ml/isotonic_regression_example.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
|
|
||
| Refer to the [`IsotonicRegression` R API docs](api/R/spark.isoreg.html) for more details on the API. | ||
|
Contributor
Delete "on the API" for consistency?
Contributor
Author
This is consistent with L834, L840 and L846. |
||
|
|
||
| {% include_example r/ml/isoreg.R %} | ||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
| # Linear methods | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -512,39 +512,33 @@ head(teenagers) | |
|
|
||
| # Machine Learning | ||
|
|
||
| SparkR supports the following machine learning algorithms currently: `Generalized Linear Model`, `Accelerated Failure Time (AFT) Survival Regression Model`, `Naive Bayes Model` and `KMeans Model`. | ||
| Under the hood, SparkR uses MLlib to train the model. | ||
| Users can call `summary` to print a summary of the fitted model, [predict](api/R/predict.html) to make predictions on new data, and [write.ml](api/R/write.ml.html)/[read.ml](api/R/read.ml.html) to save/load fitted models. | ||
| SparkR supports a subset of the available R formula operators for model fitting, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘. | ||
|
|
||
| ## Algorithms | ||
|
|
||
| ### Generalized Linear Model | ||
|
|
||
| [spark.glm()](api/R/spark.glm.html) or [glm()](api/R/glm.html) fits generalized linear model against a Spark DataFrame. | ||
| Currently "gaussian", "binomial", "poisson" and "gamma" families are supported. | ||
| {% include_example glm r/ml.R %} | ||
|
|
||
| ### Accelerated Failure Time (AFT) Survival Regression Model | ||
|
|
||
| [spark.survreg()](api/R/spark.survreg.html) fits an accelerated failure time (AFT) survival regression model on a SparkDataFrame. | ||
| Note that the formula of [spark.survreg()](api/R/spark.survreg.html) does not support operator '.' currently. | ||
| {% include_example survreg r/ml.R %} | ||
|
|
||
| ### Naive Bayes Model | ||
|
|
||
| [spark.naiveBayes()](api/R/spark.naiveBayes.html) fits a Bernoulli naive Bayes model against a SparkDataFrame. Only categorical data is supported. | ||
| {% include_example naiveBayes r/ml.R %} | ||
|
|
||
| ### KMeans Model | ||
| SparkR supports the following machine learning algorithms currently: | ||
|
|
||
| * [`spark.glm`](api/R/spark.glm.html) or [`glm`](api/R/glm.html): [`Generalized Linear Model`](ml-classification-regression.html#generalized-linear-regression) | ||
| * [`spark.survreg`](api/R/spark.survreg.html): [`Accelerated Failure Time (AFT) Survival Regression Model`](ml-classification-regression.html#survival-regression) | ||
| * [`spark.naiveBayes`](api/R/spark.naiveBayes.html): [`Naive Bayes Model`](ml-classification-regression.html#naive-bayes) | ||
| * [`spark.kmeans`](api/R/spark.kmeans.html): [`K-Means Model`](ml-clustering.html#k-means) | ||
| * [`spark.logit`](api/R/spark.logit.html): [`Logistic Regression Model`](ml-classification-regression.html#logistic-regression) | ||
| * [`spark.isoreg`](api/R/spark.isoreg.html): [`Isotonic Regression Model`](ml-classification-regression.html#isotonic-regression) | ||
| * [`spark.gaussianMixture`](api/R/spark.gaussianMixture.html): [`Gaussian Mixture Model`](ml-clustering.html#gaussian-mixture-model-gmm) | ||
|
Member
Looks like we would be missing out on some R-specific things with this deletion?
Contributor
Author
These descriptions can be found in the SparkR API doc. I would prefer to link the algorithms listed here to the corresponding R API docs and MLlib user guide sections rather than duplicating them here.
Member
OK, generally I'd agree. I think we should have more information on this, though, since the SparkR API doc is still kind of thin. Perhaps this should be part of the R content in the ML programming guide instead? |
||
| * [`spark.lda`](api/R/spark.lda.html): [`Latent Dirichlet Allocation (LDA) Model`](ml-clustering.html#latent-dirichlet-allocation-lda) | ||
| * [`spark.mlp`](api/R/spark.mlp.html): [`Multilayer Perceptron Classification Model`](ml-classification-regression.html#multilayer-perceptron-classifier) | ||
| * [`spark.gbt`](api/R/spark.gbt.html): `Gradient Boosted Tree Model for` [`Regression`](ml-classification-regression.html#gradient-boosted-tree-regression) `and` [`Classification`](ml-classification-regression.html#gradient-boosted-tree-classifier) | ||
| * [`spark.randomForest`](api/R/spark.randomForest.html): `Random Forest Model for` [`Regression`](ml-classification-regression.html#random-forest-regression) `and` [`Classification`](ml-classification-regression.html#random-forest-classifier) | ||
| * [`spark.als`](api/R/spark.als.html): [`Alternating Least Squares (ALS) matrix factorization Model`](ml-collaborative-filtering.html#collaborative-filtering) | ||
| * [`spark.kstest`](api/R/spark.kstest.html): `Kolmogorov-Smirnov Test` | ||
|
Member
Another piece of R-specific info that would be deleted?
Contributor
Author
Ditto, this can be found in the R API doc. |
||
|
|
||
| Under the hood, SparkR uses MLlib to train the model. Please refer to the corresponding section of MLlib user guide for example code. | ||
| Users can call `summary` to print a summary of the fitted model, [predict](api/R/predict.html) to make predictions on new data, and [write.ml](api/R/write.ml.html)/[read.ml](api/R/read.ml.html) to save/load fitted models. | ||
| SparkR supports a subset of the available R formula operators for model fitting, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘. | ||
|
|
||
| [spark.kmeans()](api/R/spark.kmeans.html) fits a k-means clustering model against a Spark DataFrame, similarly to R's kmeans(). | ||
| {% include_example kmeans r/ml.R %} | ||
|
|
||
| ## Model persistence | ||
|
|
||
| The following example shows how to save/load a MLlib model by SparkR. | ||
| {% include_example read_write r/ml.R %} | ||
| {% include_example read_write r/ml/ml.R %} | ||
|
|
||
| # R Function Name Conflicts | ||
|
|
||
|
|
||
Change to "Refer to the [R API docs]... for more details"? For consistency.
Actually this is consistent with L59 and L66.