
Conversation

@yanboliang
Contributor

@yanboliang yanboliang commented May 10, 2016

What changes were proposed in this pull request?

SparkR 2.0 QA: New R APIs and API docs for mllib.R. Main changes include:

  • Make the spark.glm and spark.naiveBayes APIs more consistent with Spark naming conventions. Because most Spark MLlib algorithms do not override base R functions, we can make their argument names consistent with Spark MLlib rather than base R. The default values are also made consistent with MLlib (see the usage sketch after this list).

From:

setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, family = gaussian, epsilon = 1e-06, maxit = 25) {})
setMethod("spark.naiveBayes", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, laplace = 0, ...) {})

To:

setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, family = gaussian, tol = 1e-06, maxIter = 25) {})
setMethod("spark.naiveBayes", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, smoothing = 1.0, ...) {}) 
  • Refactor the structure of mllib.R so that developers and contributors can understand the code more easily. The functions are now grouped as follows:
spark.glm, glm, summary, predict
spark.survreg, summary, predict
spark.naiveBayes, summary, predict
spark.kmeans, summary, fitted, predict
write.ml, read.ml
  • Remove duplicated test code in test_mllib.R. Because glm internally calls spark.glm to train the model, we should not duplicate all of the spark.glm tests for glm; running the basic test case is enough to check that the API works.
  • Fix and update docs.
  • Format code.
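
Usage sketch of the renamed arguments (assuming an active SparkSession; the datasets and formulas here are only illustrative examples, not part of this PR):

# spark.glm: epsilon/maxit are renamed to tol/maxIter
irisDF <- suppressWarnings(createDataFrame(iris))
glmModel <- spark.glm(irisDF, Sepal_Width ~ Sepal_Length + Species,
                      family = "gaussian", tol = 1e-06, maxIter = 25)
summary(glmModel)

# spark.naiveBayes: laplace is renamed to smoothing (default 1.0)
titanic <- as.data.frame(Titanic)
nbDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age, smoothing = 1.0)
head(predict(nbModel, nbDF))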

How was this patch tested?

Existing unit tests.

@SparkQA

SparkQA commented May 10, 2016

Test build #58225 has finished for PR 13023 at commit 1a7700f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDataFrame"),
function(formula, family = gaussian, data, epsilon = 1e-06, maxit = 25) {
spark.glm(data, formula, family, epsilon, maxit)
})
Contributor Author

Because glm is an R-compliant function, I left its argument names consistent with native R.
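
For example (an illustrative sketch, not taken from this PR), the same model can be fit through either entry point, each with its own naming convention:

df <- suppressWarnings(createDataFrame(iris))
# base R style: glm() keeps epsilon/maxit
m1 <- glm(Sepal_Width ~ Sepal_Length, family = gaussian, data = df,
          epsilon = 1e-06, maxit = 25)
# MLlib style: spark.glm() uses tol/maxIter
m2 <- spark.glm(df, Sepal_Width ~ Sepal_Length, family = gaussian,
                tol = 1e-06, maxIter = 25)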

@yanboliang
Contributor Author

cc @mengxr

@SparkQA

SparkQA commented May 12, 2016

Test build #58484 has finished for PR 13023 at commit 9065535.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' Fit a k-means model
#'
#' Fit a k-means model, similarly to R's kmeans().
#' Fits a k-means model, similarly to R's kmeans().
Member

This seems to change a few times here - should this be "Fits a ..." or "Fit a ..."?

Member

@felixcheung felixcheung May 20, 2016

I'm referring to lines

#' Fit a k-means model
#'
#' Fit a k-means model, similarly to R's kmeans().

i.e. the first line and the third line.

For example, it shows up for glm like this http://spark.apache.org/docs/latest/api/R/glm.html
[screenshot of the rendered glm R API documentation page]

I'd think it would be rather odd if they are not consistent.

@vectorijk
Contributor

Prompted by this comment, I was wondering whether we also need to update the docs for k-means and naive Bayes in http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/sparkr.html. Maybe we can include that change in this PR.

@yanboliang
Contributor Author

@vectorijk There is a separate PR focused on updating the machine learning section of the SparkR user guide. FYI #13285. Thanks.

@shivaram
Contributor

@mengxr @yanboliang Is this PR still active? Just checking whether this is something we should track for the 2.0 release.

@mengxr
Contributor

mengxr commented Jun 20, 2016

It would be nice to get this in. @yanboliang is traveling. I can help send a PR based on this one.

asfgit pushed a commit that referenced this pull request Jun 21, 2016
…istent with MLlib

## What changes were proposed in this pull request?

This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.

Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <[email protected]>

Closes #13801 from mengxr/SPARK-15177.1.
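
As an illustrative sketch of the renamed spark.kmeans parameters listed in the commit message above (the data and formula are examples only, not part of the commit):

irisDF <- suppressWarnings(createDataFrame(iris))
# k, maxIter, and initMode now mirror MLlib (defaults: k = 2, maxIter = 20, initMode = "k-means||")
kmModel <- spark.kmeans(irisDF, ~ Sepal_Length + Sepal_Width, k = 3,
                        maxIter = 20, initMode = "k-means||")
summary(kmModel)
head(fitted(kmModel))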
asfgit pushed a commit that referenced this pull request Jun 21, 2016
…istent with MLlib

(Same commit message as above.)

(cherry picked from commit 4f83ca1)
Signed-off-by: Xiangrui Meng <[email protected]>
@mengxr
Contributor

mengxr commented Jun 21, 2016

@yanboliang We are going to split the work into multiple PRs (SPARK-16090). Do you mind closing this PR for now? Thanks!

@yanboliang yanboliang closed this Jun 25, 2016
@yanboliang yanboliang deleted the spark-15177 branch June 25, 2016 02:42