
Conversation

@yanboliang
Contributor

@yanboliang yanboliang commented May 10, 2016

What changes were proposed in this pull request?

SparkR 2.0 QA: New R APIs and API docs for mllib.R. Main changes include:

  • Make the spark.glm and spark.naiveBayes APIs more consistent with Spark naming conventions. Because most Spark MLlib algorithms do not override base R functions, we can make their argument names consistent with Spark MLlib rather than base R. The default values are also made consistent with MLlib (see the usage sketch after this list).

From:

setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, family = gaussian, epsilon = 1e-06, maxit = 25) {})
setMethod("spark.naiveBayes", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, laplace = 0, ...) {})

To:

setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, family = gaussian, tol = 1e-06, maxIter = 25) {})
setMethod("spark.naiveBayes", signature(data = "SparkDataFrame", formula = "formula"),
          function(data, formula, smoothing = 1.0, ...) {}) 
  • Refactor the structure of mllib.R so that developers and contributors can understand the code more easily. The functions are now grouped as follows:
spark.glm, glm, summary, predict
spark.survreg, summary, predict
spark.naiveBayes, summary, predict
spark.kmeans, summary, fitted, predict
write.ml, read.ml
  • Remove duplicated test code in test_mllib.R. Because glm internally calls spark.glm to train the model, we should not duplicate all of the spark.glm tests for glm; running the basic test case is enough to check that the API works.
  • Fix and update docs.
  • Format code.
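
Usage sketch of the renamed arguments (assuming an active SparkSession; the datasets and formulas here are only illustrative examples, not part of this PR):

# spark.glm: epsilon/maxit are renamed to tol/maxIter
irisDF <- suppressWarnings(createDataFrame(iris))
glmModel <- spark.glm(irisDF, Sepal_Width ~ Sepal_Length + Species,
                      family = "gaussian", tol = 1e-06, maxIter = 25)
summary(glmModel)

# spark.naiveBayes: laplace is renamed to smoothing (default 1.0)
titanic <- as.data.frame(Titanic)
nbDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age, smoothing = 1.0)
head(predict(nbModel, nbDF))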

How was this patch tested?

Existing unit tests.

@SparkQA

SparkQA commented May 10, 2016

Test build #58225 has finished for PR 13023 at commit 1a7700f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDataFrame"),
function(formula, family = gaussian, data, epsilon = 1e-06, maxit = 25) {
spark.glm(data, formula, family, epsilon, maxit)
})
Contributor Author

Because glm is an R-compliant function, I left its argument names consistent with native R.
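
For example (an illustrative sketch, not taken from this PR), the same model can be fit through either entry point, each with its own naming convention:

df <- suppressWarnings(createDataFrame(iris))
# base R style: glm() keeps epsilon/maxit
m1 <- glm(Sepal_Width ~ Sepal_Length, family = gaussian, data = df,
          epsilon = 1e-06, maxit = 25)
# MLlib style: spark.glm() uses tol/maxIter
m2 <- spark.glm(df, Sepal_Width ~ Sepal_Length, family = gaussian,
                tol = 1e-06, maxIter = 25)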

@yanboliang
Contributor Author

cc @mengxr

@SparkQA

SparkQA commented May 12, 2016

Test build #58484 has finished for PR 13023 at commit 9065535.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' Fit a k-means model
#'
#' Fit a k-means model, similarly to R's kmeans().
#' Fits a k-means model, similarly to R's kmeans().
Member

This seems to change a few times here - should this be "Fits a ..." or "Fit a ..."?

Member

@felixcheung felixcheung May 20, 2016

I'm referring to lines

#' Fit a k-means model
#'
#' Fit a k-means model, similarly to R's kmeans().

i.e. the first line and the third line.

For example, it shows up for glm like this http://spark.apache.org/docs/latest/api/R/glm.html
[screenshot of the rendered glm R API documentation page]

I'd think it would be rather odd if they are not consistent.

@vectorijk
Contributor

Prompted by this comment, I was wondering whether we also need to update the docs for k-means and naive Bayes in http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/sparkr.html. Maybe we can include that change in this PR.

@yanboliang
Contributor Author

@vectorijk There is a separate PR focused on updating the machine learning section of the SparkR user guide. FYI #13285. Thanks.

@shivaram
Contributor

@mengxr @yanboliang Is this PR still active? Just checking whether this is something we should track for the 2.0 release.

@mengxr
Contributor

mengxr commented Jun 20, 2016

It would be nice to get this in. @yanboliang is traveling. I can help send a PR based on this one.

asfgit pushed a commit that referenced this pull request Jun 21, 2016
…istent with MLlib

## What changes were proposed in this pull request?

This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.

Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <[email protected]>

Closes #13801 from mengxr/SPARK-15177.1.
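
As an illustrative sketch of the renamed spark.kmeans parameters listed in the commit message above (the data and formula are examples only, not part of the commit):

irisDF <- suppressWarnings(createDataFrame(iris))
# k, maxIter, and initMode now mirror MLlib (defaults: k = 2, maxIter = 20, initMode = "k-means||")
kmModel <- spark.kmeans(irisDF, ~ Sepal_Length + Sepal_Width, k = 3,
                        maxIter = 20, initMode = "k-means||")
summary(kmModel)
head(fitted(kmModel))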
asfgit pushed a commit that referenced this pull request Jun 21, 2016
…istent with MLlib

(Same commit message as above.)

(cherry picked from commit 4f83ca1)
Signed-off-by: Xiangrui Meng <[email protected]>
@mengxr
Contributor

mengxr commented Jun 21, 2016

@yanboliang We are going to split the work into multiple PRs (SPARK-16090). Do you mind closing this PR for now? Thanks!

@yanboliang yanboliang closed this Jun 25, 2016
@yanboliang yanboliang deleted the spark-15177 branch June 25, 2016 02:42