Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Mar 6, 2016

What changes were proposed in this pull request?

jira:https://issues.apache.org/jira/browse/SPARK-12566
This JIRA is for extending the support of MLlib's Generalized Linear Models (GLMs) to more model families and link functions in SparkR. After SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR with support of popular families and link functions.

How was this patch tested?

WIP, some manual test

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 6, 2016

Since we already have a glm in SparkR which is based on LogisticRegressionModel and LinearRegressionModel. There're three ways to extend it as I understand:

  1. Change the current glm to use GeneralizedLinearRegression. Create another lm interface for sparkR, and use LR as the model.
  2. Keep glm R interface. and replace its implementation with GLM. This means R can not invoke LR anymore.
  3. Keep glm R interface, and combine the implementation with both LR and GLM based on different solver parameter.
    I'd prefer to use option 1. And I'm gonna send one PR(WIP) for solution 2, which can later be adjusted to 1 or 3.

@SparkQA
Copy link

SparkQA commented Mar 7, 2016

Test build #52534 has finished for PR 11549 at commit 1cca19e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (solver.trim != "irls") throw new SparkException("Currently only support irls")

val formula = new RFormula().setFormula(value)
val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to minimize the escaping, you can use Scala's raw strings:

"""^\s*(\w+...\s*$""".r

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use regex here. Extract the names on R side:

> b <- binomial(link = "logit")
> b$family
[1] "binomial"
> b$link
[1] "logit"

@thunterdb
Copy link
Contributor

@hhbyyh thanks! I just have some small comments; my main comment being in the jira ticket regarding the choice of options 1/2/3.

@yanboliang
Copy link
Contributor

@hhbyyh #11694 has been merged, so we can provide all R-like summary statistic for glm. Thanks!

@mengxr
Copy link
Contributor

mengxr commented Mar 16, 2016

@yanboliang @hhbyyh Let us do the summary statistics under another JIRA: https://issues.apache.org/jira/browse/SPARK-13925

R/pkg/R/mllib.R Outdated
setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFrame"),
function(formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0,
standardize = TRUE, solver = "auto") {
function(formula, family = c("gaussian", "binomial", "poisson", "gamma"), data,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should match R's signature for family now. We can support family/link functions that R supports. If user input binomial("logit"), we can extract the family name and the link name before we call the Scala implementation.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 17, 2016

Thanks @mengxr @thunterdb @yanboliang for the review. Sent an update:

  1. resolve the conflict with GLMSummary.
  2. revert the summary statistics related part.
  3. extract family and link name in R.

@mengxr If we move the summary statistics to another PR. It might be hard to pass the ut without verifying statistics. Yet we might want to first decide which option to go.

This is the link to @thunterdb's comment in jira: https://issues.apache.org/jira/browse/SPARK-12566?focusedCommentId=15188057&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15188057

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53410 has finished for PR 11549 at commit 8b3dd3e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

@hhbyyh I vote option 3 in JIRA. We already have GeneralizedLinearRegression in Scala, so it's better to call this implementation from SparkR directly. Due to GeneralizedLinearRegression can not handle dataset with more than 4096 features, so it should call LinearRegression and LogisticRegression with "l-bfgs" solver currently. Actually for gaussian family, "normal" solver equals to "irls" solver, we can unify them as "irls" and deprecated "normal" solver.
I think We should support GeneralizedLinearRegression's solver can be switched between auto, irls, l-bfgs at Scala side in a separate PR. After that the SparkR::glm can directly call the Scala implementation.

#' quasi-Newton optimization method. "normal" denotes using Normal Equation as an
#' analytical solution to the linear regression problem. The default value is "auto"
#' which means that the solver algorithm is selected automatically.
#' @param solver Currently only support "irls" which is also the default solver.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous comment was more explicit, especially with respect to 'auto' (the default). It should mention auto and irls as the two options.

@thunterdb
Copy link
Contributor

@hhbyyh thanks for the split. I have two small comments. Also, can you include some tests in test_mllib.R? It should be very close to the manual testing you did and to the tests already in that file.

@felixcheung
Copy link
Member

Looks like there are breaking signature changes - should we document that?

@jkbradley
Copy link
Member

Ping! Let me know if I can help get this in for 2.0

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Apr 5, 2016

Thanks @jkbradley. We cannot decide which options to go. I think @yanboliang and @thunterdb both would like to go with option 3, yet there're more details to be decided about the mapping between family, link, solver and glm/lm implementations.

Actually I think it may be more efficient if one of the committers can take lead on this. This is more like a strategic decision rather than code wrapper. I would not mind close this PR.

@yanboliang
Copy link
Contributor

@jkbradley @hhbyyh I can work on this PR.

@jkbradley
Copy link
Member

@hhbyyh @yanboliang I just wrote out my thoughts in the JIRA, and I think they match what @yanboliang suggested above (for option 3).

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Apr 7, 2016

@jkbradley Thanks for the suggestion.
@yanboliang Please start work on this if you are interested, since it's your idea that matches the best solution.

@yanboliang
Copy link
Contributor

@hhbyyh I sent #12294 , please feel free to comment. Thanks! Would you mind to close this PR?

@hhbyyh hhbyyh closed this Apr 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants