[SPARK-12566] [ML] [WIP] GLM model family, link function support in SparkR:::glm #11549
|
Since we already have a
|
|
Test build #52534 has finished for PR 11549 at commit
|
if (solver.trim != "irls") throw new SparkException("Currently only support irls")
|
val formula = new RFormula().setFormula(value)
val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
To minimize the escaping, you can use Scala's raw strings:
"""^\s*(\w+...\s*$""".r
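For illustration, here is a minimal sketch of what the raw-string form of the pattern from the diff could look like. The object name `FamilySpecRegex` and the helper `parse` are hypothetical and are not part of this PR; only the pattern itself comes from the diff.

```scala
object FamilySpecRegex {
  // Raw-string version of the pattern from the diff: triple quotes mean no
  // doubled backslashes and no escaped double quotes.
  val FamilySpec = """^\s*(\w+)\s*(\(\s*link\s*=\s*"(\w+)"\s*\))?\s*$""".r

  // Hypothetical helper (not in the PR): returns (family, optional link),
  // or None when the input does not match the pattern.
  def parse(spec: String): Option[(String, Option[String])] = spec match {
    // Group 2 is the whole optional "(link = ...)" part; group 3 is the link
    // name, which is null when the optional part is absent.
    case FamilySpec(family, _, link) => Some((family, Option(link)))
    case _ => None
  }
}
```

With this sketch, `FamilySpecRegex.parse("gaussian")` yields a family with no link, while an input such as `binomial(link = "logit")` yields both names.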
Do not use regex here. Extract the names on R side:
> b <- binomial(link = "logit")
> b$family
[1] "binomial"
> b$link
[1] "logit"
@hhbyyh thanks! I just have some small comments; my main comment being in the jira ticket regarding the choice of options 1/2/3. |
|
@yanboliang @hhbyyh Let us do the summary statistics under another JIRA: https://issues.apache.org/jira/browse/SPARK-13925 |
R/pkg/R/mllib.R
Outdated
setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFrame"),
function(formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0,
standardize = TRUE, solver = "auto") {
function(formula, family = c("gaussian", "binomial", "poisson", "gamma"), data,
We should match R's signature for family now. We can support the family/link functions that R supports. If the user inputs binomial("logit"), we can extract the family name and the link name before we call the Scala implementation.
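As a rough sketch of the receiving side of that idea, the Scala code could accept an already-extracted family name plus an optional link name and fall back to the family's canonical link when none is given. The object `FamilyLink` and the method `resolve` are hypothetical names used only for illustration, and the sketch does not check that a given link is valid for the family.

```scala
object FamilyLink {
  // Canonical link for each family listed in the proposed signature.
  val canonicalLink: Map[String, String] = Map(
    "gaussian" -> "identity",
    "binomial" -> "logit",
    "poisson"  -> "log",
    "gamma"    -> "inverse"
  )

  // Hypothetical helper (not in the PR): normalise a (family, optional link)
  // pair before handing it to the Scala implementation.
  def resolve(family: String, link: Option[String] = None): (String, String) = {
    // Lower-case the name so that R's "Gamma" maps to "gamma".
    val fam = family.trim.toLowerCase
    val default = canonicalLink.getOrElse(fam,
      throw new IllegalArgumentException(s"Unsupported family: $family"))
    // An explicit link wins; otherwise use the canonical link.
    (fam, link.map(_.trim.toLowerCase).getOrElse(default))
  }
}
```

Passing the two names separately keeps string parsing out of the Scala side, which is the point of extracting them in R first.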
|
Thanks @mengxr @thunterdb @yanboliang for the review. Sent an update:
@mengxr If we move the summary statistics to another PR, it might be hard to pass the unit tests without verifying statistics. Still, we might want to first decide which option to go with. This is the link to @thunterdb's comment in JIRA: https://issues.apache.org/jira/browse/SPARK-12566?focusedCommentId=15188057&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15188057 |
|
Test build #53410 has finished for PR 11549 at commit
|
|
@hhbyyh I vote option 3 in JIRA. We already have |
#' quasi-Newton optimization method. "normal" denotes using Normal Equation as an
#' analytical solution to the linear regression problem. The default value is "auto"
#' which means that the solver algorithm is selected automatically.
#' @param solver Currently only support "irls" which is also the default solver.
The previous comment was more explicit, especially with respect to 'auto' (the default). It should mention auto and irls as the two options.
|
@hhbyyh thanks for the split. I have two small comments. Also, can you include some tests in |
|
Looks like there are breaking signature changes - should we document that? |
|
Ping! Let me know if I can help get this in for 2.0 |
|
Thanks @jkbradley. We have not been able to decide which option to go with. I think @yanboliang and @thunterdb would both like to go with option 3, yet there are more details to be decided about the mapping between family, link, solver and the glm/lm implementations. Actually, I think it may be more efficient if one of the committers takes the lead on this. This is more of a strategic decision than a code wrapper. I would not mind closing this PR. |
|
@jkbradley @hhbyyh I can work on this PR. |
|
@hhbyyh @yanboliang I just wrote out my thoughts in the JIRA, and I think they match what @yanboliang suggested above (for option 3). |
|
@jkbradley Thanks for the suggestion. |
What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-12566
This JIRA is for extending the support of MLlib's Generalized Linear Models (GLMs) to more model families and link functions in SparkR. After SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR with support for popular families and link functions.
How was this patch tested?
WIP; some manual tests so far.