[SPARK-12566] [ML] [WIP] GLM model family, link function support in SparkR:::glm #11549

hhbyyh · 2016-03-06T23:30:08Z

What changes were proposed in this pull request?

jira:https://issues.apache.org/jira/browse/SPARK-12566
This JIRA is for extending the support of MLlib's Generalized Linear Models (GLMs) to more model families and link functions in SparkR. After SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR with support of popular families and link functions.

How was this patch tested?

WIP, some manual test

hhbyyh · 2016-03-06T23:31:05Z

Since we already have a glm in SparkR which is based on LogisticRegressionModel and LinearRegressionModel. There're three ways to extend it as I understand:

Change the current glm to use GeneralizedLinearRegression. Create another lm interface for sparkR, and use LR as the model.
Keep glm R interface. and replace its implementation with GLM. This means R can not invoke LR anymore.
Keep glm R interface, and combine the implementation with both LR and GLM based on different solver parameter.
I'd prefer to use option 1. And I'm gonna send one PR(WIP) for solution 2, which can later be adjusted to 1 or 3.

SparkQA · 2016-03-07T00:17:35Z

Test build #52534 has finished for PR 11549 at commit 1cca19e.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

thunterdb · 2016-03-09T18:24:03Z

mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala

+    if (solver.trim != "irls") throw new SparkException("Currently only support irls")
+
+    val formula = new RFormula().setFormula(value)
+    val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r


in order to minimize the escaping, you can use Scala's raw strings:

"""^\s*(\w+...\s*$""".r

Do not use regex here. Extract the names on R side:

> b <- binomial(link = "logit") > b$family [1] "binomial" > b$link [1] "logit"

thunterdb · 2016-03-09T21:58:37Z

@hhbyyh thanks! I just have some small comments; my main comment being in the jira ticket regarding the choice of options 1/2/3.

yanboliang · 2016-03-16T07:12:56Z

@hhbyyh #11694 has been merged, so we can provide all R-like summary statistic for glm. Thanks!

mengxr · 2016-03-16T07:19:23Z

@yanboliang @hhbyyh Let us do the summary statistics under another JIRA: https://issues.apache.org/jira/browse/SPARK-13925

mengxr · 2016-03-16T07:23:20Z

R/pkg/R/mllib.R

 setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFrame"),
-          function(formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0,
-            standardize = TRUE, solver = "auto") {
+          function(formula, family = c("gaussian", "binomial", "poisson", "gamma"), data,


We should match R's signature for family now. We can support family/link functions that R supports. If user input binomial("logit"), we can extract the family name and the link name before we call the Scala implementation.

hhbyyh · 2016-03-17T09:00:47Z

Thanks @mengxr @thunterdb @yanboliang for the review. Sent an update:

resolve the conflict with GLMSummary.
revert the summary statistics related part.
extract family and link name in R.

@mengxr If we move the summary statistics to another PR. It might be hard to pass the ut without verifying statistics. Yet we might want to first decide which option to go.

This is the link to @thunterdb's comment in jira: https://issues.apache.org/jira/browse/SPARK-12566?focusedCommentId=15188057&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15188057

SparkQA · 2016-03-17T09:44:08Z

Test build #53410 has finished for PR 11549 at commit 8b3dd3e.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-03-17T14:35:39Z

@hhbyyh I vote option 3 in JIRA. We already have GeneralizedLinearRegression in Scala, so it's better to call this implementation from SparkR directly. Due to GeneralizedLinearRegression can not handle dataset with more than 4096 features, so it should call LinearRegression and LogisticRegression with "l-bfgs" solver currently. Actually for gaussian family, "normal" solver equals to "irls" solver, we can unify them as "irls" and deprecated "normal" solver.
I think We should support GeneralizedLinearRegression's solver can be switched between auto, irls, l-bfgs at Scala side in a separate PR. After that the SparkR::glm can directly call the Scala implementation.

thunterdb · 2016-03-21T22:14:41Z

R/pkg/R/mllib.R

-#'               quasi-Newton optimization method. "normal" denotes using Normal Equation as an
-#'               analytical solution to the linear regression problem. The default value is "auto"
-#'               which means that the solver algorithm is selected automatically.
+#' @param solver Currently only support "irls" which is also the default solver.


The previous comment was more explicit, especially with respect to 'auto' (the default). It should mention auto and irls as the two options.

thunterdb · 2016-03-21T22:27:12Z

@hhbyyh thanks for the split. I have two small comments. Also, can you include some tests in test_mllib.R? It should be very close to the manual testing you did and to the tests already in that file.

felixcheung · 2016-03-21T23:50:05Z

Looks like there are breaking signature changes - should we document that?

jkbradley · 2016-04-04T20:06:57Z

Ping! Let me know if I can help get this in for 2.0

hhbyyh · 2016-04-05T03:47:58Z

Thanks @jkbradley. We cannot decide which options to go. I think @yanboliang and @thunterdb both would like to go with option 3, yet there're more details to be decided about the mapping between family, link, solver and glm/lm implementations.

Actually I think it may be more efficient if one of the committers can take lead on this. This is more like a strategic decision rather than code wrapper. I would not mind close this PR.

yanboliang · 2016-04-05T14:56:49Z

@jkbradley @hhbyyh I can work on this PR.

jkbradley · 2016-04-06T23:53:14Z

@hhbyyh @yanboliang I just wrote out my thoughts in the JIRA, and I think they match what @yanboliang suggested above (for option 3).

hhbyyh · 2016-04-07T01:15:50Z

@jkbradley Thanks for the suggestion.
@yanboliang Please start work on this if you are interested, since it's your idea that matches the best solution.

yanboliang · 2016-04-11T08:08:29Z

@hhbyyh I sent #12294 , please feel free to comment. Thanks! Would you mind to close this PR?

hhbyyh added 2 commits March 6, 2016 15:00

change R glm to use GLM

6c93365

refine family

1cca19e

thunterdb reviewed Mar 9, 2016
View reviewed changes

mengxr reviewed Mar 16, 2016
View reviewed changes

hhbyyh added 2 commits March 16, 2016 23:18

resolve conflict

a5d203e

extract link and family name in R

8b3dd3e

thunterdb reviewed Mar 21, 2016
View reviewed changes

hhbyyh closed this Apr 12, 2016

[SPARK-12566] [ML] [WIP] GLM model family, link function support in SparkR:::glm #11549

[SPARK-12566] [ML] [WIP] GLM model family, link function support in SparkR:::glm #11549

Uh oh!

Conversation

hhbyyh commented Mar 6, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

hhbyyh commented Mar 6, 2016

Uh oh!

SparkQA commented Mar 7, 2016

Uh oh!

thunterdb Mar 9, 2016

Choose a reason for hiding this comment

Uh oh!

mengxr Mar 16, 2016

Choose a reason for hiding this comment

Uh oh!

thunterdb commented Mar 9, 2016

Uh oh!

yanboliang commented Mar 16, 2016

Uh oh!

mengxr commented Mar 16, 2016

Uh oh!

mengxr Mar 16, 2016

Choose a reason for hiding this comment

Uh oh!

hhbyyh commented Mar 17, 2016

Uh oh!

SparkQA commented Mar 17, 2016

Uh oh!

yanboliang commented Mar 17, 2016

Uh oh!

thunterdb Mar 21, 2016

Choose a reason for hiding this comment

Uh oh!

thunterdb commented Mar 21, 2016

Uh oh!

felixcheung commented Mar 21, 2016

Uh oh!

jkbradley commented Apr 4, 2016

Uh oh!

hhbyyh commented Apr 5, 2016

Uh oh!

yanboliang commented Apr 5, 2016

Uh oh!

jkbradley commented Apr 6, 2016

Uh oh!

hhbyyh commented Apr 7, 2016

Uh oh!

yanboliang commented Apr 11, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants