[SPARK-18686][SparkR][ML] Several cleanup and improvements for spark.logit. #16117

yanboliang · 2016-12-02T08:21:44Z

What changes were proposed in this pull request?

Several cleanup and improvements for spark.logit:

* summary should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* summary should not return areaUnderROC, roc, pr, ..., since most of them are DataFrames, which are less important for R users. Moreover, these metrics currently ignore instance weights (treating all of them as 1.0), which will change in a later Spark version. To avoid introducing breaking changes at that point, we do not expose them for now.
* SparkR test improvement: compare the training result with native R glmnet.
* Remove the argument aggregationDepth from spark.logit, since it is an expert Param (related to Spark architecture and job execution) that R users would rarely use.

How was this patch tested?

Unit tests.

The summary output after this change:
multinomial logistic regression:

> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165

binomial logistic regression:

> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947

SparkQA · 2016-12-02T09:23:26Z

Test build #69556 has finished for PR 16117 at commit e78adcd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-12-02T09:29:27Z

cc @wangmiao1981 @felixcheung @jkbradley

wangmiao1981

In general, it looks good to me. I have some minor comments.

wangmiao1981 · 2016-12-02T18:22:43Z

mllib/src/main/scala/org/apache/spark/ml/r/LogisticRegressionWrapper.scala

+      new LogisticRegressionWrapper(pipeline, features, labels)
    }
  }
 }


wangmiao1981 · 2016-12-02T18:29:56Z

R/pkg/R/mllib.R

+#'
 #' # save fitted model to input path
 #' path <- "path/to/model"
 #' write.ml(blr_model, path)


Since you changed blr_model -> model, it should be model here.

Good catch, updated. Thanks.

wangmiao1981 · 2016-12-02T18:33:16Z

R/pkg/R/mllib.R

+            features <- callJMethod(jobj, "rFeatures")
+            labels <- callJMethod(jobj, "labels")
+            coefficients <- callJMethod(jobj, "rCoefficients")
+            nCol <- length(coefficients) / length(features)


Can we do the nCol calculation and set the column names on the Scala side? Then we don't have to call rFeatures and labels on the R side, which makes the logic simpler.

Yeah, we could. The reason I did it this way is that we may later add more model statistics that differ between binomial and multinomial logistic regression, so we need to distinguish them on the R side anyway.

Sounds good.
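For readers following this thread, the R-side reshaping under discussion can be sketched in base R as below. This is a minimal illustration, not the code merged in the PR: `shape_coefficients` is a hypothetical helper, the input vectors stand in for what `callJMethod` returns in the diff, and the column-by-column layout of the flat coefficient vector is an assumption.

```r
# Hypothetical helper: turn a flat coefficient vector into a labelled matrix.
# Assumes the flat vector is laid out column by column (R's default fill order).
shape_coefficients <- function(features, labels, coefficients) {
  # One column per class for multinomial models, a single column for binomial.
  nCol <- length(coefficients) / length(features)
  m <- matrix(unlist(coefficients), ncol = nCol)
  if (nCol == 1) {
    # Binomial case: a single column of estimates.
    colnames(m) <- "Estimate"
  } else {
    # Multinomial case: one column per class label.
    colnames(m) <- unlist(labels)
  }
  rownames(m) <- unlist(features)
  m
}

# Made-up inputs standing in for the values fetched from the JVM wrapper.
features <- c("(Intercept)", "Sepal_Length", "Sepal_Width")
labels <- c("versicolor", "virginica", "setosa")
coefficients <- c(1.5, 0.03, -0.53,   # versicolor column
                  -2.6, 0.26, -0.02,  # virginica column
                  1.1, -0.29, 0.55)   # setosa column

multinomial <- shape_coefficients(features, labels, coefficients)
binomial <- shape_coefficients(features, c("versicolor", "virginica"),
                               c(-6.05, 0.24, 0.16))
```

Keeping the binomial/multinomial branch on the R side, as the author argues, leaves one place to attach statistics that exist for only one of the two model types.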

SparkQA · 2016-12-04T13:49:30Z

Test build #69638 has finished for PR 16117 at commit 6fc9dec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-12-04T19:12:45Z

R/pkg/R/mllib.R

-#' # summary of binary logistic regression
-#' blr_summary <- summary(blr_model)
-#' blr_fmeasure <- collect(select(blr_summary$fMeasureByThreshold, "threshold", "F-Measure"))
+#' df <- suppressWarnings(createDataFrame(iris))


We have tried not to have suppressWarnings in examples. While a warning message is not great (we have a JIRA on columns with _...), IMO it's best not to have it in the example, since it would probably raise more questions for a casual reader.
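As a side note on the comment above, one way an example could avoid the warning without calling suppressWarnings is to rename the columns locally first. This is only a sketch: it assumes the warning comes from SparkR replacing "." with "_" in column names (as the `Sepal_Length`-style names in the summary output suggest), and it is not what this PR did. `local_iris` is a hypothetical name.

```r
# Copy the built-in iris data and replace "." with "_" in the column names
# before handing the data to SparkR (assumed source of the warning).
local_iris <- iris
names(local_iris) <- gsub(".", "_", names(local_iris), fixed = TRUE)

# With SparkR attached and a session running, the example would then be:
# df <- createDataFrame(local_iris)
```

The trade-off is an extra line in every example, which is presumably why the final commit simply dropped suppressWarnings instead.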

felixcheung · 2016-12-04T19:16:33Z

LGTM except a minor comment on example. thanks!

SparkQA · 2016-12-06T09:29:14Z

Test build #69718 has finished for PR 16117 at commit 914c9ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-12-07T08:31:09Z

Since this would involve a breaking change if we merged it after the 2.1 release, and I think it actually falls within the scope of QA, I'll merge it into master and branch-2.1. Thanks for all your reviews.

…logit.

## What changes were proposed in this pull request?

Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users.

## How was this patch tested?

Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang <[email protected]>

Closes #16117 from yanboliang/spark-18686.

(cherry picked from commit 90b59d1)
Signed-off-by: Yanbo Liang <[email protected]>

Several cleanup and improvements for spark.logit.

e78adcd

wangmiao1981 reviewed Dec 2, 2016

Update docs

6fc9dec

felixcheung reviewed Dec 4, 2016

Remove suppressWarnings in example.

914c9ac

asfgit closed this in 90b59d1 Dec 7, 2016

yanboliang deleted the spark-18686 branch December 7, 2016 08:34
