[SPARK-18476][SPARKR][ML]: SparkR Logistic Regression should support output original label. #15910

wangmiao1981 · 2016-11-17T00:35:09Z

What changes were proposed in this pull request?

Similar to SPARK-18401: as a classification algorithm, logistic regression should support outputting the original label rather than only the index label.

This PR adds support for original label output, and test cases are modified and added accordingly. The documentation is also updated.
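
To make the index label versus original label distinction concrete, here is a minimal sketch in plain Python (not SparkR or the Spark API). The helper names and the ordering rule (most frequent label first, ties broken alphabetically) are assumptions of this sketch, not something stated in the PR.

```python
from collections import Counter


def fit_label_index(labels):
    """Order distinct labels by descending frequency (ties: alphabetical).

    The ordering rule is an assumption made for this illustration.
    """
    counts = Counter(labels)
    return sorted(counts, key=lambda lab: (-counts[lab], lab))


def to_index(labels, ordered):
    """Replace each original label with its numeric index."""
    lookup = {lab: float(i) for i, lab in enumerate(ordered)}
    return [lookup[lab] for lab in labels]


def to_label(indices, ordered):
    """Map numeric indices back to the original labels."""
    return [ordered[int(i)] for i in indices]


labels = ["setosa", "versicolor", "versicolor", "setosa", "versicolor"]
ordered = fit_label_index(labels)
indexed = to_index(labels, ordered)    # what a model trained on indices sees
restored = to_label(indexed, ordered)  # what a user would want to read back
```

A model that only reports `indexed` leaves the user to work out which number stands for which class; mapping back through `ordered` is the step that exposes the original label.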

How was this patch tested?

Unit tests.

wangmiao1981 · 2016-11-17T00:36:40Z

core/src/main/scala/org/apache/spark/SparkContext.scala

Removed an unused import in this PR, because a one-line change like this is discouraged as a separate PR.

SparkQA · 2016-11-17T00:39:54Z

Test build #68737 has finished for PR 15910 at commit 7177dd3.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2016-11-17T00:41:24Z

cc @yanboliang

SparkQA · 2016-11-17T03:05:41Z

Test build #68739 has finished for PR 15910 at commit 575eeda.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2016-11-17T03:33:39Z

retest this please.

SparkQA · 2016-11-17T05:39:40Z

Test build #68741 has finished for PR 15910 at commit 575eeda.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2016-11-17T06:14:27Z

The failure occurs in kafka-streaming.

retest this please.

wangmiao1981 · 2016-11-17T06:27:57Z

retest this please

SparkQA · 2016-11-17T06:32:30Z

Test build #68751 has started for PR 15910 at commit 575eeda.

yanboliang · 2016-11-17T14:00:33Z

Jenkins, test this please

SparkQA · 2016-11-17T16:46:33Z

Test build #68781 has finished for PR 15910 at commit 575eeda.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-11-19T00:25:12Z

R/pkg/inst/tests/testthat/test_mllib.R

nit: would be great to align the tolerance parameter with indentation

felixcheung · 2016-11-19T00:26:31Z

R/pkg/inst/tests/testthat/test_mllib.R

How reliable is this test? The order of rows is not guaranteed unless it is enforced by a sort or something similar, right?

Theoretically, the order is not guaranteed. However, we have done the same thing since the first test case in test_mllib.R and have never had a problem so far. I'd like to enforce the order here and in other places, but it may be better to do that as separate work, since it touches many other tests.

Sounds good, a separate JIRA then. If the tests haven't been failing, perhaps it is not a huge problem.

I will try to create a follow-up JIRA for this.
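
The row-order concern can be illustrated with a small sketch in plain Python (not the SparkR test code). The row layout, an `(id, prediction)` tuple, and the helper name `sort_rows` are hypothetical and chosen only for this example.

```python
def sort_rows(rows, key_index=0):
    """Return rows sorted by one column so that comparisons ignore arrival order."""
    return sorted(rows, key=lambda row: row[key_index])


expected = [(1, "setosa"), (2, "versicolor"), (3, "versicolor")]
# Same rows, but returned in a different order, as an unordered collect might do.
shuffled = [(3, "versicolor"), (1, "setosa"), (2, "versicolor")]

naive_equal = shuffled == expected                         # depends on row order
stable_equal = sort_rows(shuffled) == sort_rows(expected)  # order enforced by a sort
```

Sorting both sides on a unique key before comparing makes the assertion independent of whatever order the rows happen to come back in.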

yanboliang · 2016-11-19T10:19:16Z

mllib/src/main/scala/org/apache/spark/ml/r/LogisticRegressionWrapper.scala

Off-topic, but I think it's a bug. We should not let users pass fitIntercept to control whether to fit an intercept; this should be handled by the formula. For example, if users specify the formula y ~ a + b + c - 1, then the model should be fitted without an intercept. Could you please fix this bug as well? Thanks.

OK. I will fix it in this PR.
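
As a rough illustration of letting the formula decide the intercept, the sketch below checks whether the right-hand side contains a `- 1` (or `+ 0`) term. It is plain Python with a hypothetical helper name, it handles only simple additive formulas, and it is not how the wrapper or RFormula actually parses formulas.

```python
def formula_has_intercept(formula):
    """Return False when the right-hand side drops the intercept via '- 1' or '+ 0'.

    Simplified: supports only '+' and '-' separated terms.
    """
    _, rhs = formula.split("~", 1)
    # Rewrite "a - 1" as "a +-1" so each term keeps its own sign after splitting.
    tokens = rhs.replace("-", "+-").split("+")
    terms = [t.replace(" ", "") for t in tokens if t.strip()]
    return not any(t in ("-1", "0") for t in terms)


with_intercept = formula_has_intercept("y ~ a + b + c")
without_intercept = formula_has_intercept("y ~ a + b + c - 1")
```

With this approach a separate fitIntercept argument becomes redundant, because the formula already carries the information.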

yanboliang · 2016-11-19T10:28:02Z

R/pkg/R/mllib.R

We usually name it features.

wangmiao1981 · 2016-11-22T07:06:58Z

I am on travel now. I will address the comments asap. Thanks!

SparkQA · 2016-11-28T21:53:58Z

Test build #69255 has finished for PR 15910 at commit 57cf430.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T01:53:37Z

Test build #69268 has finished for PR 15910 at commit de74906.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2016-11-29T19:57:06Z

@yanboliang @felixcheung I am back from vacation and made changes according to your comments.

Thanks!

yanboliang

LGTM

yanboliang · 2016-11-30T11:31:30Z

R/pkg/R/mllib.R

+#' features2 <- c(2.941319, 2.614812, 2.162451, 3.339474, 2.970987)
+#' features3 <- c(1.322733, 1.348044, 3.861237, 9.686976, 3.447130)
+#' features4 <- c(1.3246388, 0.5510444, 0.9225810, 1.2147881, 1.6020842)
+#' data <- as.data.frame(cbind(label, features1, features2, features3, features4))


Nit: Actually, you should not change it; usually the whole feature column is called features.

felixcheung · 2016-11-30T18:01:13Z

LGTM. We need to get this into branch-2.1 because of the signature change.

SparkQA · 2016-11-30T20:12:25Z

Test build #69413 has finished for PR 15910 at commit b1f7b23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-12-01T04:31:08Z

Merged into master and branch-2.1. Thanks.

…pport output original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: [email protected] <[email protected]>

Closes #15910 from wangmiao1981/audit.

(cherry picked from commit 2eb6764)
Signed-off-by: Yanbo Liang <[email protected]>

yanboliang · 2016-12-01T16:55:48Z

While reviewing this PR, I found that the summary of spark.logit returns an incorrect result. It should return the coefficients rather than the binary logistic regression summary, which R users may not be interested in. Meanwhile, the binary logistic regression summary ignores weightCol and treats all instance weights as 1.0; we are discussing how to resolve this issue on the Scala side. I will send a PR to correct the summary return value for spark.logit and hope it can make it into 2.1. Thanks.
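
To show why treating every instance weight as 1.0 matters, here is a small sketch in plain Python. It uses accuracy as a stand-in metric (the summary discussed above reports other metrics), and the data and function name are made up for illustration.

```python
def accuracy(labels, predictions, weights=None):
    """Weighted share of correct predictions; weights default to 1.0 each."""
    if weights is None:
        weights = [1.0] * len(labels)
    correct = sum(w for y, p, w in zip(labels, predictions, weights) if y == p)
    return correct / sum(weights)


labels = [1, 0, 1, 0]
predictions = [1, 0, 0, 0]      # the third instance is misclassified
weights = [1.0, 1.0, 6.0, 2.0]  # and it carries most of the weight

unweighted = accuracy(labels, predictions)         # weights ignored
weighted = accuracy(labels, predictions, weights)  # weights respected
```

The two numbers differ substantially here, which is the kind of discrepancy a summary that ignores weightCol can introduce for a model fitted with weights.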

wangmiao1981 · 2016-12-01T18:17:40Z

The summary returns the same items as the Scala-side summary, including roc, areaUnderROC, pr, fMeasureByThreshold, etc. I think we can add the coefficients as an additional item.


wangmiao1981 added 5 commits November 28, 2016 10:29

0ff84f0 remove unused import
9a07aa1 add label StringIndexer
9d19284 modify test case; add output label
71f7de2 fix scala style error
57cf430 address review comments

wangmiao1981 force-pushed the audit branch from 575eeda to 57cf430 on November 28, 2016 18:58

de74906 fix bug of fitintercept
b1f7b23 address review comments

asfgit closed this in 2eb6764 Dec 1, 2016
