Skip to content

Conversation

@wangmiao1981
Copy link
Contributor

What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

How was this patch tested?

Unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed unused import in this PR, because this one line change is not encouraged as a separate PR.

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68737 has finished for PR 15910 at commit 7177dd3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

cc @yanboliang

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68739 has finished for PR 15910 at commit 575eeda.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68741 has finished for PR 15910 at commit 575eeda.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

the failure occurs in kafka-streaming.

retest this please.

@wangmiao1981
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68751 has started for PR 15910 at commit 575eeda.

@yanboliang
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68781 has finished for PR 15910 at commit 575eeda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be great to align the tolerance parameter with indentation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how reliable is this test? the order of rows is not guaranteed unless it is enforced by a sort or something, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, the order is not guaranteed. However, we did similar work from the first test case of mllib.R, but never had a problem until now. I'd like to enforce the tests here and other places, but may be in a separate work should be better since it involves lots of other tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, separate JIRA then. If tests haven't been failing perhaps it is not huge problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to create follow-up jira for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

off-topic, but I think it's a bug. We should not allow users pass fitIntercept to control whether to fit intercept, this should be handled by formula. For example, if users specify formula y ~ a + b + c - 1, then the model should be fitted w/o intercept. Could you please fix this bug as well? Thanks.

Copy link
Contributor Author

@wangmiao1981 wangmiao1981 Nov 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Fix it in this PR

R/pkg/R/mllib.R Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we name it as features.

@wangmiao1981
Copy link
Contributor Author

I am on travel now. I will address the comments asap. Thanks!

@SparkQA
Copy link

SparkQA commented Nov 28, 2016

Test build #69255 has finished for PR 15910 at commit 57cf430.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 29, 2016

Test build #69268 has finished for PR 15910 at commit de74906.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

@yanboliang @felixcheung I am back from vacation and made changes according to your comments.

Thanks!

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

R/pkg/R/mllib.R Outdated
#' features2 <- c(2.941319, 2.614812, 2.162451, 3.339474, 2.970987)
#' features3 <- c(1.322733, 1.348044, 3.861237, 9.686976, 3.447130)
#' features4 <- c(1.3246388, 0.5510444, 0.9225810, 1.2147881, 1.6020842)
#' data <- as.data.frame(cbind(label, features1, features2, features3, features4))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Actually you should not change it, usually the whole feature column were called as features.

@felixcheung
Copy link
Member

LGTM. we need to get this in branch-2.1 because of the signature change

@SparkQA
Copy link

SparkQA commented Nov 30, 2016

Test build #69413 has finished for PR 15910 at commit b1f7b23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

Merged into master and branch-2.1. Thanks.

asfgit pushed a commit that referenced this pull request Dec 1, 2016
…pport output original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: [email protected] <[email protected]>

Closes #15910 from wangmiao1981/audit.

(cherry picked from commit 2eb6764)
Signed-off-by: Yanbo Liang <[email protected]>
@asfgit asfgit closed this in 2eb6764 Dec 1, 2016
@yanboliang
Copy link
Contributor

I found the summary of spark.logit return incorrect result when reviewing this PR. Actually it should return coefficients rather than binary logistic regression summary that R users may not be interested. Meanwhile, the binary logistic regression summary will ignore weightCol and treats all instance weights as 1.0, we are discussing to resolve this issue at Scala side. I will send a PR to correct summary return value for spark.logit and hope it can catch 2.1. Thanks.

@wangmiao1981
Copy link
Contributor Author

The summary returns the same as scala side summary, including roc, areaUnderROC, pr, fMeasureByThreshold etc. I think we can add coefficients as additional item.

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
…pport output original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: [email protected] <[email protected]>

Closes apache#15910 from wangmiao1981/audit.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…pport output original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: [email protected] <[email protected]>

Closes apache#15910 from wangmiao1981/audit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants