[SPARK-18686][SparkR][ML] Several cleanup and improvements for spark.logit. #16117
Test build #69556 has finished for PR 16117 at commit
wangmiao1981
left a comment
In general, it looks good to me. I have some minor comments.
      new LogisticRegressionWrapper(pipeline, features, labels)
    }
  }
}
new line?
R/pkg/R/mllib.R
Outdated
#'
#' # save fitted model to input path
#' path <- "path/to/model"
#' write.ml(blr_model, path)
Since you changed blr_model -> model, it should be model here.
Good catch, updated. Thanks.
features <- callJMethod(jobj, "rFeatures")
labels <- callJMethod(jobj, "labels")
coefficients <- callJMethod(jobj, "rCoefficients")
nCol <- length(coefficients) / length(features)
Can we do the nCol calculation and set the column names on the Scala side? Then we don't have to call rFeatures and labels on the R side, which makes the logic simpler.
Yeah, we could. The reason I did it this way is that we may later add more model statistics that differ between binomial and multinomial logistic regression, so we would need to distinguish the two on the R side anyway.
Sounds good.
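For readers following this thread, here is a minimal base R sketch of the reshaping step under discussion: deriving `nCol` from the lengths of the coefficient and feature vectors, then naming the columns differently for binomial and multinomial models. The helper `reshapeCoefficients` is hypothetical and not part of SparkR, and the class-by-class (column-major) layout of the flat vector is an assumption. The numbers are copied from the `summary` output in the commit message.

```r
# Hypothetical helper (not part of SparkR): turn a flat coefficient vector
# into a named matrix, distinguishing binomial from multinomial models.
reshapeCoefficients <- function(coefficients, features, labels) {
  # One column for a binomial model, one column per class for a multinomial one.
  nCol <- length(coefficients) / length(features)
  # Assumption: the flat vector is laid out class by class (column-major).
  coefMatrix <- matrix(coefficients, ncol = nCol)
  rownames(coefMatrix) <- features
  colnames(coefMatrix) <- if (nCol == 1) "Estimate" else labels
  coefMatrix
}

features <- c("(Intercept)", "Sepal_Length", "Sepal_Width",
              "Petal_Length", "Petal_Width")

# Multinomial case: three classes, so 15 coefficients become a 5 x 3 matrix.
multinomialCoef <- reshapeCoefficients(
  c(1.514031, 0.02511006, -0.5291215, 0.03647411, 0.000236092,
    -2.609108, 0.2649821, -0.02016446, 0.1544119, 0.4195804,
    1.095077, -0.2900921, 0.549286, -0.190886, -0.4198165),
  features, c("versicolor", "virginica", "setosa"))

# Binomial case: 5 coefficients become a single "Estimate" column.
binomialCoef <- reshapeCoefficients(
  c(-6.053815, 0.2449379, 0.1648321, 0.4730718, 1.031947),
  features, c("versicolor", "virginica"))

print(multinomialCoef)
print(binomialCoef)
```

Keeping the branch on `nCol` in R is what leaves room for model statistics that differ between the two model types, which is the rationale given above.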
Test build #69638 has finished for PR 16117 at commit
R/pkg/R/mllib.R
Outdated
#' # summary of binary logistic regression
#' blr_summary <- summary(blr_model)
#' blr_fmeasure <- collect(select(blr_summary$fMeasureByThreshold, "threshold", "F-Measure"))
#' df <- suppressWarnings(createDataFrame(iris))
We try not to have suppressWarnings in examples. While a warning message is not great (we have a JIRA on columns with _...), IMO it's best not to have it in the example, since it would probably raise more questions for a casual reader.
LGTM except a minor comment on example. thanks!
Test build #69718 has finished for PR 16117 at commit
Since this would be a breaking change if we merged it after the 2.1 release, and I think it falls within the scope of QA, I'll merge it into master and branch-2.1. Thanks for all your reviews.
…logit.
## What changes were proposed in this pull request?
Several cleanups and improvements for `spark.logit`:
* `summary` should return a coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* `summary` should not return `areaUnderROC, roc, pr, ...`, since most of them are DataFrames, which are less important for R users. Moreover, these metrics ignore instance weights (setting all to 1.0), which will change in a later Spark version. To avoid introducing breaking changes, we do not expose them currently.
* SparkR test improvement: compare the training result with native R glmnet.
* Remove the argument `aggregationDepth` from `spark.logit`, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.
## How was this patch tested?
Unit tests.
The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
              versicolor   virginica    setosa
(Intercept)   1.514031     -2.609108    1.095077
Sepal_Length  0.02511006   0.2649821    -0.2900921
Sepal_Width   -0.5291215   -0.02016446  0.549286
Petal_Length  0.03647411   0.1544119    -0.190886
Petal_Width   0.000236092  0.4195804    -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
              Estimate
(Intercept)   -6.053815
Sepal_Length  0.2449379
Sepal_Width   0.1648321
Petal_Length  0.4730718
Petal_Width   1.031947
```
Author: Yanbo Liang <[email protected]>
Closes #16117 from yanboliang/spark-18686.
(cherry picked from commit 90b59d1)
Signed-off-by: Yanbo Liang <[email protected]>