-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18291][SparkR][ML] SparkR glm predict should output original label when family = binomial. #15788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #68228 has finished for PR 15788 at commit
|
|
Test build #68243 has finished for PR 15788 at commit
|
| } | ||
|
|
||
| /** | ||
| * This utility transformer converts the predicted value of GeneralizedLinearRegressionModel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps this could be reusable and should go to RWrapperUtils.scala?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an inherent feature of other classification algorithms who extends Classifier currently, only GLM use the separate conversion since GeneralizedLinearRegression extends Regressor. So it should not be reused, I'm open to move it to RWrapperUtils.scala when it's necessary. Thanks.
|
LGTM |
|
Merged into master and branch-2.1. Thanks for review. |
…abel when family = binomial.
## What changes were proposed in this pull request?
SparkR ```spark.glm``` predict should output original label when family = "binomial".
## How was this patch tested?
Add unit test.
You can also run the following code to test:
```R
training <- suppressWarnings(createDataFrame(iris))
training <- training[training$Species %in% c("versicolor", "virginica"), ]
model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = binomial(link = "logit"))
showDF(predict(model, training))
```
Before this change:
```
+------------+-----------+------------+-----------+----------+-----+-------------------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label| prediction|
+------------+-----------+------------+-----------+----------+-----+-------------------+
| 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| 0.8271421517601544|
| 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| 0.6044595910413112|
| 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| 0.7916340858281998|
| 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|0.16080518180591158|
| 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| 0.6112229217050189|
| 5.7| 2.8| 4.5| 1.3|versicolor| 0.0| 0.2555087295500885|
| 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| 0.5681507664364834|
| 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|0.05990570219972002|
| 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| 0.6644434078306246|
| 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|0.11293577405862379|
| 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|0.06152372321585971|
| 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|0.35250697207602555|
| 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|0.32267018290814303|
| 6.1| 2.9| 4.7| 1.4|versicolor| 0.0| 0.433391153814592|
| 5.6| 2.9| 3.6| 1.3|versicolor| 0.0| 0.2280744262436993|
| 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| 0.7219848389339459|
| 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|0.23527698971404695|
| 5.8| 2.7| 4.1| 1.0|versicolor| 0.0| 0.285024533520016|
| 6.2| 2.2| 4.5| 1.5|versicolor| 0.0| 0.4107047877447493|
| 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|0.20083561961645083|
+------------+-----------+------------+-----------+----------+-----+-------------------+
```
After this change:
```
+------------+-----------+------------+-----------+----------+-----+----------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label|prediction|
+------------+-----------+------------+-----------+----------+-----+----------+
| 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| virginica|
| 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| virginica|
| 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| virginica|
| 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|versicolor|
| 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| virginica|
| 5.7| 2.8| 4.5| 1.3|versicolor| 0.0|versicolor|
| 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| virginica|
| 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|versicolor|
| 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| virginica|
| 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|versicolor|
| 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|versicolor|
| 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|versicolor|
| 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|versicolor|
| 6.1| 2.9| 4.7| 1.4|versicolor| 0.0|versicolor|
| 5.6| 2.9| 3.6| 1.3|versicolor| 0.0|versicolor|
| 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| virginica|
| 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|versicolor|
| 5.8| 2.7| 4.1| 1.0|versicolor| 0.0|versicolor|
| 6.2| 2.2| 4.5| 1.5|versicolor| 0.0|versicolor|
| 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|versicolor|
+------------+-----------+------------+-----------+----------+-----+----------+
```
Author: Yanbo Liang <[email protected]>
Closes #15788 from yanboliang/spark-18291.
(cherry picked from commit daa975f)
Signed-off-by: Yanbo Liang <[email protected]>
|
@felixcheung @yanboliang + @shivaram @mengxr Just saw this change. IIRC we intentionally made spark.glm match R's glm behavior for the binomial family, though I see it was a little different: glm outputs log-space predictions (rawPrediction) by default, and spark.glm output probabilities by default before this PR. This PR makes spark.glm match Scala/Java/Python MLlib APIs instead of R libraries. We should pick a clear strategy here: match the rest of MLlib or match familiar R libraries. |
| pipeline.transform(dataset).drop(glm.getFeaturesCol) | ||
| if (rFamily == "binomial") { | ||
| pipeline.transform(dataset) | ||
| .drop(PREDICTED_LABEL_PROB_COL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to make R models incompatible with other languages when you persist them. (They already are because of having hidden Pipelines, but that is fixable.) This, however, will encode special behavior which is only respected when the model is loaded from R, not from other languages.
One option is to encode this in a SQLTransformer.
I'm also worried that these hard-coded columns names will lead to future bug reports about conflicting input column names.
It looks like this same issue appears in other PRs for R, such as [https://issues.apache.org/jira/browse/SPARK-18401]. Given the pervasiveness and that we're in QA right now, I'd recommend we not worry about it for 2.1 and delay fixing it until 2.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I totally agree the hard-coded column names issues should be fixed, and already have some ideas in my mind to improve SparkR ML wrappers(which include this). This can be placed in the plan of next release version and I will write simple design documents for reviewing.
For the ProbabilityToPrediction issue, the idea of SQLTransformer sounds good and I will try to fix it in follow-up PR. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds great--thanks!
|
@jkbradley Thanks for your comments. I fully understand and also carefully considered your concerns. I totally agree we should make Let’s look at how R data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
model <- glm(Species ~ ., data = data, family = binomial(link = "logit”))
predict(model, data, type = "response”)
f <- factor(data$Species)
prediction <- round(predict(model, data, type = "response"))
factor(prediction, labels = levels(f))The label is string and R In Spark, we use |
|
I agree that R makes it much easier to convert back to the string label. My main worries are:
One alternative would be to output both the string prediction and the probability as 2 columns, which would make the SparkR API a little closer to the other language APIs.
|
|
@jkbradley I agree that exposing probabilities is also useful to R users. Think more about this issue, I think we can add a new argument |
…abel when family = binomial.
## What changes were proposed in this pull request?
SparkR ```spark.glm``` predict should output original label when family = "binomial".
## How was this patch tested?
Add unit test.
You can also run the following code to test:
```R
training <- suppressWarnings(createDataFrame(iris))
training <- training[training$Species %in% c("versicolor", "virginica"), ]
model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = binomial(link = "logit"))
showDF(predict(model, training))
```
Before this change:
```
+------------+-----------+------------+-----------+----------+-----+-------------------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label| prediction|
+------------+-----------+------------+-----------+----------+-----+-------------------+
| 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| 0.8271421517601544|
| 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| 0.6044595910413112|
| 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| 0.7916340858281998|
| 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|0.16080518180591158|
| 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| 0.6112229217050189|
| 5.7| 2.8| 4.5| 1.3|versicolor| 0.0| 0.2555087295500885|
| 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| 0.5681507664364834|
| 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|0.05990570219972002|
| 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| 0.6644434078306246|
| 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|0.11293577405862379|
| 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|0.06152372321585971|
| 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|0.35250697207602555|
| 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|0.32267018290814303|
| 6.1| 2.9| 4.7| 1.4|versicolor| 0.0| 0.433391153814592|
| 5.6| 2.9| 3.6| 1.3|versicolor| 0.0| 0.2280744262436993|
| 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| 0.7219848389339459|
| 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|0.23527698971404695|
| 5.8| 2.7| 4.1| 1.0|versicolor| 0.0| 0.285024533520016|
| 6.2| 2.2| 4.5| 1.5|versicolor| 0.0| 0.4107047877447493|
| 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|0.20083561961645083|
+------------+-----------+------------+-----------+----------+-----+-------------------+
```
After this change:
```
+------------+-----------+------------+-----------+----------+-----+----------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label|prediction|
+------------+-----------+------------+-----------+----------+-----+----------+
| 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| virginica|
| 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| virginica|
| 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| virginica|
| 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|versicolor|
| 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| virginica|
| 5.7| 2.8| 4.5| 1.3|versicolor| 0.0|versicolor|
| 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| virginica|
| 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|versicolor|
| 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| virginica|
| 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|versicolor|
| 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|versicolor|
| 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|versicolor|
| 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|versicolor|
| 6.1| 2.9| 4.7| 1.4|versicolor| 0.0|versicolor|
| 5.6| 2.9| 3.6| 1.3|versicolor| 0.0|versicolor|
| 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| virginica|
| 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|versicolor|
| 5.8| 2.7| 4.1| 1.0|versicolor| 0.0|versicolor|
| 6.2| 2.2| 4.5| 1.5|versicolor| 0.0|versicolor|
| 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|versicolor|
+------------+-----------+------------+-----------+----------+-----+----------+
```
Author: Yanbo Liang <[email protected]>
Closes apache#15788 from yanboliang/spark-18291.
What changes were proposed in this pull request?
SparkR
spark.glmpredict should output original label when family = "binomial".How was this patch tested?
Add unit test.
You can also run the following code to test:
Before this change:
After this change: