[SPARK-24101][ML][MLLIB] ML Evaluators should use weight column - added weight column for multiclass classification evaluator #17086
Conversation
Test build #73529 has finished for PR 17086 at commit
@sethah @Lewuathe @thunterdb @WeichenXu123 @jkbradley @actuaryzhang @srowen would you be able to take a look? I've split the larger pull request into three parts as suggested.
ping @sethah @Lewuathe @thunterdb @WeichenXu123 @jkbradley @actuaryzhang @srowen could you please take a look? thank you!
Test build #78372 has started for PR 17086 at commit
cf6a5ab to 4a0debf (force-push)
jenkins, retest this please
Test build #89407 has finished for PR 17086 at commit
WeichenXu123 left a comment
Thanks! I made an initial rough review.
@Since("2.4.0")
done
This line seems useless? dataset.select(pred, label)...values.countByValues()
good catch -- hmm that shouldn't be there, not sure why I added it, removed
I prefer to define two constructors:
this(predAndLabelsWithOptWeight: RDD[(Double, Double, Double)])
this(predAndLabels: RDD[(Double, Double)])
so we get stricter type checking.
good idea, this also simplifies the calculation of the confusions, fpByClass, tpByClass and labelCountByClass
hmm, the build fails here with an error indicating the two constructors have the same type after erasure; perhaps I should revert this change:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala:35: double definition:
[error] constructor MulticlassMetrics: (predLabelsWeight: org.apache.spark.rdd.RDD[(Double, Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 33 and
[error] constructor MulticlassMetrics: (predAndLabels: org.apache.spark.rdd.RDD[(Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 35
[error] have same type after erasure: (predLabelsWeight: org.apache.spark.rdd.RDD)org.apache.spark.mllib.evaluation.MulticlassMetrics
[error] def this(predAndLabels: RDD[(Double, Double)]) =
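For context, a minimal standalone sketch (hypothetical class name, not the PR's code) of why the two overloads clash: the tuple type parameters are erased on the JVM, so both constructors end up taking a bare RDD.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical class for illustration only: the primary constructor takes weighted
// triples; un-commenting the auxiliary constructor for unweighted pairs reproduces
// the "same type after erasure" error quoted above, because both erase to (RDD).
class WeightedPairMetrics(predLabelsWeight: RDD[(Double, Double, Double)]) {
  // def this(predAndLabels: RDD[(Double, Double)]) =
  //   this(predAndLabels.map { case (pred, label) => (pred, label, 1.0) })
}
```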
You can add a member val to the class, like val predAndLabelsWithOptWeight: RDD[(Double, Double, Double)], and have the constructors assign it, so the following calculation code will be easier.
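A rough sketch of that pattern with illustrative names (assumption: the class and val names below are made up and need not match the PR's final code): a single constructor accepts either shape of tuple, and a member val normalizes missing weights to 1.0 so the downstream aggregations only ever see (prediction, label, weight) triples.

```scala
import org.apache.spark.rdd.RDD

// Sketch only; names are illustrative. Unweighted pairs get a default weight of 1.0.
class WeightedMulticlassMetrics(predAndLabels: RDD[_ <: Product]) {

  private val predAndLabelsWithOptWeight: RDD[(Double, Double, Double)] =
    predAndLabels.map {
      case (pred: Double, label: Double) => (pred, label, 1.0)
      case (pred: Double, label: Double, weight: Double) => (pred, label, weight)
    }

  // Example downstream computation: weighted count of rows per true label.
  lazy val labelCountByClass: Map[Double, Double] =
    predAndLabelsWithOptWeight
      .map { case (_, label, weight) => (label, weight) }
      .reduceByKey(_ + _)
      .collect()
      .toMap
}
```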
good idea, done!
Test build #89483 has finished for PR 17086 at commit
Test build #89551 has finished for PR 17086 at commit
@WeichenXu123 I've updated the PR, resolved all comments and the build passes - would you be able to take another look when you have time? Thank you!
The .mapValues(weight => weight) is redundant; it generates the same RDD.
good catch! removed
use 1E-7 ?
done
Use operator A ~== B absTol delta like other tests.
done
There are many repeated expressions here, such as (2 * w1 + 1 * w2 + 1 * w1) / tw. Could you store them in variables first?
sure, I was trying to follow the format of the other existing test; I've made the change in both test cases
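Something along these lines, with made-up weights and counts, just to show how naming each repeated sub-expression once keeps the expected values readable:

```scala
// Hypothetical numbers purely for illustration; they are not the test's actual data.
val w1 = 2.2
val w2 = 1.5
val tpClass0 = 2 * w1 + 1 * w2      // weighted true positives for label 0.0
val fnClass0 = 1 * w1               // weighted false negatives for label 0.0
val classWeight0 = tpClass0 + fnClass0
val tw = classWeight0 + 3 * w2      // total weight over all rows in this toy setup

val expectedRecall0 = tpClass0 / classWeight0
val expectedWeightedRecallTerm0 = expectedRecall0 * (classWeight0 / tw)
```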
use assert(metrics.labels === labels) like other tests.
done
don't use toArray; use assert(metrics.confusionMatrix ~== confusionMatrix relTol e)
done
it looks like I needed to change this to an ML matrix instead of an MLlib matrix in order to make ~== work, so I used .asML
Oh, that's because you use Matrices from mllib; change it to Matrices from ml, i.e., import org.apache.spark.ml.linalg.Matrices
done
however, I still need to call .asML on metrics.confusionMatrix, as that property comes from mllib (the MulticlassMetrics class)
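So the assertion ends up looking roughly like this (a sketch: `metrics` is assumed to be the MulticlassMetrics instance built earlier in the test, the matrix values are placeholders, and the ~== / relTol syntax comes from Spark's test-scope TestingUtils):

```scala
import org.apache.spark.ml.linalg.Matrices
import org.apache.spark.ml.util.TestingUtils._ // provides the ~== / relTol comparison syntax

val delta = 1e-7
// Placeholder expected values, given column-major to the ml Matrices factory.
val expectedConfusionMatrix = Matrices.dense(2, 2, Array(2.2, 0.5, 1.5, 3.7))

// metrics.confusionMatrix is an mllib matrix, so convert it with .asML before comparing.
assert(metrics.confusionMatrix.asML ~== expectedConfusionMatrix relTol delta)
```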
112bba9 to 47c45cb (force-push)
Test build #89890 has finished for PR 17086 at commit
47c45cb to 089d64b (force-push)
Test build #89891 has finished for PR 17086 at commit
089d64b to 6906dc4 (force-push)
Test build #89894 has finished for PR 17086 at commit
I think relTol will be better than absTol, except in cases where one side is zero. What do you think?
good idea, I've replaced absTol with relTol, done
overall good. @jkbradley would you mind taking a look?
6906dc4 to f209bb4 (force-push)
jenkins, retest this please
1 similar comment
jenkins, retest this please
Test build #90000 has finished for PR 17086 at commit
looks like a random failure in SparkR, unrelated
jenkins, retest this please
Test build #90002 has finished for PR 17086 at commit
srowen left a comment
I am also pretty OK with this one; straightforward
mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala (outdated review thread, resolved)
mllib/src/main/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.scala (outdated review thread, resolved)
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala (outdated review thread, resolved)
a416fa0 to 0ca102a (force-push)
Test build #98503 has finished for PR 17086 at commit
Oh, wait a sec, this changed the signature. I think you have to retain both. The RDD[(Double, Double)] constructor should stay, one way or the other, and add a new RDD[(Double, Double, Double)] constructor, with appropriate Since tags on each.
Below there's a DataFrame constructor and I'm not sure how to handle that. It should also handle the case where there's a weight col, but I'm not sure how to do that cleanly. There can be a second argument like hasWeightCol but that's starting to feel hacky.
@srowen hmm, this was already suggested; please see this comment: #17086 (comment)
The build fails with an error due to Java type erasure, so this wouldn't work: you can't have two constructors with the same type-erased signature. Maybe I am misunderstanding something and you meant something else? Are you sure this changes the signature in a way that breaks callers? It should still accept an RDD of tuples of two Double values.
The error I get is:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala:35: double definition:
[error] constructor MulticlassMetrics: (predLabelsWeight: org.apache.spark.rdd.RDD[(Double, Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 33 and
[error] constructor MulticlassMetrics: (predAndLabels: org.apache.spark.rdd.RDD[(Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 35
[error] have same type after erasure: (predLabelsWeight: org.apache.spark.rdd.RDD)org.apache.spark.mllib.evaluation.MulticlassMetrics
[error] def this(predAndLabels: RDD[(Double, Double)]) =
Darn, OK. Hm, so this doesn't actually cause a source or binary change? OK, that could be fine. I guess MiMa didn't complain. I guess you can now do weird things like pass RDD[String] here and it'll fail quickly. I'm a little uneasy about it but it's probably acceptable. Any other opinions?
I am not sure what to do about the DataFrame issue though. I suspect most people will want to call with a DataFrame now.
"I am not sure what to do about the DataFrame issue though", ah, I think I see your concern.
But, isn't this dataframe constructor private anyway, so it can't be used by anyone outside mllib:
private[mllib] def this(predictionAndLabels: DataFrame) =
  this(predictionAndLabels.rdd.map(r => (r.getDouble(0), r.getDouble(1))))
I only modified the RDD part because that is what the ML evaluator uses and what users outside Spark can access; the point of this PR is to add a weight column for the evaluators.
However, even if we wanted to add weight-column support to the private API, I'm unsure how to do it. Should I just check whether there are three columns or two, and if there are three, use the third as the weight column? I'm on the fence: I could change it, but I don't think it's strictly necessary, since it isn't used anywhere outside Spark MLlib anyway.
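For what it's worth, a hedged sketch of that "use the third column as the weight when it is present" idea (a hypothetical helper, not code from this PR):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper, purely to illustrate the idea being debated above.
object WeightedRows {
  def toPredLabelWeight(predictionAndLabels: DataFrame): RDD[(Double, Double, Double)] =
    predictionAndLabels.rdd.map { (r: Row) =>
      if (r.length >= 3) (r.getDouble(0), r.getDouble(1), r.getDouble(2))
      else (r.getDouble(0), r.getDouble(1), 1.0) // no weight column: default to 1.0
    }
}
```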
Actually, this constructor is a bit weird, it looks like it was added as part of this PR:
https://github.com/apache/spark/pull/6011/files
It is only used here in the python API:
https://github.com/apache/spark/pull/6011/files#diff-443f766289f8090078531c3e1a1d6027R186
But I don't see why we couldn't just get the rdd there and remove the private constructor altogether (?)
The python API takes an RDD, creates a DF, and then calls this private constructor with the DF, but I would think we could just pass the RDD directly
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala (3 outdated review threads, resolved)
Test build #98529 has finished for PR 17086 at commit
Test build #98533 has finished for PR 17086 at commit
@srowen would you be able to take another look at this PR? Also tagging @WeichenXu123. Thank you!
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala (2 outdated review threads, resolved)
Test build #98575 has finished for PR 17086 at commit
srowen left a comment
There's a merge conflict now, but that's looking good to me. I'd still like another reviewer, but am personally pretty comfortable with the change.
88b4bad to d54cc55 (force-push)
Test build #98618 has finished for PR 17086 at commit
Test build #98619 has finished for PR 17086 at commit
@srowen thanks, I've fixed the merge conflict and updated to the latest
Merged to master
thank you @srowen! I will try to update the other two PRs as soon as possible. Really exciting to see this get in.
…ed weight column for multiclass classification evaluator
Closes apache#17086 from imatiach-msft/ilmat/multiclass-evaluate.
Authored-by: Ilya Matiach <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.
I've closed PR #16557 as recommended, in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass), to make them easier to review/update.
Note: I've updated the JIRA to https://issues.apache.org/jira/browse/SPARK-24101, which is a child of https://issues.apache.org/jira/browse/SPARK-18693
How was this patch tested?
I added tests to the metrics class.
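For reference, a sketch of how the weight column is meant to be used from the ML side (the column names and values below are assumptions, and the setter is assumed to follow the usual setWeightCol naming):

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object WeightedEvalExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("weighted-eval").getOrCreate()
    import spark.implicits._

    // (prediction, label, weight) rows; the values are made up for illustration.
    val predictions = Seq(
      (0.0, 0.0, 2.0),
      (1.0, 1.0, 1.5),
      (1.0, 0.0, 0.5)
    ).toDF("prediction", "label", "weight")

    val evaluator = new MulticlassClassificationEvaluator()
      .setMetricName("accuracy")
      .setWeightCol("weight") // per-row sample weights, the feature added by this PR

    println(s"weighted accuracy = ${evaluator.evaluate(predictions)}")
    spark.stop()
  }
}
```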