[SPARK-24101][ML][MLLIB] ML Evaluators should use weight column - added weight column for multiclass classification evaluator #17086
Conversation
Test build #73529 has finished for PR 17086 at commit
@sethah @Lewuathe @thunterdb @WeichenXu123 @jkbradley @actuaryzhang @srowen would you be able to take a look? I've split the larger pull request into three parts as suggested.
ping @sethah @Lewuathe @thunterdb @WeichenXu123 @jkbradley @actuaryzhang @srowen could you please take a look? thank you!
Test build #78372 has started for PR 17086 at commit
cf6a5ab to 4a0debf (force-push)
jenkins, retest this please
Test build #89407 has finished for PR 17086 at commit
WeichenXu123 left a comment
Thanks! I made an initial rough review.
@Since("2.4.0")
done
This line seems useless? dataset.select(pred, label)...values.countByValues()
good catch -- hmm that shouldn't be there, not sure why I added it, removed
I prefer to define two constructors:
this(predAndLabelsWithOptWeight: RDD[(Double, Double, Double)])
this(predAndLabels: RDD[(Double, Double)])
so we get stricter type checking.
good idea, this also simplifies the calculation of the confusions, fpByClass, tpByClass and labelCountByClass
hmm, the build fails here with an error indicating the two constructors have the same type after erasure; perhaps I should revert this change:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala:35: double definition:
[error] constructor MulticlassMetrics: (predLabelsWeight: org.apache.spark.rdd.RDD[(Double, Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 33 and
[error] constructor MulticlassMetrics: (predAndLabels: org.apache.spark.rdd.RDD[(Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 35
[error] have same type after erasure: (predLabelsWeight: org.apache.spark.rdd.RDD)org.apache.spark.mllib.evaluation.MulticlassMetrics
[error] def this(predAndLabels: RDD[(Double, Double)]) =
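For context, a minimal standalone sketch (hypothetical class name, not the PR's code) of why the two overloads clash: the tuple type parameters are erased on the JVM, so both constructors end up taking a bare RDD.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical class for illustration only: the primary constructor takes weighted
// triples; un-commenting the auxiliary constructor for unweighted pairs reproduces
// the "same type after erasure" error quoted above, because both erase to (RDD).
class WeightedPairMetrics(predLabelsWeight: RDD[(Double, Double, Double)]) {
  // def this(predAndLabels: RDD[(Double, Double)]) =
  //   this(predAndLabels.map { case (pred, label) => (pred, label, 1.0) })
}
```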
You can add a member val to the class, like val predAndLabelsWithOptWeight: RDD[(Double, Double, Double)], and have the constructors assign it, so the following calculation code will be easier.
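A rough sketch of that pattern with illustrative names (assumption: the class and val names below are made up and need not match the PR's final code): a single constructor accepts either shape of tuple, and a member val normalizes missing weights to 1.0 so the downstream aggregations only ever see (prediction, label, weight) triples.

```scala
import org.apache.spark.rdd.RDD

// Sketch only; names are illustrative. Unweighted pairs get a default weight of 1.0.
class WeightedMulticlassMetrics(predAndLabels: RDD[_ <: Product]) {

  private val predAndLabelsWithOptWeight: RDD[(Double, Double, Double)] =
    predAndLabels.map {
      case (pred: Double, label: Double) => (pred, label, 1.0)
      case (pred: Double, label: Double, weight: Double) => (pred, label, weight)
    }

  // Example downstream computation: weighted count of rows per true label.
  lazy val labelCountByClass: Map[Double, Double] =
    predAndLabelsWithOptWeight
      .map { case (_, label, weight) => (label, weight) }
      .reduceByKey(_ + _)
      .collect()
      .toMap
}
```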
good idea, done!
Test build #89483 has finished for PR 17086 at commit
Test build #89551 has finished for PR 17086 at commit
@WeichenXu123 I've updated the PR, resolved all comments and the build passes - would you be able to take another look when you have time? Thank you!
The .mapValues(weight => weight) is redundant; it generates the same RDD.
good catch! removed
use 1E-7 ?
done
Use operator A ~== B absTol delta like other tests.
done
There are many repeated expressions here, such as (2 * w1 + 1 * w2 + 1 * w1) / tw. Could you store them in variables first?
sure, I was trying to follow the format of the other existing test; I've made the change in both test cases
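Something along these lines, with made-up weights and counts, just to show how naming each repeated sub-expression once keeps the expected values readable:

```scala
// Hypothetical numbers purely for illustration; they are not the test's actual data.
val w1 = 2.2
val w2 = 1.5
val tpClass0 = 2 * w1 + 1 * w2      // weighted true positives for label 0.0
val fnClass0 = 1 * w1               // weighted false negatives for label 0.0
val classWeight0 = tpClass0 + fnClass0
val tw = classWeight0 + 3 * w2      // total weight over all rows in this toy setup

val expectedRecall0 = tpClass0 / classWeight0
val expectedWeightedRecallTerm0 = expectedRecall0 * (classWeight0 / tw)
```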
use assert(metrics.labels === labels) like other tests.
done
don't use toArray; use assert(metrics.confusionMatrix ~== confusionMatrix relTol e)
done
it looks like I needed to change this to an ML matrix instead of an MLlib matrix in order to make ~== work, so I used .asML
Oh, that's because you use Matrices from mllib; change it to Matrices from ml, i.e., import org.apache.spark.ml.linalg.Matrices
done
however, I still need to call .asML on metrics.confusionMatrix, as that property comes from mllib (the MulticlassMetrics class)
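So the assertion ends up looking roughly like this (a sketch: `metrics` is assumed to be the MulticlassMetrics instance built earlier in the test, the matrix values are placeholders, and the ~== / relTol syntax comes from Spark's test-scope TestingUtils):

```scala
import org.apache.spark.ml.linalg.Matrices
import org.apache.spark.ml.util.TestingUtils._ // provides the ~== / relTol comparison syntax

val delta = 1e-7
// Placeholder expected values, given column-major to the ml Matrices factory.
val expectedConfusionMatrix = Matrices.dense(2, 2, Array(2.2, 0.5, 1.5, 3.7))

// metrics.confusionMatrix is an mllib matrix, so convert it with .asML before comparing.
assert(metrics.confusionMatrix.asML ~== expectedConfusionMatrix relTol delta)
```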
112bba9 to 47c45cb (force-push)
Test build #89890 has finished for PR 17086 at commit
47c45cb to 089d64b (force-push)
Test build #89891 has finished for PR 17086 at commit
089d64b to 6906dc4 (force-push)
Test build #89894 has finished for PR 17086 at commit
I think relTol will be better than absTol, except in cases where one side is zero. What do you think?
good idea, I've replaced absTol with relTol, done
overall good. @jkbradley would you mind taking a look?
6906dc4 to f209bb4 (force-push)
jenkins, retest this please
1 similar comment
jenkins, retest this please
Test build #90000 has finished for PR 17086 at commit
looks like a random failure in SparkR, unrelated
jenkins, retest this please
Test build #90002 has finished for PR 17086 at commit
srowen left a comment
I am also pretty OK with this one; straightforward
mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala (outdated review thread, resolved)
mllib/src/main/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.scala (outdated review thread, resolved)
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala (outdated review thread, resolved)
a416fa0 to 0ca102a (force-push)
Test build #98503 has finished for PR 17086 at commit
Oh, wait a sec, this changed the signature. I think you have to retain both. The RDD[(Double, Double)] constructor should stay, one way or the other, and add a new RDD[(Double, Double, Double)] constructor, with appropriate Since tags on each.
Below there's a DataFrame constructor and I'm not sure how to handle that. It should also handle the case where there's a weight col, but I'm not sure how to do that cleanly. There can be a second argument like hasWeightCol but that's starting to feel hacky.
@srowen hmm, this was already suggested; please see this comment: #17086 (comment)
The build fails with an error due to Java type erasure, so this wouldn't work: you can't have two constructors with the same type-erased signature. Maybe I am misunderstanding something and you meant something else? Are you sure this changes the signature in a way that breaks callers? It should still accept an RDD of tuples of two Double values.
The error I get is:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala:35: double definition:
[error] constructor MulticlassMetrics: (predLabelsWeight: org.apache.spark.rdd.RDD[(Double, Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 33 and
[error] constructor MulticlassMetrics: (predAndLabels: org.apache.spark.rdd.RDD[(Double, Double)])org.apache.spark.mllib.evaluation.MulticlassMetrics at line 35
[error] have same type after erasure: (predLabelsWeight: org.apache.spark.rdd.RDD)org.apache.spark.mllib.evaluation.MulticlassMetrics
[error] def this(predAndLabels: RDD[(Double, Double)]) =
Darn, OK. Hm, so this doesn't actually cause a source or binary change? OK, that could be fine. I guess MiMa didn't complain. I guess you can now do weird things like pass RDD[String] here and it'll fail quickly. I'm a little uneasy about it but it's probably acceptable. Any other opinions?
I am not sure what to do about the DataFrame issue though. I suspect most people will want to call with a DataFrame now.
"I am not sure what to do about the DataFrame issue though", ah, I think I see your concern.
But, isn't this dataframe constructor private anyway, so it can't be used by anyone outside mllib:
private[mllib] def this(predictionAndLabels: DataFrame) =
  this(predictionAndLabels.rdd.map(r => (r.getDouble(0), r.getDouble(1))))
I only modified the RDD part because that is what the ML evaluator uses and what users outside Spark can access; the point of this PR is to add a weight column for the evaluators.
However, even if we wanted to add weight-column support to the private API, I'm unsure how to do it. Should I just check whether there are three columns or two, and if there are three, use the third as the weight column? I'm on the fence: I could change it, but I don't think it's strictly necessary, since it isn't used anywhere outside Spark MLlib anyway.
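For what it's worth, a hedged sketch of that "use the third column as the weight when it is present" idea (a hypothetical helper, not code from this PR):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper, purely to illustrate the idea being debated above.
object WeightedRows {
  def toPredLabelWeight(predictionAndLabels: DataFrame): RDD[(Double, Double, Double)] =
    predictionAndLabels.rdd.map { (r: Row) =>
      if (r.length >= 3) (r.getDouble(0), r.getDouble(1), r.getDouble(2))
      else (r.getDouble(0), r.getDouble(1), 1.0) // no weight column: default to 1.0
    }
}
```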
Actually, this constructor is a bit weird, it looks like it was added as part of this PR:
https://github.com/apache/spark/pull/6011/files
It is only used here in the python API:
https://github.com/apache/spark/pull/6011/files#diff-443f766289f8090078531c3e1a1d6027R186
But I don't see why we couldn't just get the rdd there and remove the private constructor altogether (?)
The python API takes an RDD, creates a DF, and then calls this private constructor with the DF, but I would think we could just pass the RDD directly
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala (3 outdated review threads, resolved)
Test build #98529 has finished for PR 17086 at commit
Test build #98533 has finished for PR 17086 at commit
@srowen would you be able to take another look at this PR? Also tagging @WeichenXu123. Thank you!
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala (2 outdated review threads, resolved)
Test build #98575 has finished for PR 17086 at commit
srowen left a comment
There's a merge conflict now, but that's looking good to me. I'd still like another reviewer, but am personally pretty comfortable with the change.
88b4bad to d54cc55 (force-push)
Test build #98618 has finished for PR 17086 at commit
Test build #98619 has finished for PR 17086 at commit
@srowen thanks, I've fixed the merge conflict and updated to the latest
Merged to master
thank you @srowen! I will try to update the other two PRs as soon as possible. Really exciting to see this get in.
…ed weight column for multiclass classification evaluator
Closes apache#17086 from imatiach-msft/ilmat/multiclass-evaluate.
Authored-by: Ilya Matiach <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.
I've closed PR #16557 as recommended, in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass), to make them easier to review/update.
Note: I've updated the JIRA to https://issues.apache.org/jira/browse/SPARK-24101, which is a child of https://issues.apache.org/jira/browse/SPARK-18693
How was this patch tested?
I added tests to the metrics class.
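For reference, a sketch of how the weight column is meant to be used from the ML side (the column names and values below are assumptions, and the setter is assumed to follow the usual setWeightCol naming):

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object WeightedEvalExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("weighted-eval").getOrCreate()
    import spark.implicits._

    // (prediction, label, weight) rows; the values are made up for illustration.
    val predictions = Seq(
      (0.0, 0.0, 2.0),
      (1.0, 1.0, 1.5),
      (1.0, 0.0, 0.5)
    ).toDF("prediction", "label", "weight")

    val evaluator = new MulticlassClassificationEvaluator()
      .setMetricName("accuracy")
      .setWeightCol("weight") // per-row sample weights, the feature added by this PR

    println(s"weighted accuracy = ${evaluator.evaluate(predictions)}")
    spark.stop()
  }
}
```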