
Conversation

Squadrick
Member

@Squadrick Squadrick commented Sep 11, 2019

  • Add threshold param to f-scores
  • Tests now compare with sklearn
  • Add sklearn to requirements

Fixes #490

@Squadrick Squadrick requested a review from a team as a code owner September 11, 2019 14:28
@Squadrick
Member Author

cc: @SSaishruthi

@SSaishruthi
Contributor

SSaishruthi commented Sep 11, 2019

Hi @Squadrick
Thanks.
I think I have provided the same threshold solution. Any reason for closing my current work and creating a new one?

@Squadrick
Member Author

It's extended to include the case where threshold=None, in which case it defaults to using the max. The reason for closing your PR was more to do with the fact that this includes a pretty major refactor in addition to the threshold functionality.
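For context, a minimal sketch of what "defaults to using the max" could look like when binarizing predictions; the helper name and details below are illustrative, not the PR's exact code:

import tensorflow as tf

def _threshold_predictions(y_pred, threshold=None):
    # Illustrative only: binarize predictions either with a fixed threshold
    # or, when threshold is None, by taking the per-sample maximum.
    if threshold is None:
        max_per_row = tf.reduce_max(y_pred, axis=-1, keepdims=True)
        y_pred_bin = tf.cast(y_pred >= max_per_row, tf.int32)
    else:
        y_pred_bin = tf.cast(y_pred > threshold, tf.int32)
    return y_pred_bin

preds = tf.constant([[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
print(_threshold_predictions(preds).numpy())       # max per row: [[0 1 0] [0 0 1]]
print(_threshold_predictions(preds, 0.5).numpy())  # fixed threshold: [[0 0 0] [0 0 1]]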

@SSaishruthi
Contributor

SSaishruthi commented Sep 11, 2019

Sounds good.

The reason I asked is that I had the same threshold functionality (except the None case) and test cases, and was waiting for review to make sure we are OK with the fix before adding further tests/changes. I was confused when I saw this one without any comments on the existing PR.

Thanks for the modifications.

@Squadrick
Member Author

Hey, sorry about that, my bad. Should've checked before opening this PR.

Member

@seanpmorgan seanpmorgan left a comment

Looks very good, thanks @Squadrick. I haven't been able to do a full review yet -- but I was wondering, @SSaishruthi, whether you wouldn't mind taking a pass at a review, seeing as this is familiar to you?

@seanpmorgan
Member

Also ping @PhilipMay for review

@SSaishruthi
Contributor

@seanpmorgan Sure will do

@PhilipMay
Contributor

PhilipMay commented Sep 12, 2019

In the case of "single-label categorical classification", where one sample belongs to exactly one class of many possible classes, this looks good in my downstream task.

Will test binary classification next. Somehow I have the feeling that this might not work correctly.

@PhilipMay
Contributor

PhilipMay commented Sep 12, 2019

Binary classification is working (but a little bit ugly):

import tensorflow as tf
import numpy as np
import f_scores
from sklearn.metrics import f1_score

actuals = np.array([[0], [1], [1], [1]])
preds = np.array([[0.2], [0.3], [0.7], [0.9]])

f1 = f_scores.F1Score(num_classes=1,
                      average='micro',  # the value here does not matter in binary case
                      threshold=0.5)
f1.update_state(actuals, preds)

f1_result = f1.result().numpy() 

print('F1 from metric:', f1_result)

ytrue = actuals
ypred = np.rint(preds)

f1_result = f1_score(ytrue, ypred, average='binary', pos_label=1)

print("F1 from sklearn:", f1_result)  

This has the following output (which is good):

F1 from metric: 0.8
F1 from sklearn: 0.8

What is ugly is that num_classes has to be set to 1, which is wrong and feels hacky. Maybe num_classes should be renamed to label_length or something like this.

I also see a problem with average in this binary case: no value makes sense. micro, macro, weighted and None are all not useful for binary classification.

- def __init__(self, num_classes, average, name='f1_score',
+ def __init__(self,
+              num_classes,
+              average,
Contributor

average has no default value here, so it is inconsistent with FBetaScore, which has average=None as its default.

Contributor

Ok now.

@SSaishruthi
Contributor

SSaishruthi commented Sep 12, 2019

@Squadrick
I am thinking of removing the num_classes parameter and adding code inside to infer it.
What do you think about that?

Maybe use tf.shape?
This will be a lot cleaner, I guess.
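A hedged sketch of what inferring the class dimension could look like; the class and attribute names here are illustrative, not this PR's code. For simplicity it reads the static shape rather than tf.shape and creates the weight lazily on the first update_state call:

import tensorflow as tf

class InferredShapeMetric(tf.keras.metrics.Metric):
    """Illustrative sketch only: infer the class dimension on the first call."""

    def __init__(self, name='inferred_shape_metric', **kwargs):
        super().__init__(name=name, **kwargs)
        self.true_positives = None  # created lazily in update_state

    def update_state(self, y_true, y_pred, sample_weight=None):
        if self.true_positives is None:
            num_classes = y_pred.shape[-1]  # inferred instead of passed to __init__
            self.true_positives = self.add_weight(
                name='true_positives', shape=(num_classes,), initializer='zeros')
        y_pred_bin = tf.cast(y_pred > 0.5, self.dtype)
        y_true = tf.cast(y_true, self.dtype)
        self.true_positives.assign_add(tf.reduce_sum(y_pred_bin * y_true, axis=0))

    def result(self):
        return self.true_positives

A dynamic class dimension would need tf.shape and more care, and weight creation moves from __init__ to update_state, which is the trade-off discussed below.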

@PhilipMay
Contributor

A small test for `F1Score` would be good. That is still missing.

IMO a test for the binary case should be added. Maybe just use this here: #502 (comment)

@Squadrick
Member Author

Squadrick commented Sep 13, 2019

@PhilipMay I test for F-1 score implicitly by testing for F-Beta with beta=1.0, but a test to guard against API breakage would be useful. Binary classification with num_classes=1 is still misleading, and at the very least, renaming num_classes to label_length would make it slightly easier. Like @SSaishruthi mentioned, tf.shape can be used to infer num_classes/label_length from y_true and y_pred, but initialization of the weights (for tracking false positives, etc.) will then happen in update_state instead.

The current implementation of keeping track of the weights and doing the final calculation in result can't be extended to include sample_weight. The alternative would be to use a stateless f_beta_score and wrap it in MeanMetricWrapper; this would be slightly slower, but adding support for sample_weight would be a lot easier.
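A hedged sketch of that alternative, assuming a hypothetical stateless f1_from_batch function; MeanMetricWrapper keeps the running mean and applies sample_weight (depending on the TensorFlow version it may only be importable from Keras internals rather than tf.keras.metrics):

import tensorflow as tf

def f1_from_batch(y_true, y_pred, threshold=0.5):
    # Hypothetical stateless per-sample micro-F1; not the PR's implementation.
    y_pred = tf.cast(y_pred > threshold, tf.float32)
    y_true = tf.cast(y_true, tf.float32)
    tp = tf.reduce_sum(y_true * y_pred, axis=-1)
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=-1)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=-1)
    return 2.0 * tp / (2.0 * tp + fp + fn + 1e-12)

# The wrapper keeps a weighted running mean of the per-sample values.
f1_metric = tf.keras.metrics.MeanMetricWrapper(f1_from_batch, name='f1')
f1_metric.update_state([[0.0, 1.0], [1.0, 0.0]], [[0.1, 0.9], [0.8, 0.2]],
                       sample_weight=[0.5, 1.5])
print(f1_metric.result().numpy())

One caveat with this route: averaging per-sample (or per-batch) F1 values is not mathematically the same as the dataset-level F1 computed from accumulated true/false positive counts.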

@Squadrick
Member Author

I've added a very simple F1Test to check that it is the same as FBetaScore(beta=1.0), and since we test FBeta pretty extensively, IMO it should be fine.

@SSaishruthi
Contributor

@Squadrick I will start with sample weight addition after this PR gets merged

@PhilipMay
Contributor

I've added a very simple F1Test to check that it is the same as FBetaScore(beta=1.0), and since we test FBeta pretty extensively, IMO it should be fine.

Yes, thanks. That's what I meant: a small "smoke test".

@PhilipMay
Contributor

LGTM - the bug seems to be fixed now.

A different thought: tf.keras.metrics already implements many basic metrics: confusion matrix, precision, recall and so on. Wouldn't it be a good idea to build on them?
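For illustration, a hedged sketch of what building on the built-in metrics could look like for the binary case, reusing the data from the earlier comment (this is a sketch of the suggestion, not what the PR implements):

import tensorflow as tf

# Compose F1 from the built-in Precision and Recall metrics.
precision = tf.keras.metrics.Precision(thresholds=0.5)
recall = tf.keras.metrics.Recall(thresholds=0.5)

y_true = [[0], [1], [1], [1]]
y_pred = [[0.2], [0.3], [0.7], [0.9]]
precision.update_state(y_true, y_pred)
recall.update_state(y_true, y_pred)

p, r = precision.result(), recall.result()
f1 = 2 * p * r / (p + r + 1e-12)
print(float(f1))  # ~0.8 for this example, matching the earlier output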

@SSaishruthi
Contributor

Both work in the same way. We change the calculation according to the type. This seems to be the better way after investigation.

@Squadrick Squadrick added the wip Work in-progress label Sep 13, 2019
@Squadrick
Member Author

Both work in the same way. We change the calculation according to the type. This seems to be the better way after investigation.

I don't quite understand. What are the alternatives that work the same way?

@Squadrick
Member Author

@Squadrick I will start with sample weight addition after this PR gets merged

Like I mentioned above, I think a better approach would be to use MeanMetricWrapper and let it handle the sample_weight. I currently don't see an easy way of adding sample weight functionality to F-scores.

@facaiy
Member

facaiy commented Sep 15, 2019

Hi, thanks everyone :-) I haven't done a full review yet.

  1. For implementation:

@PhilipMay A different thought: tf.keras.metrics already implements many basic metrics: confusion matrix, precision, recall and so on.
@Squadrick a better approach would be to use MeanMetricWrapper and let it handle the sample_weight.

+1, I'm wondering if we can refer to the AUC metric.

Tests now compare with sklearn
Add sklearn to requirements

I prefer to keep dependencies minimal, and F1Score is not so complex that we cannot calculate it easily. Maybe we can learn from the test cases in #466.

In the case of "single-label categorical classification"
actuals = np.array([[0], [1], [1], [1]])

If I'm not wrong, tf.keras uses one-hot encoding for labels y_true and logits for y_pred by default. We can clarify the requirement in the documentation (like the AUC metric). If we really care about it, please refer to the subclass solution for accuracy: Accuracy, BinaryAccuracy, CategoricalAccuracy, SparseCategoricalAccuracy, etc.

What do you think, Philip, Saishruthi, Dheeraj? Thank you all for your contributions.

y_pred = y_pred > self.threshold

y_true = tf.cast(y_true, tf.int32)
y_pred = tf.cast(y_pred, tf.int32)
Contributor

This line is redundant with line 124, where the same operation is executed.

@Squadrick
Member Author

Squadrick commented Sep 30, 2019

I'd prefer using sklearn as the ground truth rather than hard-coding the values for two reasons:

  1. Adopting sklearn across tests for tfa.metrics will guarantee that any param style we borrow from sklearn will behave as expected. Example: the type of average in F-scores.
  2. Specifically, in this case, rewriting to include hard-coded values makes the code much longer with a lot of duplication, since we need to test different types of averages.

I'm open to hard-coding the results if you think that's the better approach. What do you all think? @WindQAQ
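For comparison, a sketch of what a hard-coded test case might look like, assuming the post-merge tfa.metrics.FBetaScore API (the exact signature is an assumption); the expected value 6/7 is worked out by hand in the comment, not taken from the PR's tests:

import numpy as np
import tensorflow_addons as tfa

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=np.float32)
y_pred = np.array([[0.9, 0.1, 0.0], [0.2, 0.6, 0.2],
                   [0.0, 0.3, 0.7], [0.3, 0.4, 0.3]], dtype=np.float32)

metric = tfa.metrics.FBetaScore(num_classes=3, average='micro',
                                beta=1.0, threshold=0.5)
metric.update_state(y_true, y_pred)

# With threshold 0.5 the last sample predicts no class, so TP=3, FP=0, FN=1
# and micro-F1 = 2*TP / (2*TP + FP + FN) = 6/7.
np.testing.assert_allclose(metric.result().numpy(), 6.0 / 7.0, atol=1e-6)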

@PhilipMay
Contributor

@Squadrick for me both ways are good. Both have pros and cons.

@facaiy
Member

facaiy commented Oct 8, 2019

I'm afraid that sklearn is too heavy. What do you think, @seanpmorgan @WindQAQ?

@PhilipMay
Contributor

@facaiy "too heavy" sounds very abstract for me. Can you explain what you think is the concrete disadvantage? Download needs too much time, installing docker image for testing needs too much time? What is it that makes you think "too heavy"?

@facaiy
Member

facaiy commented Oct 8, 2019

@PhilipMay Hi Philip, it's easy to add a new dependency but difficult (sometimes impossible) to remove one; that's why I suggest acting conservatively. Moreover, sklearn is a quite complicated Python wheel which has many dependencies of its own.

@PhilipMay
Contributor

Would it be a solution to split into test and install dependencies? See here: https://stackoverflow.com/questions/15422527/best-practices-how-do-you-list-required-dependencies-in-your-setup-py

@facaiy
Member

facaiy commented Oct 18, 2019

Sorry for the delay, Philip. I'm referring to test dependencies when I use 'dependency' above. Anyway, I'm not against the sklearn proposal if you insist :-) What do you think, Dheeraj @Squadrick ?

@WindQAQ
Member

WindQAQ commented Oct 18, 2019

I'm afraid that sklearn is too heavy. What do you think, @seanpmorgan @WindQAQ?

Agreed, +1. As a plugin/addons package, it would be great if we could keep the wheel lightweight. So in this case, if we can do the unit tests even without sklearn, I would think this dependency is not a must.

@PhilipMay
Contributor

Ok. So let’s do this without Sklearn. For me finishing this PR has priority anyway.

@seanpmorgan
Member

Ok. So let’s do this without Sklearn. For me finishing this PR has priority anyway.

Agreed, we've had a similar discussion before... both options have pros and cons, though we have precedent throughout the repo of using pre-calculated values.

* Add `threshold` param to f-scores
* Tests now compare with sklearn
* Add sklearn to requirements
* Register FBetaScore and F1Score as Keras custom objects
* Update readme to separate both metrics
* Resort to using hard-coded test cases rather than comparing with sklearn
@Squadrick
Member Author

Sorry about the delay; I've hardcoded the tests.

@PhilipMay @SSaishruthi @seanpmorgan

Member

@seanpmorgan seanpmorgan left a comment

LGTM thanks for the refactor!

@seanpmorgan seanpmorgan merged commit c3aba08 into tensorflow:master Nov 5, 2019
@Squadrick Squadrick deleted the metric-fixes branch November 5, 2019 03:53
@PhilipMay
Contributor

@Squadrick thanks for finalizing this. :-)
