
Conversation

Squadrick
Member

@Squadrick Squadrick commented Sep 11, 2019

  • Add threshold param to f-scores
  • Tests now compare with sklearn
  • Add sklearn to requirements

Fixes #490

@Squadrick Squadrick requested a review from a team as a code owner September 11, 2019 14:28
@Squadrick
Member Author

cc: @SSaishruthi

@SSaishruthi
Contributor

SSaishruthi commented Sep 11, 2019

Hi @Squadrick
Thanks.
I think I have provided the same threshold solution. Any reason for closing my current work and creating a new one?

@Squadrick
Member Author

It's extended to include the case where threshold=None, in which case it defaults to using the max. The reason for closing your PR was more to do with the fact that this includes a pretty major refactor in addition to the threshold functionality.
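For context, a minimal sketch of what "defaults to using the max" could look like when binarizing predictions; the helper name and details below are illustrative, not the PR's exact code:

import tensorflow as tf

def _threshold_predictions(y_pred, threshold=None):
    # Illustrative only: binarize predictions either with a fixed threshold
    # or, when threshold is None, by taking the per-sample maximum.
    if threshold is None:
        max_per_row = tf.reduce_max(y_pred, axis=-1, keepdims=True)
        y_pred_bin = tf.cast(y_pred >= max_per_row, tf.int32)
    else:
        y_pred_bin = tf.cast(y_pred > threshold, tf.int32)
    return y_pred_bin

preds = tf.constant([[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
print(_threshold_predictions(preds).numpy())       # max per row: [[0 1 0] [0 0 1]]
print(_threshold_predictions(preds, 0.5).numpy())  # fixed threshold: [[0 0 0] [0 0 1]]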

@SSaishruthi
Contributor

SSaishruthi commented Sep 11, 2019

Sounds good.

The reason I asked is that I had the same threshold functionality (except the None case) and test cases, and was waiting for review to make sure we are OK with the fix before adding further tests/changes. I was confused when I saw this one without any comments on the existing PR.

Thanks for the modifications.

@Squadrick
Member Author

Hey, sorry about that, my bad. Should've checked before opening this PR.

Member

@seanpmorgan seanpmorgan left a comment

Looks very good, thanks @Squadrick. I haven't been able to do a full review yet -- but I was wondering, @SSaishruthi, whether you wouldn't mind taking a pass at a review, seeing as this is familiar to you?

@seanpmorgan
Member

Also ping @PhilipMay for review

@SSaishruthi
Contributor

@seanpmorgan Sure will do

@PhilipMay
Contributor

PhilipMay commented Sep 12, 2019

In the case of "single-label categorical classification", where one sample belongs to exactly one class of many possible classes, this looks good in my downstream task.

Will test binary classification next. Somehow I have the feeling that this might not work correctly.

@PhilipMay
Contributor

PhilipMay commented Sep 12, 2019

Binary classification is working (but a little bit ugly):

import tensorflow as tf
import numpy as np
import f_scores
from sklearn.metrics import f1_score

actuals = np.array([[0], [1], [1], [1]])
preds = np.array([[0.2], [0.3], [0.7], [0.9]])

f1 = f_scores.F1Score(num_classes=1,
                      average='micro',  # the value here does not matter in binary case
                      threshold=0.5)
f1.update_state(actuals, preds)

f1_result = f1.result().numpy() 

print('F1 from metric:', f1_result)

ytrue = actuals
ypred = np.rint(preds)

f1_result = f1_score(ytrue, ypred, average='binary', pos_label=1)

print("F1 from sklearn:", f1_result)  

This has the following output (which is good):

F1 from metric: 0.8
F1 from sklearn: 0.8

What is ugly is that num_classes has to be set to 1, which is wrong and feels hacky. Maybe num_classes should be renamed to label_length or something like this.

I also see a problem with average in this binary case: no value makes sense. micro, macro, weighted and None are all not useful for binary classification.

- def __init__(self, num_classes, average, name='f1_score',
+ def __init__(self,
+              num_classes,
+              average,
Contributor

average has no default value here, so it is inconsistent with FBetaScore, which has average=None as its default.

Contributor

Ok now.

@SSaishruthi
Contributor

SSaishruthi commented Sep 12, 2019

@Squadrick
I am thinking of removing the num_classes parameter and adding code inside to infer it.
What do you think about that?

Maybe use tf.shape?
This will be a lot cleaner, I guess.
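A hedged sketch of what inferring the class dimension could look like; the class and attribute names here are illustrative, not this PR's code. For simplicity it reads the static shape rather than tf.shape and creates the weight lazily on the first update_state call:

import tensorflow as tf

class InferredShapeMetric(tf.keras.metrics.Metric):
    """Illustrative sketch only: infer the class dimension on the first call."""

    def __init__(self, name='inferred_shape_metric', **kwargs):
        super().__init__(name=name, **kwargs)
        self.true_positives = None  # created lazily in update_state

    def update_state(self, y_true, y_pred, sample_weight=None):
        if self.true_positives is None:
            num_classes = y_pred.shape[-1]  # inferred instead of passed to __init__
            self.true_positives = self.add_weight(
                name='true_positives', shape=(num_classes,), initializer='zeros')
        y_pred_bin = tf.cast(y_pred > 0.5, self.dtype)
        y_true = tf.cast(y_true, self.dtype)
        self.true_positives.assign_add(tf.reduce_sum(y_pred_bin * y_true, axis=0))

    def result(self):
        return self.true_positives

A dynamic class dimension would need tf.shape and more care, and weight creation moves from __init__ to update_state, which is the trade-off discussed below.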

@PhilipMay
Contributor

A small test for `F1Score` would be good. That is still missing.

IMO a test for the binary case should be added. Maybe just use this here: #502 (comment)

@Squadrick
Member Author

Squadrick commented Sep 13, 2019

@PhilipMay I test for F-1 score implicitly by testing for F-Beta with beta=1.0, but a test to guard against API breakage would be useful. Binary classification with num_classes=1 is still misleading, and at the very least, renaming num_classes to label_length would make it slightly easier. Like @SSaishruthi mentioned, tf.shape can be used to infer num_classes/label_length from y_true and y_pred, but initialization of the weights (for tracking false positives, etc.) will then happen in update_state instead.

The current implementation of keeping track of the weights and doing the final calculation in result can't be extended to include sample_weight. The alternative would be to use a stateless f_beta_score and wrap it in MeanMetricWrapper; this would be slightly slower, but adding support for sample_weight would be a lot easier.
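A hedged sketch of that alternative, assuming a hypothetical stateless f1_from_batch function; MeanMetricWrapper keeps the running mean and applies sample_weight (depending on the TensorFlow version it may only be importable from Keras internals rather than tf.keras.metrics):

import tensorflow as tf

def f1_from_batch(y_true, y_pred, threshold=0.5):
    # Hypothetical stateless per-sample micro-F1; not the PR's implementation.
    y_pred = tf.cast(y_pred > threshold, tf.float32)
    y_true = tf.cast(y_true, tf.float32)
    tp = tf.reduce_sum(y_true * y_pred, axis=-1)
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=-1)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=-1)
    return 2.0 * tp / (2.0 * tp + fp + fn + 1e-12)

# The wrapper keeps a weighted running mean of the per-sample values.
f1_metric = tf.keras.metrics.MeanMetricWrapper(f1_from_batch, name='f1')
f1_metric.update_state([[0.0, 1.0], [1.0, 0.0]], [[0.1, 0.9], [0.8, 0.2]],
                       sample_weight=[0.5, 1.5])
print(f1_metric.result().numpy())

One caveat with this route: averaging per-sample (or per-batch) F1 values is not mathematically the same as the dataset-level F1 computed from accumulated true/false positive counts.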

@Squadrick
Member Author

I've added a very simple F1Test to check that it is the same as FBetaScore(beta=1.0), and since we test FBeta pretty extensively, IMO it should be fine.

@SSaishruthi
Contributor

@Squadrick I will start with sample weight addition after this PR gets merged

@PhilipMay
Contributor

I've added a very simple F1Test to check that it is the same as FBetaScore(beta=1.0), and since we test FBeta pretty extensively, IMO it should be fine.

Yes, thanks. That's what I meant: a small "smoke test".

@PhilipMay
Contributor

LGTM - the bug seems to be fixed now.

A different thought: tf.keras.metrics already implements many basic metrics: confusion matrix, precision, recall and so on. Wouldn't it be a good idea to build on them?
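For illustration, a hedged sketch of what building on the built-in metrics could look like for the binary case, reusing the data from the earlier comment (this is a sketch of the suggestion, not what the PR implements):

import tensorflow as tf

# Compose F1 from the built-in Precision and Recall metrics.
precision = tf.keras.metrics.Precision(thresholds=0.5)
recall = tf.keras.metrics.Recall(thresholds=0.5)

y_true = [[0], [1], [1], [1]]
y_pred = [[0.2], [0.3], [0.7], [0.9]]
precision.update_state(y_true, y_pred)
recall.update_state(y_true, y_pred)

p, r = precision.result(), recall.result()
f1 = 2 * p * r / (p + r + 1e-12)
print(float(f1))  # ~0.8 for this example, matching the earlier output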

@SSaishruthi
Contributor

Both work in the same way. We change the calculation according to the type. This seems to be the better way after investigation.

@Squadrick Squadrick added the wip Work in-progress label Sep 13, 2019
@Squadrick
Member Author

Both work in the same way. We change the calculation according to the type. This seems to be the better way after investigation.

I don't quite understand. What are the alternatives that work the same way?

@Squadrick
Member Author

@Squadrick I will start with sample weight addition after this PR gets merged

Like I mentioned above, I think a better approach would be to use MeanMetricWrapper and let it handle the sample_weight. I currently don't see an easy way of adding sample weight functionality to F-scores.

@facaiy
Member

facaiy commented Sep 15, 2019

Hi, thanks everyone :-) I haven't done a full review yet.

  1. For implementation:

@PhilipMay A different thought: tf.keras.metrics already implements many basic metrics: confusion matrix, precision, recall and so on.
@Squadrick a better approach would be to use MeanMetricWrapper and let it handle the sample_weight.

+1, I'm wondering if we can refer to the AUC metric.

Tests now compare with sklearn
Add sklearn to requirements

I prefer to keep dependencies minimal, and F1Score is not so complex that we cannot calculate it easily. Maybe we can learn from the test cases in #466.

In the case of "single-label categorical classification"
actuals = np.array([[0], [1], [1], [1]])

If I'm not wrong, tf.keras uses one-hot encoding for labels y_true and logits for y_pred by default. We can clarify the requirement in the documentation (like the AUC metric). If we really care about it, please refer to the subclass solution for accuracy: Accuracy, BinaryAccuracy, CategoricalAccuracy, SparseCategoricalAccuracy, etc.

What do you think, Philip, Saishruthi, Dheeraj? Thank you all for your contributions.

y_pred = y_pred > self.threshold

y_true = tf.cast(y_true, tf.int32)
y_pred = tf.cast(y_pred, tf.int32)
Contributor

This line is redundant with line 124, where the same operation is executed.

@Squadrick
Member Author

Squadrick commented Sep 30, 2019

I'd prefer using sklearn as the ground truth rather than hard-coding the values for two reasons:

  1. Adopting sklearn across tests for tfa.metrics will guarantee that any param style we borrow from sklearn will behave as expected. Example: the type of average in F-scores.
  2. Specifically, in this case, rewriting to include hard-coded values makes the code much longer with a lot of duplication, since we need to test different types of averages.

I'm open to hard-coding the results if you think that's the better approach. What do you all think? @WindQAQ
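For comparison, a sketch of what a hard-coded test case might look like, assuming the post-merge tfa.metrics.FBetaScore API (the exact signature is an assumption); the expected value 6/7 is worked out by hand in the comment, not taken from the PR's tests:

import numpy as np
import tensorflow_addons as tfa

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=np.float32)
y_pred = np.array([[0.9, 0.1, 0.0], [0.2, 0.6, 0.2],
                   [0.0, 0.3, 0.7], [0.3, 0.4, 0.3]], dtype=np.float32)

metric = tfa.metrics.FBetaScore(num_classes=3, average='micro',
                                beta=1.0, threshold=0.5)
metric.update_state(y_true, y_pred)

# With threshold 0.5 the last sample predicts no class, so TP=3, FP=0, FN=1
# and micro-F1 = 2*TP / (2*TP + FP + FN) = 6/7.
np.testing.assert_allclose(metric.result().numpy(), 6.0 / 7.0, atol=1e-6)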

@PhilipMay
Contributor

@Squadrick for me both ways are good. Both have pros and cons.

@facaiy
Member

facaiy commented Oct 8, 2019

I'm afraid that sklearn is too heavy. What do you think, @seanpmorgan @WindQAQ?

@PhilipMay
Contributor

@facaiy "too heavy" sounds very abstract for me. Can you explain what you think is the concrete disadvantage? Download needs too much time, installing docker image for testing needs too much time? What is it that makes you think "too heavy"?

@facaiy
Member

facaiy commented Oct 8, 2019

@PhilipMay Hi Philip, it's easy to add a new dependency but difficult (sometimes impossible) to remove one; that's why I suggest acting conservatively. Moreover, sklearn is a quite complicated Python wheel which has many dependencies of its own.

@PhilipMay
Contributor

Would it be a solution to split into test and install dependencies? See here: https://stackoverflow.com/questions/15422527/best-practices-how-do-you-list-required-dependencies-in-your-setup-py

@facaiy
Member

facaiy commented Oct 18, 2019

Sorry for the delay, Philip. I'm referring to test dependencies when I use 'dependency' above. Anyway, I'm not against the sklearn proposal if you insist :-) What do you think, Dheeraj @Squadrick ?

@WindQAQ
Member

WindQAQ commented Oct 18, 2019

I'm afraid that sklearn is too heavy. What do you think, @seanpmorgan @WindQAQ?

Agreed, +1. As a plugin/addons package, it would be great if we could keep the wheel lightweight. So in this case, if we can do the unit tests even without sklearn, I would think this dependency is not a must.

@PhilipMay
Contributor

Ok. So let’s do this without Sklearn. For me finishing this PR has priority anyway.

@seanpmorgan
Member

Ok. So let’s do this without Sklearn. For me finishing this PR has priority anyway.

Agreed, we've had a similar discussion before... both options have pros and cons, though we have precedent throughout the repo of using pre-calculated values.

* Add `threshold` param to f-scores
* Tests now compare with sklearn
* Add sklearn to requirements
* Register FBetaScore and F1Score as Keras custom objects
* Update readme to separate both metrics
* Resort to using hard-coded test cases rather than comparing with sklearn
@Squadrick
Member Author

Sorry about the delay; I've hardcoded the tests.

@PhilipMay @SSaishruthi @seanpmorgan

Member

@seanpmorgan seanpmorgan left a comment

LGTM thanks for the refactor!

@seanpmorgan seanpmorgan merged commit c3aba08 into tensorflow:master Nov 5, 2019
@Squadrick Squadrick deleted the metric-fixes branch November 5, 2019 03:53
@PhilipMay
Contributor

@Squadrick thanks for finalizing this. :-)
