[Discussion] There should be a Metrics package #973

@Darktex

🚀 Feature

Introduce a metrics package that contains easy-to-use reference implementations of the metrics people care about. The package should take care of the following:

  1. Logging and printing in the right format. The metric object should know how it wants to be printed, depending on the available surface (terminal vs. notebook vs. TensorBoard). For example, a ConfusionMatrix metric can use Pandas for the terminal (where it will print a nicely tabulated representation) and maybe a color-coded heatmap for notebooks and TensorBoard (see the sketch after this list).
  2. Handle the actual computation in all cases: CPU, single GPU, single-node DDP, and multi-node DDP.
  3. Support plugging into a MetricsReporter of some sort that will generate a full report of all the metrics you care about.
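
To make point 1 a bit more concrete, here is a minimal sketch of what a surface-aware confusion matrix could look like. The class name, its update()/render() methods, and the surface argument are all hypothetical names for this proposal, not an existing Lightning or PyTorch API:

import pandas as pd
import torch

class ConfusionMatrix:
    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.mat = torch.zeros(num_classes, num_classes, dtype=torch.long)

    def update(self, preds, target):
        # preds and target are 1-D tensors of class indices;
        # accumulate per-batch counts: rows are true classes, columns are predictions
        for p, t in zip(preds.view(-1), target.view(-1)):
            self.mat[t, p] += 1

    def render(self, surface='terminal'):
        if surface == 'terminal':
            # plain tabulated text via pandas
            return pd.DataFrame(self.mat.numpy()).to_string()
        # notebook / TensorBoard surfaces could return a color-coded
        # heatmap or an image tensor instead of text
        raise NotImplementedError(surface)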

This would be a rather large change, and I'm not sure what the best way to do it is. This issue is really meant to spur discussion on the topic. The actual solution might require some pieces to land in PyTorch, and that's fine.

Motivation

Metrics are a big component of reproducibility. They satisfy all the requirements you can think of to justify standardizing them:

  1. They are very hard to test. Kinda reminds me of the barber paradox: who measures the metrics? :) The only way to really test them is by checking corner cases and comparing against someone else's implementation (e.g. sklearn's), hoping you got your piping right.
  2. They are unsexy to implement so nobody wants to really spend time dealing with them. They are boilerplate, but necessary boilerplate.
  3. They are ambiguous. Even something as trivial as a variance is actually not trivial to compare! Do you divide by n or n-1 (i.e., do you use Bessel's correction)? NumPy defaults to not using it; MATLAB and PyTorch default to using it (see the example after this list). It's unsurprising to see threads like this.
  4. They are quite hard to implement for all cases: how do we compute a multi-node, multi-GPU ROC AUC score?
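
A quick illustration of point 3, using nothing but each library's default variance:

import numpy as np
import torch

x = [1.0, 2.0, 3.0, 4.0]
print(np.var(np.array(x)))         # 1.25   -> divides by n (no Bessel's correction)
print(torch.var(torch.tensor(x)))  # 1.6667 -> divides by n - 1 (Bessel's correction)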

Pitch

I think Lightning should take a page from Ignite's book and add a metrics package.

I also propose that Lightning take care of the following:

  1. Pretty printing. Lightning is the interface to TensorBoard and the loggers, so it should bridge the gap. For example, torch should provide something similar to the sklearn.metrics package: given a tensor x and a tensor y, compute the F1 score or whatever. Lightning should wrap that to handle attaching the result to TensorBoard, printing a metrics report table, etc.
  2. Attaching to events. PyTorch has no notion of events, so Lightning should handle that. This is essentially akin to writing the validation_step() and validation_end() steps for each metric, so that they can be computed efficiently per-batch.
  3. Handle DDP (not sure about this one). On the one hand, it breaks the abstraction that PyTorch computes and Lightning wraps; on the other hand, I don't see a way out: how you compute a metric necessarily changes once you introduce DDP into the mix. For example, accuracy can just keep a running count of correct predictions and of how many samples it has seen so far, but something like ROC AUC requires that all processes store their predictions (see the sketch after this list).
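
A rough sketch of how point 3 could look for the easy case (accuracy). The class and its update()/compute() methods are hypothetical; only standard torch.distributed primitives are used for the reduction:

import torch
import torch.distributed as dist

class Accuracy:
    def __init__(self):
        self.correct = torch.tensor(0.0)
        self.total = torch.tensor(0.0)

    def update(self, preds, target):
        # running counts are all accuracy needs
        self.correct += (preds.argmax(dim=-1) == target).sum().float()
        self.total += target.numel()

    def compute(self):
        # under DDP, sum the per-process counts before dividing
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(self.correct, op=dist.ReduceOp.SUM)
            dist.all_reduce(self.total, op=dist.ReduceOp.SUM)
        return self.correct / self.total

Something like ROC AUC cannot be reduced from running counts: each process would have to gather everyone's predictions and targets before computing, which is exactly the kind of detail users shouldn't have to get right themselves.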

In this proposal, the API for the LightningModule would be simplified significantly. For example, something like this:

import os

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl

# MetricsLog, Accuracy, Loss, F1Score and PrecisionAtRecall below are the
# proposed metric classes from this issue; they don't exist anywhere yet.


class CoolSystem(pl.LightningModule):

    def __init__(self):
        super(CoolSystem, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx, training_metrics):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        return {'loss': loss,}

    def configure_metrics(self):
        # OPTIONAL
        training_metrics = MetricsLog([Accuracy(), Loss()])
        eval_metrics = MetricsLog([F1Score('micro'), F1Score('macro'), F1Score(None), Accuracy(), PrecisionAtRecall(80)])
        return {'train_metrics': training_metrics, 'eval_metrics': eval_metrics}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

Alternative implementation

If we find a way to have all of the metrics computation code live in PyTorch, even for DDP, that would be highly preferable, I think. I just don't know if it's possible. Maybe if we formulate metrics as a layer of sorts we could do it? All standard layers have state that persists and gets updated across batches (their weights :D ), so maybe we could implement metrics as a sort-of nn.Module?
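
For what it's worth, here is a minimal sketch of that idea, assuming the running state simply lives in registered buffers so it moves with .to(device) and shows up in state_dict() just like weights do (AccuracyModule is a made-up name):

import torch
import torch.nn as nn

class AccuracyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # buffers persist across batches, like weights, but take no gradients
        self.register_buffer('correct', torch.tensor(0.0))
        self.register_buffer('total', torch.tensor(0.0))

    def forward(self, preds, target):
        # update the state on each batch and return the running value
        self.correct += (preds.argmax(dim=-1) == target).sum().float()
        self.total += target.numel()
        return self.correct / self.total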

Additional context

There is a separate discussion about providing the underlying muscle for this directly in torch (see pytorch/pytorch#22439).

Labels: discussion (In a discussion stage), feature (Is an improvement or enhancement), help wanted (Open to be worked on), let's do it! (approved to implement)
