🐛 Bug
I want the training step to also return the predicted probabilities so that I can calculate the F1 score over the entire epoch. To do so, I let the training step return the predictions and the ground-truth values, following the instructions here: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#train-epoch-level-operations.
The code sample there includes "loss" in the returned dictionary, yet nowhere did I read that you have to include the loss. But you do have to, otherwise the following exception is thrown:
```
Traceback (most recent call last):
  File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 163, in <module>
    model.fit(train_ds, valid_ds)
  File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 74, in fit
    self.trainer.fit(self.model, train_loader, val_loader)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 731, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 432, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 726, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 814, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 301, in training_step
    closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
```
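The last frame divides `training_step_output.minimize` by `self.trainer.accumulate_grad_batches`. The sketch below is a simplified, hypothetical model of that step, not Lightning's actual code: it assumes the loop takes the value to minimize from the `"loss"` key of the returned dict and silently gets `None` when that key is missing, which is enough to reproduce exactly this `TypeError`.

```python
# Simplified, hypothetical model of the failing step; this is NOT Lightning's code.
from types import SimpleNamespace


def process_training_step_output(output):
    """Pull the value to minimize out of whatever training_step returned.

    Assumption: a dict without a "loss" key yields None instead of an error.
    """
    if isinstance(output, dict):
        return SimpleNamespace(minimize=output.get("loss"), extra=output)
    # A bare value (e.g. a loss tensor) is used directly.
    return SimpleNamespace(minimize=output, extra={})


def closure_loss(output, accumulate_grad_batches=1):
    result = process_training_step_output(output)
    # Mirrors the last line of the traceback: None / int raises TypeError.
    return result.minimize / accumulate_grad_batches


print(closure_loss({"loss": 0.5, "preds": [0.1, 0.9]}))  # 0.5
try:
    closure_loss({"preds": [0.1, 0.9], "gt": [1]})
except TypeError as err:
    print(err)  # unsupported operand type(s) for /: 'NoneType' and 'int'
```

The missing key only surfaces later as an arithmetic error on `None`, which is why the message says nothing about `"loss"`.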
The code that produces this bug:
```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
import torchmetrics


class TransferLearner(pl.LightningModule):
    ...

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        self.log('train_loss', loss)
        preds = F.softmax(logits, dim=1)
        # The returned dict has no "loss" key, which triggers the TypeError above
        return {"preds": preds, "gt": y}

    def training_epoch_end(self, outputs):
        # outputs holds one dict per training step; concatenate them for the epoch
        preds = torch.cat([output["preds"] for output in outputs])
        gt = torch.cat([output["gt"] for output in outputs])
        f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
        self.log("train/f1_score", f1_score)
```

The only change I needed to make was to add the loss to the dictionary returned by `training_step`:
```python
def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    logits = self.forward(x)
    loss = self.cross_entropy_loss(logits, y)
    self.log('train_loss', loss)
    preds = F.softmax(logits, dim=1)
    # Returning the loss under the "loss" key makes the error go away
    return {"loss": loss, "preds": preds, "gt": y}

def training_epoch_end(self, outputs):
    preds = torch.cat([output["preds"] for output in outputs])
    gt = torch.cat([output["gt"] for output in outputs])
    f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
    self.log("train/f1_score", f1_score)
```

But I only stumbled upon this fix by chance. Neither the docs nor the error message made it clear to me what the problem was.
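An explicit check on the returned dict would have pointed straight at the problem. The helper below is a hypothetical sketch (the name `check_training_step_output` is made up and is not part of Lightning's API); it assumes that, with automatic optimization, a dict returned from `training_step` must carry the loss under `"loss"`, and it could be called on the dict just before returning it.

```python
def check_training_step_output(output):
    """Hypothetical guard: fail early with a readable message if "loss" is missing.

    Assumption: with automatic optimization, a dict returned from
    training_step must contain the loss under the "loss" key.
    """
    if isinstance(output, dict) and "loss" not in output:
        raise ValueError(
            'training_step returned a dict without a "loss" key; '
            f"got keys {sorted(output)}. Add the loss, e.g. "
            'return {"loss": loss, "preds": preds, "gt": y}'
        )
    # Non-dict outputs (e.g. a bare loss) and complete dicts pass through unchanged.
    return output


check_training_step_output({"loss": 0.3, "preds": [0.2, 0.8], "gt": [1]})
try:
    check_training_step_output({"preds": [0.2, 0.8], "gt": [1]})
except ValueError as err:
    print(err)
```

This fails at the point where the dict is built, with the offending keys listed, instead of deep inside the optimizer closure.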
The versions that I am using:

```python
pytorch_lightning.__version__
'1.3.1'
torch.__version__
'1.7.1'
```

I hope this information suffices; otherwise, please let me know what other information I should provide.
Thanks for this great repository by the way!!