When returning a value from the "training_step" function, one must specify "loss", otherwise an error results #7750

@Dehde

Description

🐛 Bug

I want the training step to also return the probabilities predicted in that step, so that I can calculate the F1 score for the entire epoch. To do so, I let the training step return the predictions and the ground-truth values, following the instructions here: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#train-epoch-level-operations.

The code sample there includes "loss" in the returned dictionary, yet nowhere did I read that you have to include it. But you do; otherwise an exception will be thrown:

    Traceback (most recent call last):
      File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 163, in <module>
        model.fit(train_ds, valid_ds)
      File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 74, in fit
        self.trainer.fit(self.model, train_loader, val_loader)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
        self._run(model)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
        self.dispatch()
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
        self.accelerator.start_training(self)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
        self.training_type_plugin.start_training(trainer)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
        self._results = trainer.run_stage()
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
        return self.run_train()
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
        self.train_loop.run_training_epoch()
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in run_training_epoch
        batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 731, in run_training_batch
        self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 432, in optimizer_step
        using_lbfgs=is_lbfgs,
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
        optimizer.step(closure=optimizer_closure)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
        self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
        trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
        self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
        self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
        optimizer.step(closure=lambda_closure, **kwargs)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
        return func(*args, **kwargs)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/optim/adam.py", line 66, in step
        loss = closure()
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 726, in train_step_and_backward_closure
        split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 814, in training_step_and_backward
        result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
      File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 301, in training_step
        closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
    TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
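
The last two frames show the cause: Lightning stores the "loss" entry of the training_step output as training_step_output.minimize, and with that key missing the attribute is None, so the division by accumulate_grad_batches fails. A minimal sketch of the failing arithmetic (names taken from the traceback; the value of accumulate_grad_batches is assumed):

    minimize = None                # no "loss" key in the training_step output
    accumulate_grad_batches = 1    # assumed Trainer default
    closure_loss = minimize / accumulate_grad_batches
    # TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'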

The code that reproduces this bug:

    import pytorch_lightning as pl
    import torch
    import torch.nn.functional as F
    import torchmetrics

    class TransferLearner(pl.LightningModule):
        ...

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            logits = self.forward(x)
            loss = self.cross_entropy_loss(logits, y)
            self.log('train_loss', loss)
            preds = F.softmax(logits, dim=1)
            # note: no "loss" key in the returned dict; this triggers the error above
            return {"preds": preds, "gt": y}

        def training_epoch_end(self, outputs):
            preds = torch.cat([output["preds"] for output in outputs])
            gt = torch.cat([output["gt"] for output in outputs])
            f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
            self.log("train/f1_score", f1_score)

The only change I needed to make was to add the loss to the returned dictionary:

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        self.log('train_loss', loss)
        preds = F.softmax(logits, dim=1)
        return {"loss": loss, "preds": preds, "gt": y}

    def training_epoch_end(self, outputs):
        preds = torch.cat([output["preds"] for output in outputs])
        gt = torch.cat([output["gt"] for output in outputs])
        f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
        self.log("train/f1_score", f1_score)

But I only stumbled upon this fix by chance; neither the docs nor the error message made it clear to me what the problem was.

The versions that I am using:

    pytorch_lightning.__version__  # '1.3.1'
    torch.__version__              # '1.7.1'

I hope this information suffices, otherwise please let me know what other information I should provide.
Thanks for this great repository by the way!!

Labels

bug (Something isn't working), docs (Documentation related), help wanted (Open to be worked on), refactor
