trainer.scale_batch_size() throws exception due to LRScheduler  #1889

@HansBambel

🐛 Bug

I tried to find the largest possible batch_size for my training, but PL raises a MisconfigurationException saying that my LRScheduler (ReduceLROnPlateau) is conditioned on a metric that is only available after validation_epoch_end. The metrics it reports as available are loss and train_loss.

I assume the LRScheduler requires a metric from the training loop for this to work? Why is this necessary?
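
For reference, a sketch of the only variant that would presumably avoid the exception: conditioning the scheduler on one of the metrics the error message lists, e.g. loss, which exists during the training loop. This is untested and not what I want, since the scheduler should follow meanIoU:

    def configure_optimizers(self):
        opt = optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
        scheduler = {
            # 'loss' is listed as available by the error message; monitoring it
            # should avoid the exception, but the scheduler then no longer
            # follows meanIoU, which defeats the purpose
            'scheduler': optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=5),
            'monitor': 'loss',
        }
        return [opt], [scheduler]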

To Reproduce

Steps to reproduce the behavior:

  1. Have a model with a metric that only exists in validation_epoch_end
  2. Have an LRScheduler which monitors that metric
  3. Use trainer.scale_batch_size
  4. See error
File "C:\ProgramData\Anaconda3\envs\ml\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 779, in update_learning_rates
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric meanIoU which is not available. Available metrics are: loss,train_loss. Condition can be set using `monitor` key in lr scheduler dict

Code sample

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=hparams.gpus)
new_batch_size = trainer.scale_batch_size(net, mode='binsearch', init_val=8)

and in my model:

    def configure_optimizers(self):
        opt = optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
        scheduler = {
            'scheduler': optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max", factor=0.5, patience=5),
            'monitor': 'meanIoU',  # Default: val_loss
        }
        return [opt], [scheduler]

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        iou_class, mean_iou = self.iou_metric.value()
        mean_iou = torch.tensor(mean_iou)
        self.iou_metric.reset()
        logs = {"val_loss": avg_loss, "meanIoU": mean_iou}
        return {"meanIoU": mean_iou, "log": logs,
                "progress_bar": {"val_loss": avg_loss, "meanIoU": mean_iou}}

Expected behavior

No exception is raised, and scale_batch_size returns the maximum batch_size for my model.

Environment

  • CUDA:
    • GPU:
      • GeForce RTX 2070 SUPER
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0
    • pytorch-lightning: 0.7.6
    • tensorboard: 2.1.0
    • tqdm: 4.45.0
  • System:
    • OS: Windows
    • architecture:
      • 64bit
      • WindowsPE
    • processor: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
    • python: 3.8.2
    • version: 10.0.18362

Additional context
