Closed
Labels
bug, help wanted
Description
🐛 Bug
I tried finding the biggest possible batch_size for my training, but PL raises a MisconfigurationException saying that my LRScheduler (ReduceLROnPlateau) is conditioned on a metric that is only available after validation_epoch_end. The available metrics are: loss, val_loss.
I assume the LRScheduler requires a metric from the training loop for this to work? Why is this necessary?
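For context on why the metric must exist at all: a plateau scheduler's `step()` consumes the monitored value directly, so without that value there is nothing for it to compare against. Below is a minimal plain-Python sketch of this mechanism (a toy stand-in for `ReduceLROnPlateau` with `mode="max"`, not PyTorch's actual implementation):

```python
class TinyReduceOnPlateau:
    """Toy stand-in for torch.optim.lr_scheduler.ReduceLROnPlateau (mode='max')."""

    def __init__(self, lr, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.num_bad_epochs = 0

    def step(self, metric):
        if metric is None:
            # The situation Lightning guards against with its
            # MisconfigurationException: no metric, nothing to compare.
            raise ValueError("monitored metric not available")
        if metric > self.best:
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        if self.num_bad_epochs > self.patience:
            self.lr *= self.factor
            self.num_bad_epochs = 0

sched = TinyReduceOnPlateau(lr=1e-3, factor=0.5, patience=2)
for miou in [0.50, 0.55, 0.55, 0.55, 0.55]:  # plateaus after the second epoch
    sched.step(miou)
print(sched.lr)  # 0.0005 -- halved once the plateau exceeds patience
```

So if the tuner's short run never produces `meanIoU`, the scheduler simply cannot take a step, which is presumably why Lightning raises instead of silently skipping it.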
To Reproduce
Steps to reproduce the behavior:
- Have a model with a metric that only exists in `validation_epoch_end`
- Have an LRScheduler which monitors that metric
- Use `trainer.scale_batch_size`
- See error
```
File "C:\ProgramData\Anaconda3\envs\ml\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 779, in update_learning_rates
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric meanIoU which is not available. Available metrics are: loss,train_loss. Condition can be set using `monitor` key in lr scheduler dict
```
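What this check amounts to (a toy reproduction for illustration, not the actual Lightning source): after the scaling run only training-loop metrics have been logged, so the scheduler's `monitor` key cannot be found among them:

```python
def check_monitor(monitor, available_metrics):
    """Toy version of Lightning's guard: fail if the scheduler's
    'monitor' key is not among the metrics logged so far."""
    if monitor not in available_metrics:
        raise KeyError(
            f"ReduceLROnPlateau conditioned on metric {monitor} which is not "
            f"available. Available metrics are: {','.join(sorted(available_metrics))}"
        )

# During scale_batch_size only the training loop has produced metrics,
# so a validation-only metric like 'meanIoU' is missing:
check_monitor("train_loss", {"loss", "train_loss"})  # passes silently
try:
    check_monitor("meanIoU", {"loss", "train_loss"})
except KeyError as e:
    print(e)  # mirrors the MisconfigurationException message above
```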
Code sample
```python
trainer = pl.Trainer(gpus=hparams.gpus)
new_batch_size = trainer.scale_batch_size(net, mode='binsearch', init_val=8)
```
and in my model:
```python
def configure_optimizers(self):
    opt = optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
    scheduler = {
        'scheduler': optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max", factor=0.5, patience=5),
        'monitor': 'meanIoU',  # Default: val_loss
    }
    return [opt], [scheduler]

def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    iou_class, mean_iou = self.iou_metric.value()
    mean_iou = torch.tensor(mean_iou)
    self.iou_metric.reset()
    logs = {"val_loss": avg_loss, "meanIoU": mean_iou}
    return {"meanIoU": mean_iou, "log": logs,
            "progress_bar": {"val_loss": avg_loss, "meanIoU": mean_iou}}
```
Expected behavior
No exception, and the maximum batch_size for my model.
Environment
- CUDA:
  - GPU:
    - GeForce RTX 2070 SUPER
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.18.1
  - pyTorch_debug: False
  - pyTorch_version: 1.4.0
  - pytorch-lightning: 0.7.6
  - tensorboard: 2.1.0
  - tqdm: 4.45.0
- System:
  - OS: Windows
  - architecture:
    - 64bit
    - WindowsPE
  - processor: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
  - python: 3.8.2
  - version: 10.0.18362
Additional context