Description
🐛 Bug
I have just come across what I consider to be a bug: when fit is called multiple times, the trainer continues training past max_epochs, but not past max_steps. E.g. if max_epochs is specified as 2, each fit call trains another 2 epochs, but with max_steps only the first fit call does any training.
To Reproduce
Reproduced on Colab using the BoringModel. Simply call Trainer.fit multiple times and observe that training happens on subsequent calls when max_epochs is specified, but not when max_steps is specified (minimal sketch below).
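Here is a minimal sketch of the reproduction (a trimmed-down BoringModel rather than the exact Colab notebook): the second fit call trains again when max_epochs is set, but does nothing when max_steps is set.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # any differentiable scalar works as a loss here
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


train_loader = DataLoader(RandomDataset(32, 64), batch_size=2)

# max_epochs: every fit call trains another 2 epochs
model = BoringModel()
trainer = Trainer(max_epochs=2)
trainer.fit(model, train_loader)  # trains 2 epochs
trainer.fit(model, train_loader)  # trains another 2 epochs

# max_steps: only the first fit call trains
model = BoringModel()
trainer = Trainer(max_steps=5)
trainer.fit(model, train_loader)  # trains 5 steps
trainer.fit(model, train_loader)  # no training happens
```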
Expected behavior
I think whatever behaviour is decided as correct should be consistent whether the number of iterations is specified in terms of epochs or steps. I personally think that multiple fit calls (which actually result in training) should be supported (related: #9636), so the behaviour for max_steps should be changed such that it trains another max_steps steps on every fit call, as sketched below.
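For illustration, this is the behaviour I would expect (hypothetical, not what currently happens; reusing model and train_loader from the reproduction above): each fit call extends training by another max_steps steps, mirroring how max_epochs behaves today.

```python
model = BoringModel()
trainer = Trainer(max_steps=5)

trainer.fit(model, train_loader)
assert trainer.global_step == 5   # first call trains 5 steps

trainer.fit(model, train_loader)
assert trainer.global_step == 10  # expected: second call trains another 5 steps
```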
Environment
- CUDA:
  - GPU:
  - available: False
  - version: None
- Packages:
  - numpy: 1.22.0
  - pyTorch_debug: False
  - pyTorch_version: 1.10.1
  - pytorch-lightning: 1.5.8
  - tqdm: 4.62.3
- System:
  - OS: Darwin
  - architecture:
    - 64bit
  - processor: i386
  - python: 3.8.12
  - version: Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:20 PDT 2021; root:xnu-7195.141.6~3/RELEASE_ARM64_T8101
Additional context
Also related to #7629 and #11426.
cc @tchaton @rohitgr7 @carmocca @justusschock @ananthsub @ninginthecloud