Description
🐛 Bug
When resume_from_checkpoint is passed to Trainer and training is then run (e.g. a call to trainer.fit()), the state used by trainer.test() is always the checkpoint initially given to resume_from_checkpoint, never the newer, better checkpoint produced during that run.
trainer = Trainer(resume_from_checkpoint="path_to_ckpt")  # pass ckpt to Trainer for resuming
trainer.fit(model)  # do some fine-tuning / resume training
trainer.test()  # should use the "best" checkpoint, but instead uses the ckpt passed to resume_from_checkpoint
Please reproduce using the BoringModel and post here
https://colab.research.google.com/drive/1ABXnUP10QUqHeUQmFy-FX26cV2w1JILA?usp=sharing
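For reference, a self-contained reproduction sketch in the spirit of the linked Colab is included below. It assumes PyTorch Lightning 1.1.x APIs; the RandomDataset and BoringModel classes here are illustrative stand-ins for the BoringModel used in the notebook, and the checkpoint resumed from is produced by the first run.

# Reproduction sketch (assumes pytorch-lightning ~1.1); model/dataset are stand-ins for BoringModel.
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self(batch).sum())

    def test_step(self, batch, batch_idx):
        self.log("test_loss", self(batch).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


train, val, test = (DataLoader(RandomDataset(), batch_size=32) for _ in range(3))
model = BoringModel()

# 1) Train once to produce a checkpoint to resume from.
checkpoint_cb = ModelCheckpoint(monitor="val_loss")
trainer = pl.Trainer(max_epochs=1, callbacks=[checkpoint_cb])
trainer.fit(model, train, val)
first_ckpt = checkpoint_cb.best_model_path

# 2) Resume from that checkpoint, train further, then test.
trainer = pl.Trainer(max_epochs=3, resume_from_checkpoint=first_ckpt,
                     callbacks=[ModelCheckpoint(monitor="val_loss")])
trainer.fit(model, train, val)
# Expected: test() restores the best checkpoint found during the resumed run.
# Observed: test() restores first_ckpt again, i.e. the resume_from_checkpoint path.
trainer.test(test_dataloaders=test)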
Expected behavior
After fine-tuning, the best model checkpoint should be looked up internally (as introduced by #2190) before evaluating on the test dataset.
Environment
- CUDA:
    - GPU:
        - Tesla T4
    - available: True
    - version: 10.1
- Packages:
    - numpy: 1.18.5
    - pyTorch_debug: True
    - pyTorch_version: 1.7.0+cu101
    - pytorch-lightning: 1.1.0
    - tqdm: 4.41.1
- System:
    - OS: Linux
    - architecture:
        - 64bit
    - processor: x86_64
    - python: 3.6.9
    - version: #1 SMP Thu Jul 23 08:00:38 PDT 2020
Additional context
A hotfix is to manually set trainer.resume_from_checkpoint = None between calls to trainer.fit() and trainer.test().
trainer = Trainer(resume_from_checkpoint="path_to_ckpt")  # pass ckpt to Trainer for resuming
trainer.fit(model)
trainer.resume_from_checkpoint = None  # drop the resume ckpt so test() can pick up the best one
trainer.test()
The cause of the issue is that Trainer.test() is implemented internally via a call to Trainer.fit(), so the checkpoint-restore logic runs again regardless of configuration.
Long term, the checkpoint passed via resume_from_checkpoint should most likely be consumed internally (i.e. reset to None) once the state has been restored. Alternatively, the Trainer.testing attribute could be used to restrict CheckpointConnector's use of Trainer.resume_from_checkpoint to the training stage only.
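A rough sketch of both options, for illustration only: the method and attribute names below (CheckpointConnector.restore_weights, trainer.resume_from_checkpoint, trainer.testing) mirror the 1.1 internals but this is not a verified patch against the real connector.

# Illustrative sketch only -- not a verified patch against the real CheckpointConnector.
class CheckpointConnector:
    def __init__(self, trainer):
        self.trainer = trainer

    def restore_weights(self, model):
        ckpt_path = self.trainer.resume_from_checkpoint
        # Option 2: skip the resume checkpoint entirely while testing.
        if ckpt_path is not None and not self.trainer.testing:
            self.restore(ckpt_path, on_gpu=self.trainer.on_gpu)
            # Option 1: consume the path once restored, so a later trainer.test()
            # falls back to the best checkpoint instead of restoring this one again.
            self.trainer.resume_from_checkpoint = None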