Trainer.test() in combination with resume_from_checkpoint is broken #5091

@ORippler

Description

🐛 Bug

When resume_from_checkpoint is passed to Trainer and training is then run (e.g. via a call to trainer.fit()), the state used by trainer.test() is always the checkpoint originally given to resume_from_checkpoint, never the newer, better one produced during training.

trainer = Trainer(resume_from_checkpoint="path_to_ckpt")  # pass ckpt to Trainer for resuming
trainer.fit(model)  # do some fine-tuning / resume training
trainer.test()  # should use the "best" checkpoint, but instead uses the ckpt passed to resume_from_checkpoint

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1ABXnUP10QUqHeUQmFy-FX26cV2w1JILA?usp=sharing
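
For convenience, a minimal, self-contained sketch of the reproduction is given below. RandomDataset and the BoringModel class here are simplified stand-ins written for this example (not imported from the Colab), and "path_to_ckpt" is a placeholder for a checkpoint produced by an earlier run.

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer

class RandomDataset(Dataset):
    # tiny random-tensor dataset standing in for the BoringModel data
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

class BoringModel(LightningModule):
    # minimal LightningModule mirroring the BoringModel from the bug-report template
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

model = BoringModel()
train_loader = DataLoader(RandomDataset(), batch_size=8)
test_loader = DataLoader(RandomDataset(), batch_size=8)

trainer = Trainer(max_epochs=2, resume_from_checkpoint="path_to_ckpt")
trainer.fit(model, train_loader)
# expected: test runs with the best checkpoint of this fit() call
# observed: the weights from "path_to_ckpt" are restored again
trainer.test(test_dataloaders=test_loader)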

Expected behavior

After fine-tuning, the best model checkpoint should be looked up internally (as introduced by #2190) and restored before running on the test dataset.
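
In other words, the expectation is roughly equivalent to looking up the best checkpoint manually and testing with it (illustrative only, reusing the BoringModel stand-in from the reproduction sketch above):

# what trainer.test() is expected to do internally (see #2190)
best_path = trainer.checkpoint_callback.best_model_path
best_model = BoringModel.load_from_checkpoint(best_path)
trainer.test(best_model, test_dataloaders=test_loader)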

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.0
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: 1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

A hotfix is to manually set trainer.resume_from_checkpoint = None between calls to trainer.fit() and trainer.test().

trainer = Trainer(resume_from_checkpoint="path_to_ckpt")  # pass ckpt to Trainer for resuming
trainer.fit(model)
trainer.resume_from_checkpoint = None  # clear the stale checkpoint path
trainer.test()
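
The same workaround can be wrapped in a small helper; test_with_best below is purely illustrative and not part of Lightning:

def test_with_best(trainer, test_dataloaders=None):
    # Clear the stale resume_from_checkpoint so that trainer.test()
    # (with its default ckpt_path="best") loads the best checkpoint
    # produced by the preceding trainer.fit() call.
    trainer.resume_from_checkpoint = None
    return trainer.test(test_dataloaders=test_dataloaders)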

The cause of the issue is that Trainer.test() is internally performed by a call to Trainer.fit() in all configurations, so the checkpoint passed via resume_from_checkpoint is restored once more for the test run.

Long term, the checkpoint passed via resume_from_checkpoint should most likely be consumed internally (i.e. reset to None) once the state has been restored. Alternatively, the Trainer.testing attribute could be used to restrict CheckpointConnector's use of Trainer.resume_from_checkpoint to the training state only. A rough sketch of these options is given below.
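
The following sketch shows what this could look like inside CheckpointConnector; the method and attribute names are assumptions made for illustration and may not match the actual 1.1.x source exactly.

# illustrative sketch, not the actual Lightning implementation
def restore_weights(self, model):
    # only consult resume_from_checkpoint while training, and "consume" it
    # once the state has been restored so that a later trainer.test() call
    # no longer re-loads the old checkpoint
    if self.trainer.resume_from_checkpoint is not None and not self.trainer.testing:
        self.restore(self.trainer.resume_from_checkpoint, on_gpu=self.trainer.on_gpu)
        self.trainer.resume_from_checkpoint = None  # consumed after restore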

Labels

bug (Something isn't working) · checkpointing (Related to checkpointing) · help wanted (Open to be worked on) · priority: 0 (High priority task) · waiting on author (Waiting on user action, correction, or update)
