ModelCheckpoint not working when monitor is logged in training_epoch_end #4797

@rohitgr7

🐛 Bug

checkpoint_callback = ModelCheckpoint(monitor='something logged in training_epoch_end')

if validation_step is overridden:
    if something is logged in training_step:
        checkpoint_callback throws an error
    else:
        works fine
else:
    works fine

Error

---------------------------------------------------------------------------

MisconfigurationException                 Traceback (most recent call last)

<ipython-input-12-1f9f6fbe4f6c> in <module>()
----> 1 test_x(tmpdir)

12 frames

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py in _validate_monitor_key(self, trainer)
    482                 f"HINT: Did you call self.log('{self.monitor}', tensor) in the LightningModule?"
    483             )
--> 484             raise MisconfigurationException(m)
    485 
    486     def _get_metric_interpolated_filepath_name(self, ckpt_name_metrics: Dict[str, Any], epoch: int, step: int):

MisconfigurationException: ModelCheckpoint(monitor='epoch_end_train_loss') not found in the returned metrics: ['train_loss']. HINT: Did you call self.log('epoch_end_train_loss', tensor) in the LightningModule?
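
The failing check can be illustrated without Lightning installed. The following is a minimal sketch, not the library's actual code: `validate_monitor_key` and the local `MisconfigurationException` class are hypothetical stand-ins written for this example, and the metric value is made up. It only assumes what the traceback shows, namely that the callback looks up the monitored key in the metrics available at save time and raises when the key is missing.

```python
from typing import Dict, Optional


class MisconfigurationException(Exception):
    """Local stand-in for the exception named in the traceback (assumption)."""


def validate_monitor_key(monitor: Optional[str], metrics: Dict[str, float]) -> None:
    """Hypothetical helper: raise if `monitor` is set but absent from `metrics`."""
    if monitor is None or monitor in metrics:
        return
    raise MisconfigurationException(
        f"ModelCheckpoint(monitor='{monitor}') not found in the returned metrics:"
        f" {list(metrics.keys())}."
        f" HINT: Did you call self.log('{monitor}', tensor) in the LightningModule?"
    )


# Metric names as listed in the error message above; the value is invented.
available_metrics = {"train_loss": 0.25}

try:
    # The key logged in training_epoch_end is not among the available metrics.
    validate_monitor_key("epoch_end_train_loss", available_metrics)
    message = ""
except MisconfigurationException as err:
    message = str(err)

print(message)
```

In this sketch the lookup fails because only `train_loss` (logged in training_step) is present when the check runs, which matches the metric list in the reported error.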

To Reproduce

https://colab.research.google.com/drive/1-koLiMLUfl5GwzMroflkOS-p_-tuhFab?usp=sharing

Expected behavior

checkpoint_callback should be able to monitor a metric logged in training_epoch_end.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.0-dev (master) or 1.0.7
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

Labels

  • bug: Something isn't working
  • help wanted: Open to be worked on
  • logging: Related to the `LoggerConnector` and `log()`
  • priority: 1: Medium priority task
