After I resume training with
pl_model = LightningModel.load_from_checkpoint(str(ckpt_path))
trainer = Trainer(
    resume_from_checkpoint=str(ckpt_path),  # full path; ckpt_path.name alone can miss the file when the cwd changes
    logger=instantiate(cfg.logger, experiment_key=cfg.experiment_id),
    checkpoint_callback=instantiate(cfg.checkpoint.model_checkpoint),
    callbacks=[instantiate(callback) for callback in cfg.callbacks],
)
trainer.fit(pl_model)
(instantiate here comes from facebook/hydra.)
the training process doesn't restore the best metric value and tracks it from scratch, so on the first epoch after resuming I get
INFO:lightning:Epoch 1129: val_mae reached 0.10418 (best 0.10418), saving model to
even though a better checkpoint existed before the interruption.