
Resuming training throws the mid-epoch warning every time #11029

@rohitgr7


Proposed refactor

Getting this warning:

UserWarning: You're resuming from a checkpoint that ended mid-epoch. Training will start from the beginning of the next epoch. This can cause unreliable results if further training is done, consider using an end of epoch checkpoint.

even when training is resumed from checkpoints that were saved at the end of an epoch.
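A minimal reproduction sketch, assuming a BoringModel-style module and the `ckpt_path` argument of `Trainer.fit` (directory and variable names here are illustrative):

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Any scalar works as a loss for the purpose of this reproduction.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_loader = DataLoader(RandomDataset(), batch_size=8)

# First run: the default ModelCheckpoint saves a checkpoint at the end of the epoch.
trainer = pl.Trainer(max_epochs=1, default_root_dir="demo")
trainer.fit(BoringModel(), train_loader)
ckpt = trainer.checkpoint_callback.best_model_path  # an end-of-epoch checkpoint

# Second run: resuming from that end-of-epoch checkpoint still emits the
# "ended mid-epoch" UserWarning.
trainer = pl.Trainer(max_epochs=2, default_root_dir="demo")
trainer.fit(BoringModel(), train_loader, ckpt_path=ckpt)
```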

The reason is that the total number of train batches is initialized to inf here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/5576fbc5f9a7d0bc71ad26b8b54775110e675808/pytorch_lightning/trainer/trainer.py#L647

and the dataloaders are only reloaded later, inside the fit loop, here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/5576fbc5f9a7d0bc71ad26b8b54775110e675808/pytorch_lightning/loops/fit_loop.py#L190-L193

so num_training_batches is still inf when this check runs:
https://github.com/PyTorchLightning/pytorch-lightning/blob/5576fbc5f9a7d0bc71ad26b8b54775110e675808/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L246-L253
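The check boils down to a modulo against num_training_batches. Paraphrased below (not the exact library code, and ignoring gradient accumulation): with the value still at inf, `global_step % inf` simply returns `global_step`, so the condition is true for any resumed run past the first step.

```python
# Paraphrased sketch of the restore-time check, assuming it is roughly
# "global_step % expected_steps > 1" (not the exact library code).
global_step = 500                      # value restored from the checkpoint
num_training_batches = float("inf")    # dataloaders not reloaded yet

expected_steps = num_training_batches  # gradient accumulation ignored here
print(global_step % expected_steps)      # 500.0 -> modulo by inf is a no-op
print(global_step % expected_steps > 1)  # True  -> warning fires every time
```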

Pitch

Either remove the warning, since it cannot be evaluated reliably with the current logic, or start adding a flag to every saved checkpoint indicating whether it was saved mid-epoch (a rough sketch of that idea is included below).
Or any better solutions?

Otherwise it will keep producing false-positive warnings for users.
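A possible shape for the flag idea, sketched with the public on_save_checkpoint/on_load_checkpoint hooks rather than the internal connector. The "saved_mid_epoch" key and the mid-epoch test are assumptions, and gradient accumulation is ignored:

```python
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_warn


class MidEpochFlagModel(pl.LightningModule):
    def on_save_checkpoint(self, checkpoint):
        # At save time num_training_batches holds the real, finite count,
        # so we can record whether the epoch had actually finished.
        # "saved_mid_epoch" is a hypothetical key used only for this sketch.
        n = self.trainer.num_training_batches
        checkpoint["saved_mid_epoch"] = (
            n not in (0, float("inf")) and self.trainer.global_step % n != 0
        )

    def on_load_checkpoint(self, checkpoint):
        # On restore, trust the stored flag instead of recomputing it from
        # num_training_batches (which is still inf at that point).
        if checkpoint.get("saved_mid_epoch", False):
            rank_zero_warn(
                "You're resuming from a checkpoint that ended mid-epoch. "
                "Training will start from the beginning of the next epoch."
            )
```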

Additional context


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @justusschock @awaelchli @akihironitta @ananthsub @ninginthecloud
