
Weird default filename of ModelCheckpoint: epoch number starts from 0 while step number starts from 1 #16636

@zhong-yy


Bug description

As described in the docs, the default filename of ModelCheckpoint is {epoch}-{step}. However, the step number in the saved filenames starts from 1 while the epoch number starts from 0, which is inconsistent.

ModelCheckpoint

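from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    save_top_k=-1,              # keep every checkpoint instead of only the best k
    filename="{epoch}-{step}",  # default filename
    every_n_train_steps=1,      # save every step
    monitor="train_loss",
    mode="min",
)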

Output:

epoch=0-step=1.ckpt
epoch=0-step=2.ckpt
epoch=0-step=3.ckpt
epoch=0-step=4.ckpt
epoch=0-step=5.ckpt
...

I found a related discussion posted in February 2022, but in that post the step number seemed to start from 0. Has the naming convention changed since then? I am not sure whether I am missing something, because I am still new to pytorch-lightning.
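
For illustration, here is a minimal sketch of one way to read the output above. It rests on an assumption that is not confirmed in this thread: that `{step}` holds the number of optimizer steps already completed when the checkpoint is written, while `{epoch}` is a zero-based index of the current epoch. The function `default_checkpoint_names` is a hypothetical helper written for this example and is not part of Lightning.

```python
def default_checkpoint_names(num_epochs, batches_per_epoch):
    """Simulate '{epoch}-{step}' filenames when saving after every training step.

    Assumption (not confirmed by the library): `step` counts completed
    optimizer steps, and `epoch` is a zero-based epoch index.
    """
    names = []
    global_step = 0  # number of optimizer steps completed so far
    for epoch in range(num_epochs):  # zero-based epoch index
        for _ in range(batches_per_epoch):
            # The optimizer step finishes first, then the checkpoint is saved,
            # so the first file already carries step=1.
            global_step += 1
            names.append(f"epoch={epoch}-step={global_step}.ckpt")
    return names


names = default_checkpoint_names(num_epochs=2, batches_per_epoch=3)
for name in names:
    print(name)
```

Under that assumption the first file is named epoch=0-step=1.ckpt, which matches the listing above, and the step count keeps increasing across epochs instead of resetting.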

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, ModelCheckpoint):
#- PyTorch Lightning Version: 1.9.0
#- PyTorch Version: 1.13.1
#- Python version: 3.9
#- OS: Ubuntu 22.04
#- How you installed Lightning(`conda`, `pip`, source): pip

cc @awaelchli


    Labels

    checkpointing (Related to checkpointing), question (Further information is requested)
