
ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch #11979

@ShaneTian

🐛 Bug

ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch.
Likewise, PyTorch Lightning does NOT run the validation loop if val_check_interval is greater than the number of training steps in an epoch.

To Reproduce

In my experiment, the number of training steps in an epoch is about 110.

  • If I set every_n_train_steps and val_check_interval to 100, both ModelCheckpoint and the validation loop work as expected.
import pytorch_lightning as pl

log_steps = 100
valid_steps = 100
save_steps = 100

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=save_steps
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps
)
  • If I set every_n_train_steps and val_check_interval to 120, both fail: ModelCheckpoint does not save anything, and the validation loop is never run.
log_steps = 120
valid_steps = 120
save_steps = 120

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=save_steps
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps
)
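
To make the symptom concrete: what I see looks as if the "every N steps" counter restarted at the beginning of each epoch, so an interval of 120 is never reached within 110 steps. The sketch below is only a standalone illustration of that assumption, not Lightning's actual code; `fired_steps` and `reset_each_epoch` are hypothetical names made up for this example, and I have not confirmed that this is the real cause.

```python
def fired_steps(steps_per_epoch, interval, epochs, reset_each_epoch):
    """Return the global steps (1-based) at which an 'every N steps' trigger fires."""
    fired = []
    global_step = 0
    for _ in range(epochs):
        for batch_idx in range(steps_per_epoch):
            global_step += 1
            # Hypothetical: the counter either restarts every epoch
            # or keeps counting across epoch boundaries.
            counter = batch_idx + 1 if reset_each_epoch else global_step
            if counter % interval == 0:
                fired.append(global_step)
    return fired


# ~110 steps per epoch, as in my experiment, over 3 epochs
print(fired_steps(110, 100, 3, reset_each_epoch=True))   # [100, 210, 320]
print(fired_steps(110, 120, 3, reset_each_epoch=True))   # []
print(fired_steps(110, 120, 3, reset_each_epoch=False))  # [120, 240]
```

The last call is the behavior I would have expected with an interval of 120: the trigger fires at global steps 120 and 240, regardless of where the epoch boundaries fall.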

Expected behavior

Is this expected? Is it not possible to set these two parameters to a value larger than the number of training steps in one epoch?

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.5.10
  • PyTorch Version (e.g., 1.10): 1.10.2+cu113
  • Python version (e.g., 3.9): 3.7.10
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: CUDA V11.0.221
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): pip

cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7
