🐛 Bug
`ModelCheckpoint` does NOT save anything if `every_n_train_steps` is greater than the number of training steps in an epoch.
PyTorch Lightning also does NOT call the validation loop if `val_check_interval` is greater than the number of training steps in an epoch.
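My guess (not verified against Lightning's source) is that the interval is compared against a counter that resets at each epoch boundary, so a value larger than the epoch length can never trigger. A minimal standalone sketch of that hypothesis; `batches_per_epoch` and `interval` are my own names:

```python
# Hypothetical illustration of the suspected behavior, NOT Lightning's code:
# if the trigger is compared against a batch counter that resets each epoch,
# an interval larger than the epoch length never fires.
batches_per_epoch = 110  # roughly my epoch length
for interval in (100, 120):
    fired = 0
    for epoch in range(3):
        for batch_idx in range(batches_per_epoch):  # counter resets per epoch
            if (batch_idx + 1) % interval == 0:
                fired += 1
    print(f"interval={interval}: fired {fired} times")
# interval=100 fires once per epoch (3 times total); interval=120 never fires,
# which matches what I observe.
```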
To Reproduce
In my experiment, the number of training steps in an epoch is about 110.
- If I set `every_n_train_steps` and `val_check_interval` to 100, the `ModelCheckpoint` and the validation loop work as expected:
```python
import pytorch_lightning as pl

log_steps = 100
valid_steps = 100
save_steps = 100

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=save_steps,
)
trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,  # logger is defined elsewhere in my script
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps,
)
```

- If I set `every_n_train_steps` and `val_check_interval` to 120, both fail: `ModelCheckpoint` does not save anything, and PyTorch Lightning never calls the validation loop.
```python
log_steps = 120
valid_steps = 120
save_steps = 120

ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="T5-{epoch}-{step}-{val_loss:.2f}-{val_ppl:.2f}",
    monitor="val_loss",
    save_top_k=-1,
    every_n_train_steps=save_steps,
)
trainer = pl.Trainer(
    gpus=-1,
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[ckpt_callback],
    max_epochs=200,
    log_every_n_steps=log_steps,
    val_check_interval=valid_steps,
)
```

Expected behavior
Is this expected? Can't these two parameters be set to values larger than the number of training steps in one epoch?
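As a possible workaround, saving on `trainer.global_step` (which does not reset between epochs) from a custom callback seems to sidestep the per-epoch limit. This is an untested sketch; `GlobalStepCheckpoint`, `dirpath`, and `every_n_steps` are my own names, and `*args` absorbs hook arguments that vary slightly across Lightning versions:

```python
import pytorch_lightning as pl

class GlobalStepCheckpoint(pl.Callback):
    """Save a checkpoint every `every_n_steps` optimizer steps, counted
    globally across epochs rather than within a single epoch."""

    def __init__(self, dirpath, every_n_steps=120):
        self.dirpath = dirpath
        self.every_n_steps = every_n_steps
        self._last_saved = -1  # guard against saving twice at the same step

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, *args):
        step = trainer.global_step
        if step > 0 and step % self.every_n_steps == 0 and step != self._last_saved:
            trainer.save_checkpoint(f"{self.dirpath}/step={step}.ckpt")
            self._last_saved = step

# usage: trainer = pl.Trainer(callbacks=[GlobalStepCheckpoint("ckpts", 120)], ...)
```

If I understand the docs correctly, newer releases also let an integer `val_check_interval` cross epoch boundaries when `check_val_every_n_epoch=None` is passed to the `Trainer`, but that does not seem to be available in 1.5.x.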
Environment
- PyTorch Lightning Version (e.g., 1.5.0): 1.5.10
- PyTorch Version (e.g., 1.10): 1.10.2+cu113
- Python version (e.g., 3.9): 3.7.10
- OS (e.g., Linux): Linux
- CUDA/cuDNN version: CUDA V11.0.221
- GPU models and configuration:
- How you installed PyTorch (`conda`, `pip`, source): `pip`
cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7