🐛 Bug
When using DDP with 2 GPUs and logging the validation loss in validation_step with self.log('val_loss', loss, sync_dist=True), the ModelCheckpoint callback embeds a validation loss in the filename that is multiplied by 2 (the number of GPUs?). This happens in Lightning 1.2.0.
This is the message printed by the ModelCheckpoint callback:
Epoch 0, global step 0: val_loss reached 2.20627 (best 2.20627), saving model to "some_path/epoch=0-val_loss=4.41254.ckpt" as top 1
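The filename value is exactly the reported value times the number of GPUs used in this run, which is what "multiplied by 2" above refers to:

# 2 GPUs were used in this run
assert round(2.20627 * 2, 5) == 4.41254  # value in the message vs. value in the filename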
To Reproduce
import os

import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
# BoringModel and RandomDataset come from Lightning's test helpers
# (the import path may differ depending on the Lightning version)
from tests.helpers.boring_model import BoringModel, RandomDataset


def test_run():
    class TestModel(BoringModel):
        def validation_step(self, batch, batch_idx) -> None:
            output = self.layer(batch)
            loss = self.loss(batch, output)
            # sync_dist=True reduces the value across the 2 DDP processes
            self.log('val_loss', loss, sync_dist=True)

        def validation_epoch_end(self, outputs) -> None:
            pass

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        accelerator='ddp',
        gpus=-1,
        callbacks=[
            ModelCheckpoint(
                dirpath=os.getcwd(),
                filename='{epoch}-{val_loss:.5f}',
                monitor='val_loss',
                verbose=True,
            )
        ],
    )
    trainer.fit(model, train_data, val_data)
Expected behavior
The loss embedded in the filename should be the same as the loss in the message and logger.
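One way to see the mismatch after the reproduction above has run (a minimal sketch; trainer.checkpoint_callback, trainer.callback_metrics, best_model_score and best_model_path are standard Lightning attributes) is to compare the monitored metric with the value formatted into the filename:

# sketch: compare the synced metric with the value embedded in the filename
checkpoint_cb = trainer.checkpoint_callback
print(trainer.callback_metrics['val_loss'])  # 2.20627, the value reported in the message
print(checkpoint_cb.best_model_score)        # 2.20627 as well
print(checkpoint_cb.best_model_path)         # ends with val_loss=4.41254.ckpt instead of val_loss=2.20627.ckpt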
Environment
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, source): pip
- Python version: 3.8.6
- CUDA/cuDNN version: 11.0
- GPU models and configuration: 2 * GeForce RTX 2080 Ti