Validation loss saved in filename by ModelCheckpoint is incorrect when using DDP with multiple GPUs #6138

@dpieczynski

Description

🐛 Bug

When using DDP with 2 GPUs and logging the validation loss in validation_step with self.log('val_loss', loss, sync_dist=True), the ModelCheckpoint callback embeds a validation loss in the filename that is multiplied by 2 (the number of GPUs?). This happens in Lightning 1.2.0.

This is a message printed by ModelCheckpoint callback:

Epoch 0, global step 0: val_loss reached 2.20627 (best 2.20627), saving model to "some_path/epoch=0-val_loss=4.41254.ckpt" as top 1
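The filename value is exactly twice the logged value, which is consistent with the loss being sum-reduced across the 2 processes instead of mean-reduced. A minimal sketch of the arithmetic, assuming (for illustration only) that both ranks report the same loss:

```python
# Assumption: both ranks report the same per-rank validation loss.
# The value is taken from the log message above, purely for illustration.
losses_per_rank = [2.20627, 2.20627]

# Mean reduction: matches the value shown in the ModelCheckpoint log message.
mean_reduced = sum(losses_per_rank) / len(losses_per_rank)

# Sum reduction: matches the value embedded in the checkpoint filename.
sum_reduced = sum(losses_per_rank)

print(mean_reduced)  # 2.20627
print(sum_reduced)   # 4.41254
```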

To Reproduce

import os

import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
# BoringModel and RandomDataset come from the Lightning test suite
# (tests/helpers/boring_model.py at the time of this report)
from tests.helpers.boring_model import BoringModel, RandomDataset


def test_run():

    class TestModel(BoringModel):

        def validation_step(self, batch, batch_idx) -> None:
            output = self.layer(batch)
            loss = self.loss(batch, output)
            self.log('val_loss', loss, sync_dist=True)

        def validation_epoch_end(self, outputs) -> None:
            pass

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        accelerator='ddp',
        gpus=-1,
        callbacks=[ModelCheckpoint(dirpath=os.getcwd(), filename='{epoch}-{val_loss:.5f}', monitor='val_loss',
                                   verbose=True)]
    )

    trainer.fit(model, train_data, val_data)

Expected behavior

The loss embedded in the filename should be the same as the loss in the message and logger.
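As a point of reference, if the metrics used to fill the filename template already held the mean-synced value from the log message, the formatted name would agree with that message. The snippet below is an illustrative sketch of that formatting, not Lightning's internal code; the metrics dict is hypothetical:

```python
# Hypothetical metrics dict: val_loss is the mean-synced value from the
# log message, not Lightning's internal state.
metrics = {"epoch": 0, "val_loss": 2.20627}

# The same '{epoch}-{val_loss:.5f}' template that was passed to ModelCheckpoint.
filename = "{epoch}-{val_loss:.5f}".format(**metrics)

print(filename)  # 0-2.20627
```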

Environment

  • PyTorch Lightning Version: 1.2.0
  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.8.6
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: 2 * GeForce RTX 2080 Ti

Metadata

Assignees

No one assigned

Labels

  • bug (Something isn't working)
  • checkpointing (Related to checkpointing)
  • distributed (Generic distributed-related topic)
  • help wanted (Open to be worked on)
  • priority: 1 (Medium priority task)
