
Hang when using Lightning CLI from config file and DDP #11158

@gau-nernst

Description

🐛 Bug

When I use Lightning CLI to run training from a YAML config file with DDP, I get the error below and the process hangs.

RuntimeError: SaveConfigCallback expected xxxx/lightning_logs/version_5/config.yaml to NOT exist. Aborting to avoid overwriting results of a previous run. You can delete the previous config file, set `LightningCLI(save_config_callback=None)` to disable config saving, or set `LightningCLI(save_config_overwrite=True)` to overwrite the config file.

Ctrl+C does not work; I have to kill the process manually.

I suspect this happens because each DDP process tries to write the same config file to disk. Passing either save_config_callback=None or save_config_overwrite=True, as suggested in the error message, resolves the issue (see the sketch below).
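For reference, this is roughly what either workaround looks like when applied to the reproduction script further down (a minimal sketch; the keyword arguments are exactly the ones named in the error message, and the file names bug.py / workaround.py are just the ones used in this report):

# workaround.py - run in place of bug.py; only the LightningCLI call changes
from pytorch_lightning.utilities.cli import LightningCLI

from bug import BoringModel  # BoringModel from the reproduction script below

if __name__ == "__main__":
    # Option 1: disable saving of config.yaml entirely
    # LightningCLI(BoringModel, save_config_callback=None)

    # Option 2: let SaveConfigCallback overwrite an existing config.yaml
    LightningCLI(BoringModel, save_config_overwrite=True)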

To Reproduce

# bug.py
import torch
from torch.utils.data import Dataset, DataLoader

from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.cli import LightningCLI

class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def val_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

if __name__ == "__main__":
    LightningCLI(BoringModel)

# bug.yaml
trainer:
  gpus: 2
  strategy: ddp
  max_epochs: 10

Run Lightning CLI

python bug.py fit --config bug.yaml

Expected behavior

There should be no error by default, i.e. without explicitly passing save_config_callback=None or save_config_overwrite=True to LightningCLI.

Environment

  • CUDA:
    - GPU:
      - GeForce RTX 3090
      - GeForce RTX 3090
      - GeForce RTX 3090
      - GeForce RTX 3090
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.21.2
    - pyTorch_debug: False
    - pyTorch_version: 1.10.0
    - pytorch-lightning: 1.5.5
    - tqdm: 4.62.3
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.8.12
    - version: #86~18.04.1-Ubuntu SMP Fri Jun 18 01:23:22 UTC 2021

Additional context

cc @carmocca @mauvilsa
