Description
🐛 Bug
When I use the Lightning CLI to launch training from a YAML config file with DDP, I get the error below and the process hangs.
RuntimeError: SaveConfigCallback expected xxxx/lightning_logs/version_5/config.yaml to NOT exist. Aborting to avoid overwriting results of a previous run. You can delete the previous config file, set `LightningCLI(save_config_callback=None)` to disable config saving, or set `LightningCLI(save_config_overwrite=True)` to overwrite the config file.
Ctrl + C does not work; I have to kill the process manually.
I suspect this happens because each DDP process tries to write the same config file to disk. Passing either save_config_callback=None or save_config_overwrite=True, as suggested in the error message, works around the issue (sketch below).
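For reference, a minimal sketch of the two workarounds named in the error message, applied to the BoringModel from the reproduction script below:

# Workaround sketch: both keyword arguments are the ones the error message suggests.
from pytorch_lightning.utilities.cli import LightningCLI

# Option 1: disable config saving entirely
LightningCLI(BoringModel, save_config_callback=None)

# Option 2: keep config saving, but allow overwriting an existing config.yaml
# LightningCLI(BoringModel, save_config_overwrite=True)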
To Reproduce
# bug.py
import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.cli import LightningCLI


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def val_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    LightningCLI(BoringModel)

# bug.yaml
trainer:
  gpus: 2
  strategy: ddp
  max_epochs: 10

Run the Lightning CLI:

python bug.py fit --config bug.yaml

Expected behavior
There should be no error by default, without the user having to explicitly pass save_config_callback=None or save_config_overwrite=True to the LightningCLI.
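One possible fix (my assumption about an approach, not necessarily how SaveConfigCallback is implemented) would be to guard the existence check and the write so that only the rank-zero process runs them, e.g. with Lightning's existing rank_zero_only utility. save_config below is a hypothetical helper for illustration:

import os

from pytorch_lightning.utilities import rank_zero_only


@rank_zero_only
def save_config(config_path: str, serialized_config: str) -> None:
    # Hypothetical helper: rank_zero_only skips the body on every DDP process
    # except global rank 0, so only one process performs the existence check
    # and the write, and the processes cannot race on the same file.
    if os.path.exists(config_path):
        raise RuntimeError(f"{config_path} already exists.")
    with open(config_path, "w") as f:
        f.write(serialized_config)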
Environment
- CUDA:
- GPU:
- GeForce RTX 3090
- GeForce RTX 3090
- GeForce RTX 3090
- GeForce RTX 3090
- available: True
- version: 11.3
- Packages:
- numpy: 1.21.2
- pyTorch_debug: False
- pyTorch_version: 1.10.0
- pytorch-lightning: 1.5.5
- tqdm: 4.62.3
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.12
- version: #86~18.04.1-Ubuntu SMP Fri Jun 18 01:23:22 UTC 2021