🐛 Bug
A user from our supercomputing center ran into an issue which I think turned out to be a bug in PyTorch-Lightning.
When using the DDP accelerator together with a logger, multiple processes try to create the logger directory tree, causing errors about already existing directories or files.
Troubleshooting
PyTorch-Lightning makes extensive use of the rank_zero_only decorator to ensure that certain actions are performed only by the process with rank 0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L35-L42
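For context, the decorator boils down to roughly the following (a minimal sketch based on the linked lines, not a verbatim copy):

```python
from functools import wraps

def rank_zero_only(fn):
    """Call ``fn`` only when ``rank_zero_only.rank`` is 0; otherwise do nothing."""
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn
```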
The default value of rank_zero_only.rank is set here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L45-L46
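Sketching the pattern (continuing the snippet above; the exact environment fallback below is my guess, not verbatim):

```python
import os

def _get_rank_from_env() -> int:
    # placeholder for however those two lines derive a default rank; 0 if nothing is set
    return int(os.environ.get("LOCAL_RANK", 0))

# don't overwrite the rank in case the Trainer has already set it
rank_zero_only.rank = getattr(rank_zero_only, "rank", _get_rank_from_env())
```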
but it can also be set by other modules, for example, in our case, by the DDP plugin:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/plugins/training_type/ddp.py#L227-L228
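If I read the plugin correctly, the assignment there is essentially the following (sketch; the class layout and method name are made up for illustration, only the last line matters):

```python
class DDPPlugin:
    global_rank: int = 0  # set from the cluster environment once torch.distributed is up

    def set_ranks(self) -> None:  # hypothetical method name
        # propagate the real rank so that rank_zero_only() stops treating this process as rank 0
        rank_zero_only.rank = self.global_rank
```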
Unfortunately, it seems that the initialization done by the DDP plugin happens too late, I think because of commit da6dbc8:
- self.setup_trainer(model) gets called on line 467, effectively initializing the logger and creating the logger directory tree
- DDP initialization, and thus rank_zero_only.rank getting its correct value, only happens at line 477, when self.training_type_plugin.pre_training() is called.
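To make the consequence concrete, here is a small self-contained illustration; create_log_dir and the path are invented for the example, and the two trailing lines only mimic the ordering described above:

```python
import os
from pytorch_lightning.utilities.distributed import rank_zero_only

@rank_zero_only
def create_log_dir(path: str) -> None:
    os.makedirs(path)  # raises FileExistsError if another process already created it

# Ordering inside fit() as I understand it after commit da6dbc8 (schematic):
create_log_dir("lightning_logs/version_0")  # setup_trainer(): every process still sees rank 0
rank_zero_only.rank = 1                     # pre_training(): the real rank is set only now, too late
```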
To Reproduce
I have attached the code the user provided together the Slurm script: only_rank_zero.tar.gz.
I understand that you would prefer a BoringModel- and Colab-based reproducer, but I come from the HPC world and am not used to those. Let me know if I can help in any other way. I hope that my own digging into the code will help.
Environment (probably not relevant in this case)
- PyTorch Version: 1.7.1
- OS: Linux (Red Hat 8.1)
- How you installed PyTorch: conda. I tried the latest version of PyTorch-Lightning available on conda, but also tested installing the current master branch from source; the behavior is the same.
- Python version: 3.7.10
- CUDA/cuDNN version: 11.0.221/8.0.5
- GPU models and configuration: NVIDIA V100