
DDP: Multiple processes try to create the logger directory tree #6364

@RemiLacroix-IDRIS

Description

🐛 Bug

A user from our supercomputing center ran into an issue which I think turned out to be a bug in PyTorch-Lightning.

When using the DDP accelerator together with a logger, multiple processes try to create the logger directory tree, causing errors about already-existing directories or files.

Troubleshooting

PyTorch-Lightning makes extensive use of the rank_zero_only decorator to ensure that some actions are performed only by the process with rank 0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L35-L42

The default value of rank_zero_only.rank is set here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L45-L46
but it can be overridden by other modules, in our case the DDP plugin:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/plugins/training_type/ddp.py#L227-L228
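For readers who don't want to follow the links, here is a minimal sketch of the pattern used in the linked source. The decorator name and the `rank` attribute match the real code; the function bodies and the exact environment-variable fallback are simplified for illustration:

```python
# Simplified sketch of pytorch_lightning/utilities/distributed.py.
# rank_zero_only is a decorator that silently skips the wrapped call
# on every process except the one whose rank attribute is 0.
import os
from functools import wraps


def rank_zero_only(fn):
    """Run the wrapped function only on the process with rank 0."""
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
        # On non-zero ranks the call is a no-op returning None.
    return wrapped_fn


# Default rank: read from the environment, falling back to 0 when unset.
# (The real code reads LOCAL_RANK; other launchers may not set it.)
rank_zero_only.rank = int(os.environ.get("LOCAL_RANK", 0))


@rank_zero_only
def make_log_dir(path):
    # Stand-in for the logger creating its directory tree.
    os.makedirs(path, exist_ok=False)
```

The key detail is that the guard reads `rank_zero_only.rank` at call time, so the decorator only works if the correct rank has been assigned before the first guarded call.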

Unfortunately, it seems that the initialization by the DDP plugin happens too late, I think because of commit da6dbc8:

  • self.setup_trainer(model) is called on line 467, effectively initializing the logger and creating the logger directory tree;
  • DDP initialization, and thus the assignment of the correct value to rank_zero_only.rank, only happens at line 477, when self.training_type_plugin.pre_training() is called.
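The two steps above can be condensed into a hypothetical timeline showing why the ordering matters. Everything here is illustrative (the helper names `create_log_dir` and `created` are mine, not Lightning's); the point is that a guarded action executed before the plugin assigns the real rank runs on every process, because each one still sees the default rank of 0:

```python
# Timeline sketch: both "steps" run in every DDP process. Here we play the
# role of a non-zero-rank worker whose real rank (1) is assigned too late.
from functools import wraps


def rank_zero_only(fn):
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn


rank_zero_only.rank = 0  # default, e.g. when LOCAL_RANK is not in the environment

created = []


@rank_zero_only
def create_log_dir():
    created.append("logs/version_0")  # stands in for os.makedirs(...)


# Step 1 -- self.setup_trainer(model): the logger is initialized before DDP
# setup, so this worker still believes it is rank 0 and creates the directory
# tree, colliding with the directory the real rank 0 creates.
create_log_dir()

# Step 2 -- self.training_type_plugin.pre_training(): only now does the DDP
# plugin assign the real rank -- too late to prevent the mkdir above.
rank_zero_only.rank = 1
```

If the assignment in step 2 happened before step 1, `create_log_dir()` would be skipped on this worker and only rank 0 would touch the filesystem.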

To Reproduce

I have attached the code the user provided together with the Slurm script: only_rank_zero.tar.gz.

I understand that you would prefer a BoringModel- and Colab-based reproducer, but I am from the HPC world and not used to those. Let me know if I can help in any other way. I hope that my own digging into the code will help.

Environment (probably not relevant in this case)

  • PyTorch Version: 1.7.1
  • OS: Linux (Red Hat 8.1)
  • How you installed PyTorch: conda. I tried the latest version of PyTorch-Lightning available on conda, and also installing the current master branch from source; the behavior is the same in both cases.
  • Python version: 3.7.10
  • CUDA/cuDNN version: 11.0.221/8.0.5
  • GPU models and configuration: NVIDIA V100

Labels

bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task)
