🐛 Bug
A user from our supercomputing center ran into an issue which I think turned out to be a bug in PyTorch-Lightning.
When using the DDP accelerator together with a logger, multiple processes try to create the logger directory tree, causing errors about already existing directories or files.
Troubleshooting
PyTorch-Lightning makes extensive use of the rank_zero_only decorator to ensure that certain actions are performed only by the process with rank 0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L35-L42
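For context, the decorator boils down to roughly the following (a minimal sketch based on the linked lines, not a verbatim copy):

```python
from functools import wraps

def rank_zero_only(fn):
    """Call ``fn`` only when ``rank_zero_only.rank`` is 0; otherwise do nothing."""
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn
```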
The default value of rank_zero_only.rank is set here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L45-L46
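Sketching the pattern (continuing the snippet above; the exact environment fallback below is my guess, not verbatim):

```python
import os

def _get_rank_from_env() -> int:
    # placeholder for however those two lines derive a default rank; 0 if nothing is set
    return int(os.environ.get("LOCAL_RANK", 0))

# don't overwrite the rank in case the Trainer has already set it
rank_zero_only.rank = getattr(rank_zero_only, "rank", _get_rank_from_env())
```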
but it can also be set by other modules, for example, in our case, by the DDP plugin:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/plugins/training_type/ddp.py#L227-L228
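If I read the plugin correctly, the assignment there is essentially the following (sketch; the class layout and method name are made up for illustration, only the last line matters):

```python
class DDPPlugin:
    global_rank: int = 0  # set from the cluster environment once torch.distributed is up

    def set_ranks(self) -> None:  # hypothetical method name
        # propagate the real rank so that rank_zero_only() stops treating this process as rank 0
        rank_zero_only.rank = self.global_rank
```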
Unfortunately, it seems that the initialization done by the DDP plugin happens too late, I think because of commit da6dbc8:
- self.setup_trainer(model) gets called on line 467, effectively initializing the logger and creating the logger directory tree
- DDP initialization, and thus rank_zero_only.rank getting its correct value, only happens at line 477, when self.training_type_plugin.pre_training() is called.
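To make the consequence concrete, here is a small self-contained illustration; create_log_dir and the path are invented for the example, and the two trailing lines only mimic the ordering described above:

```python
import os
from pytorch_lightning.utilities.distributed import rank_zero_only

@rank_zero_only
def create_log_dir(path: str) -> None:
    os.makedirs(path)  # raises FileExistsError if another process already created it

# Ordering inside fit() as I understand it after commit da6dbc8 (schematic):
create_log_dir("lightning_logs/version_0")  # setup_trainer(): every process still sees rank 0
rank_zero_only.rank = 1                     # pre_training(): the real rank is set only now, too late
```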
To Reproduce
I have attached the code the user provided together the Slurm script: only_rank_zero.tar.gz.
I understand that you would prefer a BoringModel- and Colab-based reproducer, but I come from the HPC world and am not used to those. Let me know if I can help in any other way. I hope that my own digging into the code will help.
Environment (probably not relevant in this case)
- PyTorch Version: 1.7.1
- OS: Linux (Red Hat 8.1)
- How you installed PyTorch: conda. I tried the latest version of PyTorch-Lightning available on conda, but also tested installing the current master branch from source; the behavior is the same.
- Python version: 3.7.10
- CUDA/cuDNN version: 11.0.221/8.0.5
- GPU models and configuration: NVIDIA V100