
global process count incorrect with elastic, fault tolerant training #6853

@srib

Description


🐛 Bug

Problem

The total number of processes (world size) is set incorrectly.

Context

I am trying to run elastic training with torchelastic. I have tried with both gloo and nccl backends.

Error message

Error from the gloo backend:

Traceback (most recent call last):
  File "train_hydra.py", line 20, in hydra_main
    train(cfg)
  File "/bdata/bdata1/sribkain/learnseis/learnseis/training.py", line 39, in train
    t.fit(module, data_module)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 432, in init_process_group
    timeout=timeout)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 503, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/context.cc:27] rank < size. 13 vs 8

The NCCL backend fails with the error described in pytorch/pytorch#20313.
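The mismatch in "rank < size. 13 vs 8" suggests that the rank assigned by torchelastic and the world size computed by Lightning disagree. A minimal diagnostic sketch, assuming only the standard environment variables the elastic launcher exports on each worker (this is not code from the report):

# diagnose_env.py: print what the launcher exported for this worker, so it can be
# compared against the world size the Trainer derives from num_nodes and gpus.
import os

def report_dist_env():
    for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "GROUP_RANK", "MASTER_ADDR", "MASTER_PORT"):
        # .get() avoids a KeyError if the launcher did not set a given variable
        print(f"{key}={os.environ.get(key, '<unset>')}")

if __name__ == "__main__":
    report_dist_env()

Running this under the same torchelastic.distributed.launch command on each node shows the RANK/WORLD_SIZE values that init_process_group should be receiving.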

Please reproduce using the BoringModel

I am running the ImageNet example from PL using torchvision.models.resnet34. Happy to reproduce with the BoringModel if needed (a minimal sketch follows).
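A minimal script along the lines below should exhibit the same world-size computation; the module is a placeholder and the Trainer arguments (gpus, num_nodes, accelerator) are assumptions rather than the actual experiment config:

# boring_repro.py: trivial LightningModule trained with the DDP plugin,
# intended to be launched with torchelastic.distributed.launch.
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # dummy loss so the optimizer step runs
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(gpus=8, num_nodes=1, accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))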

Before launching, I have exported the variable GLOO_SOCKET_IFNAME and set it to the appropriate interface name.

On node 0:

PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml

On node 1:

PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml
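For comparison, a hedged sketch of initializing the process group directly from the launcher's environment variables, using plain torch.distributed (this is not what the Lightning DDP plugin does here, just a reference point):

# elastic_init.py: take rank and world size from the environment set by the
# elastic launcher instead of values derived from num_nodes * gpus.
import os
import torch.distributed as dist

def init_from_elastic_env(backend="gloo"):
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # init_method="env://" picks up MASTER_ADDR/MASTER_PORT from the environment
    dist.init_process_group(backend=backend, init_method="env://",
                            rank=rank, world_size=world_size)

With the elastic launcher, WORLD_SIZE reflects the actual number of workers in the rendezvous, which is what the failing init_ddp_connection call would need, rather than a count fixed at Trainer construction time.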

To Reproduce

See the launch commands above; I can post a BoringModel repro if needed.

Expected behavior

To be able to run distributed, fault-tolerant training :)

Environment


Output of collect_env_details.py:

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.2
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 1.2.6
        - tqdm:              4.48.2
        - torchelastic:      0.2.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.7.7
        - version:           #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020

Additional context


Labels

bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task), waiting on author (Waiting on user action, correction, or update)
