
global process count incorrect with elastic, fault tolerant training #6853

@srib

Description


🐛 Bug

Problem

The total number of processes (world size) is set incorrectly.

Context

I am trying to run elastic training with torchelastic. I have tried with both gloo and nccl backends.

Error message

Error from the gloo backend:

Traceback (most recent call last):
  File "train_hydra.py", line 20, in hydra_main
    train(cfg)
  File "/bdata/bdata1/sribkain/learnseis/learnseis/training.py", line 39, in train
    t.fit(module, data_module)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 432, in init_process_group
    timeout=timeout)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 503, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/context.cc:27] rank < size. 13 vs 8

The NCCL backend fails with the error described in pytorch/pytorch#20313.
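The mismatch in "rank < size. 13 vs 8" suggests that the rank assigned by torchelastic and the world size computed by Lightning disagree. A minimal diagnostic sketch, assuming only the standard environment variables the elastic launcher exports on each worker (this is not code from the report):

# diagnose_env.py: print what the launcher exported for this worker, so it can be
# compared against the world size the Trainer derives from num_nodes and gpus.
import os

def report_dist_env():
    for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "GROUP_RANK", "MASTER_ADDR", "MASTER_PORT"):
        # .get() avoids a KeyError if the launcher did not set a given variable
        print(f"{key}={os.environ.get(key, '<unset>')}")

if __name__ == "__main__":
    report_dist_env()

Running this under the same torchelastic.distributed.launch command on each node shows the RANK/WORLD_SIZE values that init_process_group should be receiving.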

Please reproduce using the BoringModel

I am running the ImageNet example from PL using torchvision.models.resnet34. Happy to reproduce with the BoringModel if needed (a minimal sketch follows).
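A minimal script along the lines below should exhibit the same world-size computation; the module is a placeholder and the Trainer arguments (gpus, num_nodes, accelerator) are assumptions rather than the actual experiment config:

# boring_repro.py: trivial LightningModule trained with the DDP plugin,
# intended to be launched with torchelastic.distributed.launch.
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # dummy loss so the optimizer step runs
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(gpus=8, num_nodes=1, accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))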

Before launching, I have exported the variable GLOO_SOCKET_IFNAME and set it to the appropriate interface name.

On node 0:

PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml

On node 1:

PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml
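For comparison, a hedged sketch of initializing the process group directly from the launcher's environment variables, using plain torch.distributed (this is not what the Lightning DDP plugin does here, just a reference point):

# elastic_init.py: take rank and world size from the environment set by the
# elastic launcher instead of values derived from num_nodes * gpus.
import os
import torch.distributed as dist

def init_from_elastic_env(backend="gloo"):
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # init_method="env://" picks up MASTER_ADDR/MASTER_PORT from the environment
    dist.init_process_group(backend=backend, init_method="env://",
                            rank=rank, world_size=world_size)

With the elastic launcher, WORLD_SIZE reflects the actual number of workers in the rendezvous, which is what the failing init_ddp_connection call would need, rather than a count fixed at Trainer construction time.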

To Reproduce

See the launch commands above; I can post a BoringModel repro if needed.

Expected behavior

To be able to run distributed, fault-tolerant training :)

Environment


Output of collect_env_details.py:

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.2
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 1.2.6
        - tqdm:              4.48.2
        - torchelastic:      0.2.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.7.7
        - version:           #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020

Additional context


Labels

bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task), waiting on author (Waiting on user action, correction, or update)
