🐛 Bug
Problem
The total number of processes (the world size) is being set incorrectly.
Context
I am trying to run elastic training with torchelastic. I have tried both the gloo and nccl backends.
Error message
Error from the gloo backend:
Traceback (most recent call last):
  File "train_hydra.py", line 20, in hydra_main
    train(cfg)
  File "/bdata/bdata1/sribkain/learnseis/learnseis/training.py", line 39, in train
    t.fit(module, data_module)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 432, in init_process_group
    timeout=timeout)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 503, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/context.cc:27] rank < size. 13 vs 8
The NCCL backend gives the error described in pytorch/pytorch#20313.
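It looks like the global rank assigned by the elastic agent (13) exceeds the world size that Lightning passes to init_process_group (8), which is consistent with the world size being derived statically (e.g. num_nodes * processes per node) rather than taken from the rendezvous. Below is a minimal sketch, not the Lightning implementation, of initializing the process group from the RANK/WORLD_SIZE environment variables that torchelastic's launcher exports; the helper name is illustrative.

# Illustrative sketch only: use the rank/world size exported by the
# torchelastic agent instead of a statically computed count.
import os
import torch.distributed as dist

def init_process_group_from_elastic_env(backend: str = "gloo") -> None:
    # torchelastic.distributed.launch exports RANK, WORLD_SIZE, LOCAL_RANK,
    # MASTER_ADDR and MASTER_PORT for every worker it spawns.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # With an elastic rendezvous (--nnodes=1:5) the effective world size is
    # decided at runtime, so it can differ from num_nodes * num_gpus.
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)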
Please reproduce using the BoringModel
I am running the ImageNet example from PL using torchvision.models.resnet34. Happy to reproduce with the BoringModel if needed.
Before launching, I have exported the variable GLOO_SOCKET_IFNAME and set it to the appropriate interface name.
On node 0:
PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml
On node 1:
PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml
To Reproduce
Use the following BoringModel and post here.
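In the meantime, here is a minimal sketch of what such a reproduction script could look like, following the structure of the bug_report_model.py template; the dataset, module, and Trainer arguments below are illustrative placeholders, not the exact template or my training code.

# Minimal BoringModel-style script that could be launched with the
# torchelastic commands above. Illustrative sketch only.
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(32, 640), batch_size=2)
    # gpus/num_nodes are placeholders; with an elastic rendezvous the actual
    # world size is decided at runtime, which is what this issue is about.
    trainer = Trainer(gpus=8, num_nodes=1, accelerator="ddp", max_epochs=1)
    trainer.fit(model, train_loader)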
Expected behavior
To be able to run distributed, fault-tolerant training :)
Environment
Note: Bugs with code are solved faster! Colab Notebook should be made public!
- IDE: Please, use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
Output of collect_env_details.py:
* CUDA:
- GPU:
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- GeForce RTX 2080 Ti
- available: True
- version: 10.2
* Packages:
- numpy: 1.19.2
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 1.2.6
- tqdm: 4.48.2
- torchelastic: 0.2.0
* System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.7
- version: #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020