RuntimeError: Address already in use on 'ddp' mode pl 0.8.0 #2081

@dvirginz

Trainer configuration:

    import pytorch_lightning as pl
    from pytorch_lightning.loggers import CometLogger

    trainer = pl.Trainer(
        logger=CometLogger(api_key="ID"),
        auto_select_gpus=True,
        gpus=3,
        distributed_backend="ddp",
    )

The error:

GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0,1,2]
CometLogger will be initialized in online mode
CometLogger will be initialized in online mode
initializing ddp: LOCAL_RANK: 0/2 WORLD_SIZE:3
Traceback (most recent call last):
  File "train.py", line 156, in <module>
    main()
  File "train.py", line 64, in main
    main_train(model_class_pointer, hyperparams, logger)
  File "train.py", line 148, in main_train
    trainer.fit(model)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 884, in fit
    self.spawn_ddp_children(model)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 395, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size, self.is_slurm_managing_tasks)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 962, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=proc_rank, world_size=world_size)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
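
The last frames of the traceback show `_env_rendezvous_handler` creating a `TCPStore` on `master_port`, which PyTorch's `env://` rendezvous reads from the `MASTER_PORT` environment variable. The error means that port was already bound by another process when the store tried to bind it. Below is a minimal sketch for diagnosing this, using only the Python standard library. The helper names `is_port_in_use` and `find_free_port` are hypothetical and not part of PyTorch or Lightning. Whether Lightning 0.8.0 keeps a `MASTER_PORT` value set before `trainer.fit` is an assumption here, not something the report confirms.

```python
import os
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if (host, port) cannot be bound, i.e. another socket holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
        return False


def find_free_port() -> int:
    """Ask the OS for a TCP port that is unused at the time of the call."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Assumption: the training script honours a pre-set MASTER_PORT.
    current = os.environ.get("MASTER_PORT")
    if current is None or is_port_in_use(int(current)):
        os.environ["MASTER_PORT"] = str(find_free_port())
    print("MASTER_PORT =", os.environ["MASTER_PORT"])
```

There is a small race between choosing the port and the training process binding it, so this is a diagnostic aid rather than a guaranteed fix. A leftover process from an earlier DDP run that still holds the port is another possible cause worth checking.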

Env

* CUDA:
        - available:         True
        - version:           10.1
* Packages:
        - numpy:             1.18.4
        - pyTorch_debug:     False
        - pyTorch_version:   1.5.0
        - pytorch-lightning: 0.8.0-dev
        - tensorboard:       2.1.0
        - tqdm:              4.46.0
* System:
        - OS:                Linux
        - architecture:      64bit
        - processor:         x86_64
        - python:            3.7.7
        - version:           #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020
