Closed
Labels: help wanted
Description
Trainer configuration:
trainer = pl.Trainer(
    logger=CometLogger(api_key="ID"),
    auto_select_gpus=True,
    gpus=3,
    distributed_backend="ddp",
)
The error:
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0,1,2]
CometLogger will be initialized in online mode
CometLogger will be initialized in online mode
initializing ddp: LOCAL_RANK: 0/2 WORLD_SIZE:3
Traceback (most recent call last):
  File "train.py", line 156, in <module>
    main()
  File "train.py", line 64, in main
    main_train(model_class_pointer, hyperparams, logger)
  File "train.py", line 148, in main_train
    trainer.fit(model)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 884, in fit
    self.spawn_ddp_children(model)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 395, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size, self.is_slurm_managing_tasks)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 962, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=proc_rank, world_size=world_size)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/user/anaconda3/envs/docBert/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
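A possible workaround sketch, not confirmed as the fix for this report: the traceback ends in `_env_rendezvous_handler`, which reads the rendezvous address and port from the `MASTER_ADDR` and `MASTER_PORT` environment variables. If that port is already held by another process (for example, a previous run that did not exit cleanly), binding the `TCPStore` can fail with this error. The helper names `find_free_port` and `set_rendezvous_port_if_unset` below are hypothetical and not part of PyTorch or PyTorch Lightning. The sketch assumes the helper is called at the top of the training script, before the `Trainer` is created.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port on localhost that is unused right now.

    Note: the port is released when the socket closes, so another process
    could still grab it before training starts (a small race window).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def set_rendezvous_port_if_unset(env=os.environ) -> str:
    """Set MASTER_PORT only when it is not already defined (hypothetical helper).

    The guard matters when worker processes inherit the parent's environment:
    every process must agree on one port, so only the first process to run
    this picks a value and the others reuse it.
    """
    if "MASTER_PORT" not in env:
        env["MASTER_PORT"] = str(find_free_port())
    return env["MASTER_PORT"]
```

Calling `set_rendezvous_port_if_unset()` before building the `Trainer` would leave an explicitly exported `MASTER_PORT` untouched, so the port can still be pinned from the shell when needed.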
Env
* CUDA:
    - available: True
    - version: 10.1
* Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.8.0-dev
    - tensorboard: 2.1.0
    - tqdm: 4.46.0
* System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.7
    - version: #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020