Bug description
A torch distributed on-prem run fails. On the master node the environment was set to:
2024-08-20 18:18:44 WORLD_SIZE:8 NODE_RANK:0 MASTER_ADDR:192.168.1.1 MASTER_PORT:9001
and the client node saw the same master IP in its environment, along with its own node rank:
2024-08-20 18:18:WORLD_SIZE:8 NODE_RANK:1 MASTER_ADDR:92.168.1.1 MASTER_PORT:9001
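For context, a minimal sketch of how these variables are presumably exported on each node before launch, assuming two nodes with four GPUs each (WORLD_SIZE = 2 * 4 = 8); the actual launch script is not part of this report and the node_rank value below is illustrative:

import os

# Assumed setup: two nodes, four GPUs each, so WORLD_SIZE = 2 * 4 = 8.
# node_rank is 0 on the master and 1 on the client (illustrative value,
# not taken from the report's launch script).
node_rank = 0

os.environ["MASTER_ADDR"] = "192.168.1.1"
os.environ["MASTER_PORT"] = "9001"
os.environ["WORLD_SIZE"] = "8"            # total processes across both nodes
os.environ["NODE_RANK"] = str(node_rank)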
SDK versions:
torch==2.4.0
pytorch-lightning==1.9.5
What version are you seeing the problem on?
v1.x
How to reproduce the bug
import psutil
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

hparams = {"learning_rate": float(learning_rate),
           "margin": float(margin),
           "strategy": "batch all",
           "batch_size": int(batch_size) // 2,
           "epochs": int(epochs)}
train_dataset = TensorDataset(torch.tensor(preprocess.pairs))
trainloader = DataLoader(train_dataset,
                         batch_size=hparams["batch_size"],
                         collate_fn=my_collate,
                         drop_last=False,
                         shuffle=True,
                         num_workers=psutil.cpu_count(),
                         pin_memory=True)
if train_strategy == 'ddp':
    train_strategy = "ddp_find_unused_parameters_false"
# Initialize a trainer
trainer = pl.Trainer(logger=logger,
                     callbacks=[checkpoint_callback],
                     max_epochs=hparams["epochs"],
                     devices=devices,
                     accelerator=accelerator,
                     strategy=train_strategy)
# Train the model
log_message(f'training on {accelerator} with {num_of_gpus} gpu')
trainer.fit(model, trainloader)
Error messages and logs
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/4
2024-08-20 17:30:09 training on gpu with 4 gpu
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/4
2024-08-20 17:30:09 training on gpu with 4 gpu
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/4
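The "MEMBER: 7/4" lines indicate that global ranks up to 7 are being registered in a process group whose computed world size is only 4. A minimal sketch of that arithmetic, assuming the Trainer above runs with devices=4 and the default num_nodes=1 while NODE_RANK=1 is exported on the client node (the num_nodes value is an assumption, not shown in the snippet):

# Sketch of the rank arithmetic behind "GLOBAL_RANK: 6, MEMBER: 7/4".
# Assumptions: devices=4 per node, Trainer left at its default num_nodes=1,
# NODE_RANK=1 exported on the client node.
devices = 4
num_nodes_configured = 1                      # Trainer default if num_nodes is not passed
node_rank = 1                                 # NODE_RANK on the client machine

world_size = num_nodes_configured * devices   # 4, not the 8 set in WORLD_SIZE
for local_rank in range(devices):
    global_rank = node_rank * devices + local_rank   # 4, 5, 6, 7
    print(f"GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}")
# Every rank on the second node falls outside the 4-process group, which is what
# NCCL then rejects with "Invalid rank requested". With num_nodes=2 the computed
# world size would be 8 and ranks 4-7 would fit.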
Later, the run fails with the following NCCL errors (the "dlerror=libnccl-net.so ... No plugin found" lines are informational; the failure is the "Invalid rank requested" warnings):
[1] NCCL INFO cudaDriverVersion 12020
[2] NCCL INFO cudaDriverVersion 12020
[3] NCCL INFO cudaDriverVersion 12020
[2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
[1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
[3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
[2] init.cc:1726 NCCL WARN Invalid rank requested : 6/4
[1] init.cc:1726 NCCL WARN Invalid rank requested : 5/4
[3] init.cc:1726 NCCL WARN Invalid rank requested : 7/4
[2] NCCL INFO init.cc:1872 -> 4
[1] NCCL INFO init.cc:1872 -> 4
[3] NCCL INFO init.cc:1872 -> 4
[1] NCCL INFO init.cc:1876 -> 4
[3] NCCL INFO init.cc:1876 -> 4
[2] NCCL INFO init.cc:1876 -> 4
[0] NCCL INFO cudaDriverVersion 12020
[0] NCCL INFO Bootstrap : Using eth0:10.0.146.34<0>
[0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
[0] init.cc:1726 NCCL WARN Invalid rank requested : 4/4
[0] NCCL INFO init.cc:1872 -> 4
[0] NCCL INFO init.cc:1876 -> 4
2024-08-20 17:30:09 NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid argument (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
ncclInvalidArgument: Invalid value for an argument.
Stacktrace:
Last error:
Invalid rank requested : 7/4
Traceback (most recent call last):
File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run
self.__setup_profiler()
File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1509, in __setup_profiler
self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1826, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 314, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2205, in broadcast
work = default_pg.broadcast([tensor], opts)
Environment
Current environment
- PyTorch Lightning Version: 1.9.5
- PyTorch Version: 2.4.0
- Python version: 3.8.10
- OS: Ubuntu
- CUDA/cuDNN version: 12.2
- GPU models and configuration: 4x NVIDIA A10G
- How you installed Lightning: pip
More info
No response