
NCCL error: Invalid rank requested #20219

@loretoparisi

Bug description

Torch distributed on-prem run fails. The environment on the master node was set to

2024-08-20 18:18:44 WORLD_SIZE:8 NODE_RANK:0 MASTER_ADDR:192.168.1.1 MASTER_PORT:9001

and on the worker node the same master IP was set, together with its own node rank:

2024-08-20 18:18:WORLD_SIZE:8 NODE_RANK:1 MASTER_ADDR:92.168.1.1 MASTER_PORT:9001
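
For context, in a multi-node DDP launch the global rank of each process is node_rank * devices_per_node + local_rank, and the world size must equal num_nodes * devices_per_node. Below is a minimal sketch of that arithmetic; the two-node, four-GPU-per-node layout is an assumption inferred from WORLD_SIZE=8 and the 4x A10G environment reported later:

# Sketch of the rank arithmetic expected by DDP/NCCL.
# Assumed layout: 2 nodes x 4 GPUs each = WORLD_SIZE 8 (inferred, not confirmed).
num_nodes = 2
devices_per_node = 4

world_size = num_nodes * devices_per_node  # must match WORLD_SIZE (8)

for node_rank in range(num_nodes):
    for local_rank in range(devices_per_node):
        global_rank = node_rank * devices_per_node + local_rank
        # NCCL requires 0 <= global_rank < world_size; otherwise it reports
        # "Invalid rank requested : <rank>/<size>" as in the logs below.
        assert 0 <= global_rank < world_size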

SDK versions:

torch==2.4.0
pytorch-lightning==1.9.5

What version are you seeing the problem on?

v1.x

How to reproduce the bug

import psutil
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

hparams = {"learning_rate": float(learning_rate),
           "margin": float(margin),
           "strategy": "batch all",
           "batch_size": int(int(batch_size) // 2),
           "epochs": int(epochs)}

train_dataset = TensorDataset(torch.tensor(preprocess.pairs))
trainloader = DataLoader(train_dataset,
                         batch_size=hparams["batch_size"],
                         collate_fn=my_collate,
                         drop_last=False,
                         shuffle=True,
                         num_workers=psutil.cpu_count(),
                         pin_memory=True)

if train_strategy == 'ddp':
    train_strategy = "ddp_find_unused_parameters_false"

# Initialize a trainer
trainer = pl.Trainer(logger=logger,
                     callbacks=[checkpoint_callback],
                     max_epochs=hparams["epochs"],
                     devices=devices,
                     accelerator=accelerator,
                     strategy=train_strategy)

# Train the model
log_message(f'training on {accelerator} with {num_of_gpus} gpu')
trainer.fit(model, trainloader)
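
For comparison, a hedged sketch of the same Trainer call with num_nodes passed explicitly; num_nodes=2 is an assumption based on WORLD_SIZE=8 with 4 GPUs per node. Lightning derives the distributed world size as num_nodes * devices, so when num_nodes is left at its default of 1 each process group only expects 4 ranks while ranks 4-7 still try to join:

import pytorch_lightning as pl

# Hypothetical multi-node configuration; num_nodes=2 is an assumption
# (WORLD_SIZE=8, 4 GPUs per node). Lightning computes
# world_size = num_nodes * devices, so omitting num_nodes leaves it at 4.
trainer = pl.Trainer(logger=logger,
                     callbacks=[checkpoint_callback],
                     max_epochs=hparams["epochs"],
                     num_nodes=2,              # assumption, not in the original script
                     devices=4,
                     accelerator="gpu",
                     strategy="ddp_find_unused_parameters_false")
trainer.fit(model, trainloader)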

Error messages and logs

Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/4
2024-08-20 17:30:09 training on gpu with 4 gpu
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/4
2024-08-20 17:30:09 training on gpu with 4 gpu
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/4

and later the dlerror=libnccl-net.so: cannot open shared object file warning ("No plugin found (libnccl-net.so), using internal implementation"), followed by the invalid-rank error:

[1] NCCL INFO cudaDriverVersion 12020
[2] NCCL INFO cudaDriverVersion 12020
 [3] NCCL INFO cudaDriverVersion 12020
 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation

 [2] init.cc:1726 NCCL WARN Invalid rank requested : 6/4
 [1] init.cc:1726 NCCL WARN Invalid rank requested : 5/4

 [3] init.cc:1726 NCCL WARN Invalid rank requested : 7/4
[2] NCCL INFO init.cc:1872 -> 4
[1] NCCL INFO init.cc:1872 -> 4
 [3] NCCL INFO init.cc:1872 -> 4
 [1] NCCL INFO init.cc:1876 -> 4
 [3] NCCL INFO init.cc:1876 -> 4
 [2] NCCL INFO init.cc:1876 -> 4
 [0] NCCL INFO cudaDriverVersion 12020
 [0] NCCL INFO Bootstrap : Using eth0:10.0.146.34<0>
 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation

 [0] init.cc:1726 NCCL WARN Invalid rank requested : 4/4
 [0] NCCL INFO init.cc:1872 -> 4
 [0] NCCL INFO init.cc:1876 -> 4
2024-08-20 17:30:09 NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid argument (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
ncclInvalidArgument: Invalid value for an argument.
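
The message above suggests re-running with NCCL_DEBUG=WARN. A minimal, purely illustrative sketch of raising the NCCL debug level and printing the rendezvous variables each process actually sees before trainer.fit is called:

import os

# Raise NCCL verbosity as the error message suggests (WARN or INFO).
os.environ["NCCL_DEBUG"] = "INFO"

# Print the distributed environment each process sees; the invalid-rank
# warnings ("6/4", "7/4") mean the requested rank exceeds the communicator size.
for key in ("WORLD_SIZE", "NODE_RANK", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(key, os.environ.get(key))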

Stacktrace:

Last error:
Invalid rank requested : 7/4. traceback: Traceback (most recent call last):
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run
    self.__setup_profiler()
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1509, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1826, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 314, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2205, in broadcast
    work = default_pg.broadcast([tensor], opts)

Environment

Current environment

- PyTorch Lightning Version: 1.9.5
- PyTorch Version: 2.4.0
- Python version: 3.8.10
- OS: Ubuntu
- CUDA/cuDNN version: 12.2
- GPU models and configuration: 4x NVIDIA A10G
- How you installed Lightning (conda, pip, source): pip

More info

No response

Labels

bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)
