
NCCL error: Invalid rank requested #20219

@loretoparisi

Bug description

Torch distributed on-prem run fails. The environment on the master node was set to

2024-08-20 18:18:44 WORLD_SIZE:8 NODE_RANK:0 MASTER_ADDR:192.168.1.1 MASTER_PORT:9001

and on the worker node the same master IP was set, together with its own node rank:

2024-08-20 18:18:WORLD_SIZE:8 NODE_RANK:1 MASTER_ADDR:92.168.1.1 MASTER_PORT:9001
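
For context, in a multi-node DDP launch the global rank of each process is node_rank * devices_per_node + local_rank, and the world size must equal num_nodes * devices_per_node. Below is a minimal sketch of that arithmetic; the two-node, four-GPU-per-node layout is an assumption inferred from WORLD_SIZE=8 and the 4x A10G environment reported later:

# Sketch of the rank arithmetic expected by DDP/NCCL.
# Assumed layout: 2 nodes x 4 GPUs each = WORLD_SIZE 8 (inferred, not confirmed).
num_nodes = 2
devices_per_node = 4

world_size = num_nodes * devices_per_node  # must match WORLD_SIZE (8)

for node_rank in range(num_nodes):
    for local_rank in range(devices_per_node):
        global_rank = node_rank * devices_per_node + local_rank
        # NCCL requires 0 <= global_rank < world_size; otherwise it reports
        # "Invalid rank requested : <rank>/<size>" as in the logs below.
        assert 0 <= global_rank < world_size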

SDK versions:

torch==2.4.0
pytorch-lightning==1.9.5

What version are you seeing the problem on?

v1.x

How to reproduce the bug

import psutil
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

hparams = {"learning_rate": float(learning_rate),
           "margin": float(margin),
           "strategy": "batch all",
           "batch_size": int(int(batch_size) // 2),
           "epochs": int(epochs)}

train_dataset = TensorDataset(torch.tensor(preprocess.pairs))
trainloader = DataLoader(train_dataset,
                         batch_size=hparams["batch_size"],
                         collate_fn=my_collate,
                         drop_last=False,
                         shuffle=True,
                         num_workers=psutil.cpu_count(),
                         pin_memory=True)

if train_strategy == 'ddp':
    train_strategy = "ddp_find_unused_parameters_false"

# Initialize a trainer
trainer = pl.Trainer(logger=logger,
                     callbacks=[checkpoint_callback],
                     max_epochs=hparams["epochs"],
                     devices=devices,
                     accelerator=accelerator,
                     strategy=train_strategy)

# Train the model
log_message(f'training on {accelerator} with {num_of_gpus} gpu')
trainer.fit(model, trainloader)
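
For comparison, a hedged sketch of the same Trainer call with num_nodes passed explicitly; num_nodes=2 is an assumption based on WORLD_SIZE=8 with 4 GPUs per node. Lightning derives the distributed world size as num_nodes * devices, so when num_nodes is left at its default of 1 each process group only expects 4 ranks while ranks 4-7 still try to join:

import pytorch_lightning as pl

# Hypothetical multi-node configuration; num_nodes=2 is an assumption
# (WORLD_SIZE=8, 4 GPUs per node). Lightning computes
# world_size = num_nodes * devices, so omitting num_nodes leaves it at 4.
trainer = pl.Trainer(logger=logger,
                     callbacks=[checkpoint_callback],
                     max_epochs=hparams["epochs"],
                     num_nodes=2,              # assumption, not in the original script
                     devices=4,
                     accelerator="gpu",
                     strategy="ddp_find_unused_parameters_false")
trainer.fit(model, trainloader)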

Error messages and logs

Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/4
2024-08-20 17:30:09 training on gpu with 4 gpu
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/4
2024-08-20 17:30:09 training on gpu with 4 gpu
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/4

and later the dlerror=libnccl-net.so: cannot open shared object file warning ("No plugin found (libnccl-net.so), using internal implementation"), followed by the invalid-rank error:

[1] NCCL INFO cudaDriverVersion 12020
[2] NCCL INFO cudaDriverVersion 12020
 [3] NCCL INFO cudaDriverVersion 12020
 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation

 [2] init.cc:1726 NCCL WARN Invalid rank requested : 6/4
 [1] init.cc:1726 NCCL WARN Invalid rank requested : 5/4

 [3] init.cc:1726 NCCL WARN Invalid rank requested : 7/4
[2] NCCL INFO init.cc:1872 -> 4
[1] NCCL INFO init.cc:1872 -> 4
 [3] NCCL INFO init.cc:1872 -> 4
 [1] NCCL INFO init.cc:1876 -> 4
 [3] NCCL INFO init.cc:1876 -> 4
 [2] NCCL INFO init.cc:1876 -> 4
 [0] NCCL INFO cudaDriverVersion 12020
 [0] NCCL INFO Bootstrap : Using eth0:10.0.146.34<0>
 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation

 [0] init.cc:1726 NCCL WARN Invalid rank requested : 4/4
 [0] NCCL INFO init.cc:1872 -> 4
 [0] NCCL INFO init.cc:1876 -> 4
2024-08-20 17:30:09 NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid argument (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
ncclInvalidArgument: Invalid value for an argument.
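
The message above suggests re-running with NCCL_DEBUG=WARN. A minimal, purely illustrative sketch of raising the NCCL debug level and printing the rendezvous variables each process actually sees before trainer.fit is called:

import os

# Raise NCCL verbosity as the error message suggests (WARN or INFO).
os.environ["NCCL_DEBUG"] = "INFO"

# Print the distributed environment each process sees; the invalid-rank
# warnings ("6/4", "7/4") mean the requested rank exceeds the communicator size.
for key in ("WORLD_SIZE", "NODE_RANK", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(key, os.environ.get(key))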

Stacktrace:

Last error:
Invalid rank requested : 7/4. traceback: Traceback (most recent call last):
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run
    self.__setup_profiler()
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1509, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1826, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/coder/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 314, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/home/coder/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2205, in broadcast
    work = default_pg.broadcast([tensor], opts)

Environment

Current environment

- PyTorch Lightning Version: 1.9.5
- PyTorch Version: 2.4.0
- Python version: 3.8.10
- OS: Ubuntu
- CUDA/cuDNN version: 12.2
- GPU models and configuration: 4x NVIDIA A10G
- How you installed Lightning (conda, pip, source): pip

More info

No response

Labels

bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)
