
SLURM mpirun/openmpi training selects horovod automatically when not requested #8573

@marcelschilling

Description

🐛 Bug

I tried to run a hyperparameter search with wandb's sweep function on a SLURM cluster via mpirun (Open MPI 4.0).
However, Horovod is always selected as the distributed backend during training, even though the trainer is configured with accelerator=None/distributed_backend=None/gpus=1.

Error:
'Requested accelerator="horovod", but Horovod is not installed.'
"Install with \n $HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]"
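The likely cause: Lightning's accelerator selection falls back to Horovod when it sees the environment variables that Open MPI (or horovodrun) sets for every launched process, regardless of the requested accelerator. A minimal sketch of that detection, assuming the check is based on `OMPI_COMM_WORLD_RANK`/`HOROVOD_RANK` as in the 1.2.x accelerator connector (the exact variable names and location in the codebase are my assumption, not confirmed against the source):

```python
import os

def looks_like_horovod_run() -> bool:
    # Open MPI exports OMPI_COMM_WORLD_RANK to every process it launches,
    # and horovodrun exports HOROVOD_RANK; Lightning appears to treat either
    # as a signal that a Horovod job is running (sketch of the 1.2.x check).
    return "OMPI_COMM_WORLD_RANK" in os.environ or "HOROVOD_RANK" in os.environ
```

Under `mpirun ... bash -c '... wandb agent ...'` the `OMPI_*` variables are inherited by the training script, so a check like this returns True even though no Horovod backend was requested.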

I use mpirun like this:

mpirun \
  --display-map \
  --display-allocation \
  --map-by ppr:2:socket:pe=19 \
  bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; wandb agent SWEEPID'

I also tried installing Horovod with HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch], but then training never starts and hangs in an endless loop (with no error).

Using

CUDA_VISIBLE_DEVICES=0 wandb agent $SWEEPID &
CUDA_VISIBLE_DEVICES=1 wandb agent $SWEEPID &
CUDA_VISIBLE_DEVICES=2 wandb agent $SWEEPID &
CUDA_VISIBLE_DEVICES=3 wandb agent $SWEEPID

works fine.
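If the auto-detection described above is indeed the trigger, one possible workaround (untested, and assuming no other part of the job needs these variables) is to drop Open MPI's rank variables inside the training script before the Trainer is constructed; `CUDA_VISIBLE_DEVICES` set by the mpirun wrapper is left untouched:

```python
import os

def scrub_mpi_env() -> None:
    # Remove Open MPI / PMIx launcher variables so Lightning's Horovod
    # detection never fires; everything else (e.g. CUDA_VISIBLE_DEVICES)
    # is kept as set by the mpirun wrapper.
    for var in list(os.environ):
        if var.startswith("OMPI_") or var.startswith("PMIX_"):
            os.environ.pop(var)

scrub_mpi_env()  # call before constructing the Lightning Trainer
```

Whether removing these variables has side effects on the wandb agent itself is something I have not verified.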

Environment

  • PyTorch Lightning Version: 1.2.10
  • PyTorch Version: 1.8
  • Python version: 3.8.0
  • OS (e.g., Linux): Linux on SLURM cluster
  • Open MPI version: 4.0
  • CUDA/cuDNN version: 11.2
  • GPUs: one node with 4 Tesla A100
  • How you installed PyTorch (conda, pip, source): conda

Do you have any idea why Horovod is implicitly selected instead of running four "normal" trainings in parallel with different parameter settings?

Thanks in advance!

Best,
Marcel

Labels: bug, help wanted
