Description
🐛 Bug
I tried to run a hyperparameter search with the wandb sweep function on a SLURM cluster, launching the agents via mpirun (Open MPI 4.0).
However, during training Horovod is always selected as the distributed backend, even though the Trainer is configured with accelerator=None / distributed_backend=None / gpus=1.
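For context, a minimal sketch of the Trainer setup described above (the actual training script is not part of this issue, so the LightningModule and the remaining arguments are placeholders):
import pytorch_lightning as pl

# Single-GPU run with no distributed backend requested
# (placeholder sketch; the real LightningModule is not shown here).
trainer = pl.Trainer(
    gpus=1,
    accelerator=None,
    distributed_backend=None,
)
# trainer.fit(model)  # model: the LightningModule used in the sweep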
Error:
Requested accelerator="horovod", but Horovod is not installed. Install with
$HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]
I launch the agents with mpirun like this:
mpirun \
  --display-map \
  --display-allocation \
  --map-by ppr:2:socket:pe=19 \
  bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; wandb agent SWEEPID'
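A guess at the cause (not verified): Lightning 1.2.x seems to fall back to Horovod whenever it sees the rank environment variables that mpirun exports (e.g. OMPI_COMM_WORLD_RANK), i.e. it treats the process as if it had been launched by horovodrun. If that is what happens here, a possible workaround is to drop those variables at the top of the training script, before the Trainer is created (hypothetical sketch, untested):
import os

# Hypothetical workaround: CUDA_VISIBLE_DEVICES is already pinned by the bash
# wrapper above; additionally remove the MPI/Horovod rank variables so that
# Lightning's backend auto-detection does not see them.
for var in ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE", "HOROVOD_RANK"):
    os.environ.pop(var, None)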
I also tried installing Horovod with HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch], but in that case training never starts and hangs in an endless loop (no error occurs).
Using
CUDA_VISIBLE_DEVICES=0 wandb agent $SWEEPID &
CUDA_VISIBLE_DEVICES=1 wandb agent $SWEEPID &
CUDA_VISIBLE_DEVICES=2 wandb agent $SWEEPID &
CUDA_VISIBLE_DEVICES=3 wandb agent $SWEEPID
works fine.
Environment
- PyTorch Lightning Version: 1.2.10
- PyTorch Version: 1.8
- Python version: 3.8.0
- OS (e.g., Linux): Linux on SLURM cluster
- Open MPI version: 4.0
- CUDA/cuDNN version: 11.2
- GPUs: one node with 4 Tesla A100
- How you installed PyTorch (conda, pip, source): conda
Do you have any idea why Horovod is implicitly selected instead of running 4 "normal" trainings in parallel with different parameter settings?
Thanks in advance!
Best,
Marcel