Description
🐛 Bug
When I run the example code with the ddp_cpu accelerator on a SLURM-based cluster, I get this error:
Traceback (most recent call last):
  File "image_classifier.py", line 99, in <module>
    cli_main()
  File "image_classifier.py", line 87, in cli_main
    trainer.fit(model, datamodule=dm)
  File "/pylon5/cis200022p/balu/softwares/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 472, in fit
    results = self.accelerator_backend.train()
  File "/pylon5/cis200022p/balu/softwares/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_hpc_accelerator.py", line 64, in train
    self.ddp_train(process_idx=self.task_idx, model=model)
  File "/pylon5/cis200022p/balu/softwares/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_hpc_accelerator.py", line 172, in ddp_train
    self.model_to_device(model)
TypeError: model_to_device() missing 1 required positional argument: 'process_idx'
Looking at the source, the model_to_device function requires process_idx as an argument, but the call in ddp_train (ddp_hpc_accelerator.py, line 172) does not pass it.
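
To make the failure mode concrete, here is a minimal sketch that reproduces the same kind of TypeError outside of Lightning. The class `SketchAccelerator` and the methods `ddp_train_broken` and `ddp_train_fixed` are hypothetical names made up for illustration; they are not the library's code. Forwarding `process_idx` is only one plausible fix, and this sketch assumes nothing about how the library actually resolves it.

```python
class SketchAccelerator:
    """Hypothetical stand-in for illustration; not the pytorch-lightning class."""

    def __init__(self, task_idx):
        self.task_idx = task_idx

    def model_to_device(self, model, process_idx):
        # Requires process_idx, like the signature named in the traceback.
        return f"model {model!r} placed for process {process_idx}"

    def ddp_train_broken(self, process_idx, model):
        # Mirrors the failing call: process_idx is received but not forwarded.
        return self.model_to_device(model)

    def ddp_train_fixed(self, process_idx, model):
        # Assumed fix: forward process_idx to model_to_device.
        return self.model_to_device(model, process_idx)


acc = SketchAccelerator(task_idx=0)

error_message = None
try:
    acc.ddp_train_broken(process_idx=acc.task_idx, model="net")
except TypeError as err:
    error_message = str(err)
    print(error_message)

fixed_result = acc.ddp_train_fixed(process_idx=acc.task_idx, model="net")
print(fixed_result)
```

The broken variant fails before any training starts, which matches the traceback ending at the model_to_device call.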
Please reproduce using the BoringModel
I used this code:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/simple_image_classifier.py
Along with this slurm job script:
> #!/bin/bash
> #SBATCH --job-name='pl_dist'
> #SBATCH --nodes=2
> #SBATCH -p RM
> #SBATCH --ntasks-per-node=1
> #SBATCH -t 1:00:00
>
> module load anaconda3
> source activate /pylon5/softwares/pytorch
>
> export NCCL_DEBUG=INFO
> export PYTHONFAULTHANDLER=1
>
> srun -n 2 --ntasks-per-node 1 python image_classifier.py --accelerator 'ddp_cpu' --num_nodes 2 --num_processes 1 --max_epochs 50
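
When debugging a job script like the one above, it can help to confirm what each `srun` task sees before any training code runs. The sketch below reads the standard SLURM environment variables `SLURM_NODEID`, `SLURM_PROCID`, `SLURM_LOCALID`, and `SLURM_NTASKS`. The helper name `slurm_task_info` and the dictionary keys are hypothetical, and how Lightning itself maps these values to its task index is not shown here.

```python
import os


def slurm_task_info(env=None):
    """Collect SLURM task identifiers from an environment mapping.

    Hypothetical helper for diagnostics; values are None outside a SLURM job.
    """
    env = os.environ if env is None else env

    def as_int(name):
        value = env.get(name)
        return int(value) if value is not None else None

    return {
        "node_id": as_int("SLURM_NODEID"),      # index of the node within the job
        "global_rank": as_int("SLURM_PROCID"),  # task index across all nodes
        "local_rank": as_int("SLURM_LOCALID"),  # task index within this node
        "num_tasks": as_int("SLURM_NTASKS"),    # total number of tasks
    }


if __name__ == "__main__":
    print(slurm_task_info())
```

Running this script with the same `srun` options as the training command should print one dictionary per task, which makes it easy to check that two tasks were launched on two nodes.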
Environment
- CUDA:
  - GPU:
  - available: False
  - version: 10.2
- Packages:
  - numpy: 1.19.2
  - pyTorch_debug: False
  - pyTorch_version: 1.7.1
  - pytorch-lightning: 1.1.3
  - tqdm: 4.56.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.5
  - version: #1 SMP Mon Jul 29 17:46:05 UTC 2019