Description
🐛 Bug
I'm running against PyTorch 1.7.1, and I was getting an "invalid device ordinal" error when running multi-node DDP training. After some digging, it looks like PyTorch was using the global rank as the device ordinal when moving a tensor to a GPU. That relationship only holds in single-node training, hence the invalid device ordinal in the multi-node case.
After some more digging, I realized Lightning handles this error by overriding the PyTorch function, but only for versions older than 1.7. Unfortunately, the problem persists in all of 1.7.x.
If I hack things to always use the override (i.e. check `_TORCH_GREATER_EQUAL_1_8` instead of `_TORCH_GREATER_EQUAL_1_7`), I don't get the error anymore.
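To make the failure mode concrete, here is a minimal sketch (illustrative only, not Lightning's or PyTorch's actual code) of why a global rank is not a valid CUDA device ordinal once you have more than one node; the node/GPU counts below are hypothetical:

```python
# Illustrative sketch: global rank vs. local device ordinal in multi-node DDP.
NUM_NODES = 2
GPUS_PER_NODE = 4  # each node only exposes CUDA device ordinals 0..3

def device_ordinal_naive(global_rank):
    # Buggy mapping: uses the global rank directly as the device ordinal.
    # Global ranks run 0..7 here, but only ordinals 0..3 exist on any node.
    return global_rank

def device_ordinal_correct(global_rank):
    # Correct mapping: the rank local to the node is a valid ordinal.
    return global_rank % GPUS_PER_NODE

for rank in range(NUM_NODES * GPUS_PER_NODE):
    local = device_ordinal_correct(rank)
    assert 0 <= local < GPUS_PER_NODE
    if rank >= GPUS_PER_NODE:
        # Ranks on the second node would trigger "invalid device ordinal"
        # with the naive mapping, e.g. cuda:5 on a 4-GPU node.
        assert device_ordinal_naive(rank) >= GPUS_PER_NODE
```

On a single node the two mappings coincide, which is why the bug only surfaces in multi-node runs.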
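A hedged sketch of the version-guard logic described above (not Lightning's actual implementation; the function names here are made up, and versions are compared with a simple stdlib parser for illustration):

```python
def version_tuple(v):
    # Parse "1.7.1" -> (1, 7, 1), ignoring a local suffix like "+cu102".
    return tuple(int(p) for p in v.split("+")[0].split("."))

def needs_override_buggy(torch_version):
    # Current behavior: apply the override only for PyTorch < 1.7,
    # so 1.7.x silently falls through to the broken upstream code.
    return version_tuple(torch_version) < (1, 7)

def needs_override_fixed(torch_version):
    # Proposed behavior: keep the override for everything < 1.8,
    # since the upstream fix is not present in any 1.7.x release.
    return version_tuple(torch_version) < (1, 8)

assert not needs_override_buggy("1.7.1")  # override skipped -> bug appears
assert needs_override_fixed("1.7.1")      # override kept -> bug avoided
```

In other words, switching the guard from `_TORCH_GREATER_EQUAL_1_7` to `_TORCH_GREATER_EQUAL_1_8` is equivalent to moving from the first predicate to the second.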
Please reproduce using the BoringModel
This doesn't have anything to do with a model, but if you want to see some code on how I'm calling PTL, here you go.
To Reproduce
Run this code with PyTorch 1.7.x
Expected behavior
No "invalid device ordinal" error.
Environment
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Red Hat Enterprise Linux Server, 7.6 (Maipo)
- How you installed PyTorch (`conda`, `pip`, source): cloned from IBM's open-ce Conda environment
- Build command you used (if compiling from source): N/A
- Python version: 3.8
- CUDA/cuDNN version: cudatoolkit-10.2.89, cudnn-7.6.5_10.2
- GPU models and configuration: V100
- Any other relevant information: running on PPC