Status: Closed
Labels: bug, help wanted, priority: 1 (medium priority task)
Description
🐛 Bug
Although the exact behavior depends on hardware conditions such as GPU specifications, RPC timeout errors can occur while building piped sequential models. In my case, the error occurred when building a 4B-parameter GPT-2 model as a sequential (pipeline) parallel model across 8 x P40 GPUs.
Exception has occurred: RuntimeError
RPCErr:1:RPC ran for more than 60000 milliseconds and timed out.
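For context, the 60000 ms limit appears to be the default RPC timeout of torch.distributed.rpc. A rough sketch of raising it when RPC is initialized directly (the worker name, rank, and world size below are illustrative; in practice Lightning sets up RPC internally) could look like this:

```python
import torch.distributed.rpc as rpc

# The default RPC timeout is 60 s, which matches the
# "RPC ran for more than 60000 milliseconds" error above.
options = rpc.TensorPipeRpcBackendOptions(rpc_timeout=60 * 5)  # 5 minutes

# Illustrative values; requires MASTER_ADDR / MASTER_PORT to be set.
rpc.init_rpc(
    name="worker0",
    rank=0,
    world_size=8,
    rpc_backend_options=options,
)
```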
It would be nice to have an rpc_timeout_sec parameter that controls the RPC timeout, for example:

DDPSequentialPlugin(balance=[5, 5, 5, 5, 5, 5, 5, 4], microbatches=8, rpc_timeout_sec=60 * 5)

After modifying the plugin code to extend the timeout in this way, training could proceed normally.
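A minimal sketch of how the proposed parameter could be used end to end is shown below. Note that rpc_timeout_sec is the parameter proposed in this issue and is not part of the released API, and the exact import path may differ between Lightning versions:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin

# rpc_timeout_sec is the proposed parameter, not yet available upstream.
plugin = DDPSequentialPlugin(
    balance=[5, 5, 5, 5, 5, 5, 5, 4],  # layer partition across the 8 GPUs
    microbatches=8,
    rpc_timeout_sec=60 * 5,            # raise the default 60 s RPC timeout to 5 minutes
)

trainer = Trainer(gpus=8, accelerator="ddp", plugins=[plugin])
trainer.fit(model)  # `model` is assumed to be a pipeline-compatible LightningModule defined elsewhere
```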
Please reproduce using the BoringModel
python train.py --gpus 8 --accelerator ddp .... --use_ddp_sequential
To Reproduce
Environment
- PyTorch Version (e.g., 1.0): 1.6.0
- OS (e.g., Linux): CentOS 7
- How you installed PyTorch (
conda,pip, source): source - Build command you used (if compiling from source): pip install -U .
- Python version: 3.7
- CUDA/cuDNN version: 10.1
- GPU models and configuration: P40
- Any other relevant information:
Additional context