
RPC timeout when model parallel with large model (> 4B) #5318


Description

@haven-jeon

🐛 Bug

Although the exact behavior depends on hardware conditions such as GPU specifications, RPC timeout errors can occur while building piped sequential models. In my case, the error occurred while turning a 4B-parameter GPT-2 model into a sequential (pipeline) parallel model on 8 x P40 GPUs.

Exception has occurred: RuntimeError
RPCErr:1:RPC ran for more than 60000 milliseconds and timed out.

It would be nice to have an rpc_timeout_sec parameter that controls the RPC timeout:

DDPSequentialPlugin(balance=[5, 5, 5, 5, 5, 5, 5, 4], microbatches=8, rpc_timeout_sec=60 * 5)
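
For context, this is how the proposed parameter would look in a full Trainer setup. This is a sketch only: rpc_timeout_sec is the parameter being requested and does not exist yet, and the import path is assumed from the Lightning pipe example.

    # Sketch of the proposed API (rpc_timeout_sec is not yet implemented);
    # the import path is assumed from the Lightning pipe example.
    import pytorch_lightning as pl
    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin

    trainer = pl.Trainer(
        gpus=8,
        accelerator="ddp",
        plugins=[
            DDPSequentialPlugin(
                balance=[5, 5, 5, 5, 5, 5, 5, 4],  # layers per GPU pipeline stage
                microbatches=8,
                rpc_timeout_sec=60 * 5,  # proposed: raise the 60 s RPC default
            )
        ],
    )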

After applying the change in the following commit, training proceeded normally:

haven-jeon@6e2205c
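
The linked commit is the authoritative change. As a rough sketch of the general approach, the timeout can be raised when the RPC framework is initialized, assuming the TensorPipe backend options available since PyTorch 1.6:

    import os
    import torch.distributed.rpc as rpc

    # Placeholders; in practice the launcher supplies these.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    # rpc_timeout is in seconds; the 60 s default is what produces the
    # "RPC ran for more than 60000 milliseconds" error above.
    options = rpc.TensorPipeRpcBackendOptions(rpc_timeout=60 * 5)

    rpc.init_rpc(
        name=f"worker_{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=options,
    )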

Please reproduce using the BoringModel

python train.py --gpus 8 --accelerator ddp  ....  --use_ddp_sequential
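
A minimal pipe-compatible model for the repro might look like the sketch below. The class name is hypothetical, the layer sizes are arbitrary, and the sequential_module attribute name is assumed from the Lightning pipe example, which exposes the layers to the plugin as a single nn.Sequential.

    import torch
    from torch import nn
    import pytorch_lightning as pl

    class PipeBoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # 39 layers in total, matching balance=[5, 5, 5, 5, 5, 5, 5, 4]
            self.sequential_module = nn.Sequential(
                *[nn.Linear(32, 32) for _ in range(39)]
            )

        def forward(self, x):
            return self.sequential_module(x)

        def training_step(self, batch, batch_idx):
            # Dummy scalar loss, BoringModel-style
            return self.forward(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)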


Environment

  • PyTorch version: 1.6.0
  • OS: CentOS 7
  • How you installed PyTorch: from source
  • Build command: pip install -U .
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 8 x P40


Metadata

Labels

bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)
