
Training stuck at 0% after a few epochs while training with DDP #5865

@HareshKarnan

Description

🐛 Bug

I recently updated to pytorch_lightning 1.1.7 and noticed that after a few epochs of training, the training progress bar is stuck at 0% and never advances. When I switch back to 1.1.4, this behavior does not occur. I do not know the root cause of this issue.

  • PyTorch Lightning version: 1.1.7
  • OS: Linux (Ubuntu 18.04)
  • How you installed PyTorch: pip
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration: RTX 2080 x3
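
The original training script is not included in the report. For reference, a minimal hypothetical sketch of the kind of setup affected is shown below: a toy LightningModule (all module names, shapes, and hyperparameters are assumed, not taken from the report) trained with the PL 1.1-style Trainer arguments `gpus=3, accelerator="ddp"` to match the reported 3x RTX 2080 configuration.

```python
# Minimal sketch only; module, dataset, and hyperparameters are assumed.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Random toy data stands in for the real dataset
    x = torch.randn(512, 32)
    y = torch.randint(0, 2, (512,))
    train_loader = DataLoader(TensorDataset(x, y), batch_size=32)

    model = BoringModel()
    # DDP across 3 GPUs, matching the reported RTX 2080 x3 setup
    trainer = pl.Trainer(gpus=3, accelerator="ddp", max_epochs=10)
    trainer.fit(model, train_loader)
```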

Labels

  • bug (Something isn't working)
  • distributed (Generic distributed-related topic)
  • help wanted (Open to be worked on)
  • priority: 0 (High priority task)
