Training stops at random epochs without any warning or error message #13873

@hannody

Description

🐛 Bug

I am trying to use the DETR feature extractor. Once training starts, it stops at a random epoch (always below 20) with no further info, warning, or error.
I have tried the code on three different GPU-powered machines, plus Colab, and all give the same result. Every machine had at least 32 GB of RAM; the GPUs were a GTX 1060, an RTX 2080 Ti, and a Tesla T4, all running Ubuntu 20.04.
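A silent stop like this is often a hard crash (segfault in a native extension, or the process being killed) rather than a Python exception. As a hedged diagnostic sketch, independent of Lightning, you can enable the standard-library `faulthandler` at the top of the training script so that native crashes and termination signals at least dump a Python traceback before the process dies:

```python
import faulthandler
import signal
import sys

# Dump Python tracebacks on hard crashes (SIGSEGV, SIGABRT, SIGFPE, ...)
# that would otherwise kill the process with no Python-level error.
faulthandler.enable(file=sys.stderr, all_threads=True)

# On Unix, also dump tracebacks on SIGTERM, which a job scheduler or
# the system (e.g. after an out-of-memory condition) may send.
if hasattr(faulthandler, "register") and hasattr(signal, "SIGTERM"):
    faulthandler.register(signal.SIGTERM, file=sys.stderr, all_threads=True)

print(faulthandler.is_enabled())
```

If nothing is printed even with this enabled, checking `dmesg` for OOM-killer messages on the Linux machines would be a reasonable next step; Colab in particular kills sessions silently when RAM is exhausted.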

To Reproduce

Please have a look at the following notebook; it uses a small dataset and is based on the following Tutorial

My code can be found here:
https://colab.research.google.com/drive/1yyql7CPrly75TUBIMD-l16-oR8ykDMsR?usp=sharing

Expected behavior

To continue the training cycle.

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.6.5
  • PyTorch Version (e.g., 1.10): 1.12 and also 1.8 LTS
  • Python version (e.g., 3.9): 3.7, 3.8, 3.9
  • OS (e.g., Linux): Ubuntu 20.04
  • CUDA/cuDNN version: 11.0, 11.2, 11.5 (tried all three)
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

cc @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7


Labels

bug (Something isn't working), loops (Related to the Loop API)
