
RuntimeError and AssertionError handling in trainer.run_train #6807

@mibaumgartner

🐛 Bug

RuntimeError (e.g. out-of-memory errors) and AssertionError are swallowed when running trainer.fit(...). The error is not re-raised, so the script continues.
This can waste a lot of time in some cases:

# prepare code
...

# train (model raises an error, e.g. Out of Memory which is not raised by trainer)
trainer.fit(...)

# will continue here
# time intensive computation / evaluation / prediction
...

While the error is printed correctly, it is not raised and thus the script will continue.
https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/trainer/trainer.py#L621-L634
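
The pattern described above can be illustrated with a minimal, self-contained sketch. This is not the actual Lightning code: `fit_swallowing_errors` and `failing_loop` are hypothetical stand-ins, and the sketch only assumes that the trainer catches the error, prints the traceback, and returns normally.

```python
import traceback


def fit_swallowing_errors(train_loop):
    """Hypothetical stand-in for trainer.fit; not the actual Lightning code."""
    try:
        train_loop()
    except (RuntimeError, AssertionError):
        # The traceback is printed, but the exception is not re-raised.
        traceback.print_exc()
    finally:
        # Cleanup still runs, which is the desired part of the behavior.
        print("cleanup done")


def failing_loop():
    # Simulates a model that fails during training.
    raise RuntimeError("simulated out-of-memory error")


fit_swallowing_errors(failing_loop)

# Execution reaches this point even though training failed.
reached = True
print("script continued after failed training")
```

Because the call returns normally, the caller has no way to tell from control flow alone that training failed.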

Please reproduce using the BoringModel

To Reproduce

Use following BoringModel and post here

Expected behavior

The script should stop once the trainer has cleaned up when an AssertionError or RuntimeError occurs during training.

Environment

Note: Bugs with code are solved faster! Colab notebooks should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Simply saving the exception and re-raising it after the trainer has called the final hook should be sufficient.
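
A rough sketch of this suggestion, under stated assumptions: `run_train`, `train_loop`, and `on_train_end` are hypothetical names used for illustration and do not mirror the real trainer internals. The idea is only to remember the exception, run the final hook, and then raise.

```python
from typing import Callable, Optional


def run_train(train_loop: Callable[[], None], on_train_end: Callable[[], None]) -> None:
    """Hypothetical stand-in for the trainer's training entry point."""
    caught: Optional[BaseException] = None
    try:
        train_loop()
    except (RuntimeError, AssertionError) as err:
        # Remember the error instead of only printing it.
        caught = err

    # The final hook / cleanup runs whether or not training failed.
    on_train_end()

    if caught is not None:
        # Re-raise after cleanup so the calling script stops.
        raise caught


def failing_loop() -> None:
    # Simulates a model that fails during training.
    raise RuntimeError("simulated out-of-memory error")


events = []
try:
    run_train(failing_loop, lambda: events.append("on_train_end"))
except RuntimeError as err:
    events.append(f"raised: {err}")

print(events)
```

In this sketch the cleanup hook runs first and the original error then reaches the caller, so code placed after the training call is not executed unless the caller handles the exception explicitly. If the final hook itself raises, the saved error would be lost; a real fix would need to decide how to handle that case.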

Labels: bug, help wanted
