Trainer Error Handling Fix #6842

@shuyingsunshine21

Description

🐛 Bug

For distributed training, if a subset of ranks fails during a training step, the current code tries to shut down gracefully by calling

https://github.com/PyTorchLightning/pytorch-lightning/blob/22a266d8b8cf57455cc863e20491e416ec635ba7/pytorch_lightning/trainer/trainer.py#L634

However, because not all ranks enter this on_train_end, the model checkpoint logic it runs hangs while broadcasting.

https://github.com/PyTorchLightning/pytorch-lightning/blob/22a266d8b8cf57455cc863e20491e416ec635ba7/pytorch_lightning/callbacks/model_checkpoint.py#L723-L730

See also the discussion in #6807 and #6791.
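The hang described above can be illustrated with a small standard-library analogy. This is only a sketch under stated assumptions: `threading.Barrier` stands in for the broadcast (it is not the `torch.distributed` call Lightning uses), threads stand in for ranks, and `on_train_end`, `run_rank`, and `simulate` are hypothetical names. A timeout is used so the sketch terminates instead of hanging.

```python
import threading

WORLD_SIZE = 2  # assumed number of ranks for this sketch


def simulate() -> dict:
    # Stand-in for a collective: every rank must arrive before any can proceed.
    barrier = threading.Barrier(WORLD_SIZE)
    results = {}

    def on_train_end(rank: int) -> None:
        # The checkpoint logic would broadcast here; the barrier plays that role.
        try:
            barrier.wait(timeout=0.2)
            results[rank] = "collective completed"
        except threading.BrokenBarrierError:
            # In real distributed training there is no timeout, so this is a hang.
            results[rank] = "collective never completed"

    def run_rank(rank: int) -> None:
        try:
            if rank == 1:
                raise RuntimeError("simulated failure in a training step")
            # The healthy rank never reaches the end-of-training hook in this sketch.
            results[rank] = "still training"
        except RuntimeError:
            # The failed rank attempts a graceful shutdown on its own.
            on_train_end(rank)

    threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


outcome = simulate()
print(outcome)
```

Only the failed rank reaches the collective, so it waits for a peer that never arrives.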

Pitch

Keep the special case for the KeyboardInterrupt exception. For all other exceptions, re-raise the exception and remove the finally block:

  from traceback import print_exc

  try:
      ...  # run training epochs
      # hook
      self.train_loop.on_train_end()
  except KeyboardInterrupt:
      rank_zero_warn('Detected KeyboardInterrupt, attempting graceful shutdown...')
      # the user could press Ctrl+C many times; only shut down once
      if not self.interrupted:
          self.state = TrainerState.INTERRUPTED
          self.on_keyboard_interrupt()
          self.train_loop.on_train_end()
  except BaseException:
      # any other failure: print the traceback and re-raise without calling on_train_end
      print_exc()
      raise
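
To make the intended control flow concrete, here is a minimal, self-contained sketch of the pattern pitched above. `ToyTrainer` and its `calls` log are hypothetical stand-ins, not Lightning's `Trainer`. The sketch only shows which hooks run on a clean finish, on Ctrl+C, and on any other exception.

```python
import traceback


class ToyTrainer:
    """Hypothetical stand-in for the trainer; it records which hooks ran."""

    def __init__(self, error=None):
        self.error = error          # exception to raise during "training", if any
        self.interrupted = False
        self.calls = []

    def on_train_end(self):
        self.calls.append("on_train_end")

    def on_keyboard_interrupt(self):
        self.calls.append("on_keyboard_interrupt")

    def fit(self):
        try:
            if self.error is not None:
                raise self.error    # simulated failure inside the training epochs
            self.on_train_end()
        except KeyboardInterrupt:
            # graceful shutdown, performed only once
            if not self.interrupted:
                self.interrupted = True
                self.on_keyboard_interrupt()
                self.on_train_end()
        except BaseException:
            # no finally block: on_train_end is skipped and the error propagates
            traceback.print_exc()
            raise


ok = ToyTrainer()
ok.fit()

interrupted = ToyTrainer(KeyboardInterrupt())
interrupted.fit()

failed = ToyTrainer(RuntimeError("simulated rank failure"))
try:
    failed.fit()
except RuntimeError:
    pass

print(ok.calls, interrupted.calls, failed.calls)
```

In this sketch the failing trainer never calls `on_train_end`, so it never reaches the checkpoint logic, and the exception surfaces to the caller.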


Labels

bug (Something isn't working), help wanted (Open to be worked on)
