Description
🐛 Bug
For distributed training, if a subset of ranks fails during some training step, the current setting tries to shut down gracefully by calling `self.train_loop.on_train_end()` in a `finally` block.
However, since not all ranks enter `on_train_end`, the model-checkpoint logic performed there hangs while broadcasting.
See also discussions #6807 and #6791.
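To see why a partial failure hangs the remaining ranks, here is a minimal, framework-free sketch. It uses `threading.Barrier` as a stand-in for the collective broadcast; all names are illustrative, and a timeout is added so the example terminates instead of hanging forever:

```python
import threading

# Simulate two "ranks" that must all reach a collective (the barrier).
# If one rank fails before the collective, the other blocks until timeout,
# mirroring the checkpoint-broadcast hang described above.
barrier = threading.Barrier(2)
results = {}

def rank(rank_id, fail):
    if fail:
        results[rank_id] = "failed before collective"
        return  # this rank never reaches the barrier
    try:
        barrier.wait(timeout=0.5)  # stands in for the broadcast
        results[rank_id] = "broadcast ok"
    except threading.BrokenBarrierError:
        results[rank_id] = "hung waiting for peers"

threads = [threading.Thread(target=rank, args=(0, True)),
           threading.Thread(target=rank, args=(1, False))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

In a real job there is no timeout on the collective, so the surviving ranks block indefinitely.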
Pitch
Keep the special case for the KeyboardInterrupt exception; for all other exceptions, re-raise and remove the `finally` block:
```python
try:
    # ... (run training epochs)
    # hook
    self.train_loop.on_train_end()
except KeyboardInterrupt:
    rank_zero_warn('Detected KeyboardInterrupt, attempting graceful shutdown...')
    # user could press Ctrl+C many times... only shutdown once
    if not self.interrupted:
        self.state = TrainerState.INTERRUPTED
        self.on_keyboard_interrupt()
        self.train_loop.on_train_end()
except BaseException:
    print_exc()
    raise
```
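The proposed control flow can be exercised outside Lightning with a small standalone sketch. The class and method names below (`MiniTrainer`, `run_epochs`) are placeholders, not the actual Trainer API; the point is that `on_train_end` runs only on normal completion or on a (single) KeyboardInterrupt, while any other exception is re-raised without touching the checkpoint hook:

```python
from traceback import print_exc

class MiniTrainer:
    """Hypothetical stand-in for the Trainer, tracking which hooks fire."""

    def __init__(self):
        self.interrupted = False
        self.calls = []

    def on_train_end(self):
        self.calls.append("on_train_end")

    def on_keyboard_interrupt(self):
        self.calls.append("on_keyboard_interrupt")

    def fit(self, run_epochs):
        try:
            run_epochs()
            self.on_train_end()  # hook runs only on normal completion
        except KeyboardInterrupt:
            # user could press Ctrl+C many times... only shut down once
            if not self.interrupted:
                self.interrupted = True
                self.on_keyboard_interrupt()
                self.on_train_end()
        except BaseException:
            # any other failure: log it and re-raise; with no finally block,
            # on_train_end (and its checkpoint broadcast) is skipped
            print_exc()
            raise
```

With this structure, a rank that hits a RuntimeError propagates the error immediately instead of entering the checkpoint broadcast and hanging.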
To Reproduce
Use the following BoringModel and post it here.
Expected behavior
Environment
Note: Bugs with code are solved faster! Colab Notebooks should be made public!
- IDE: Please use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:

```shell
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
```
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (`conda`, `pip`, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: