Train End Error Handling Fix #6864
…oint_consolidate Update test_all_gather_grad.py
This reverts commit 9d4a2b8.
This reverts commit 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
This reverts commit ea74906.
This reverts commit bf70e43.
This reverts commit f172101.
This reverts commit 536c132.
This reverts commit 3a9fde9.
This reverts commit 7a369f4.
This reverts commit 8222dc9.
This reverts commit 6c095b2.
This reverts commit 250d0aa.
This reverts commit 8651d54.
This reverts commit dcdcd29.
Labeling as API change, since this removes support for saving checkpoints on Ctrl+C/exception, which might not be okay for some users. I'd like to have @williamFalcon's approval first.
Also, given the current changes, this should be updated since callback's …
Borda
left a comment
Can we have a Trainer argument that would enable all these emergency savings? I think this was quite a useful feature in case you run training for a while and then a timeout or any other kill comes...
I wonder if setting "frequent" checkpointing in training would help.
…lightning into train_end_try_catch
Looks like the test failure is related to a recent test: https://github.com/PyTorchLightning/pytorch-lightning/pull/6969/files
Because of the current error handling logic, it does not catch this assertion error.
Dear @shuyingsunshine21, you have 2 failing tests on Azure. Do you need help solving them? Best,
Oh, thanks, I did not notice that. Let me take a look.
@shuyingsunshine21 The issue is on our end. Seems like there was a fairscale update that is breaking our CI. We are taking a look.
What does this PR do?
This PR changes the logic so we NO LONGER try to save on KeyboardInterrupt or call callback's on_train_end.
Fixes #6842
Fixes #6807
Fixes #5766
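For reviewers, here is a minimal sketch of the behavior described above, not the actual diff. It assumes a simplified loop in plain Python: `RecordingCallback` and `fit` are hypothetical names for illustration only, and how the real Trainer treats exceptions other than KeyboardInterrupt is an assumption here (they simply propagate).

```python
class RecordingCallback:
    """Hypothetical stand-in for a callback; records which hooks ran."""

    def __init__(self):
        self.calls = []

    def on_train_start(self):
        self.calls.append("on_train_start")

    def on_train_end(self):
        self.calls.append("on_train_end")


def fit(train_fn, callback):
    """Hypothetical loop: on_train_end runs only if training finishes normally."""
    callback.on_train_start()
    try:
        train_fn()
    except KeyboardInterrupt:
        # Assumption: shut down gracefully, with no emergency checkpoint
        # and no on_train_end call.
        callback.calls.append("interrupted")
        return
    # Assumption: any other exception has already propagated out of fit(),
    # so this line is reached only on normal completion.
    callback.on_train_end()


def interrupted_training():
    # Simulates the user pressing Ctrl+C in the middle of training.
    raise KeyboardInterrupt


cb = RecordingCallback()
fit(interrupted_training, cb)
print(cb.calls)
```

Under these assumptions, a run interrupted with Ctrl+C records on_train_start but never on_train_end, which is the API change flagged in the conversation above.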
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃