Skip to content

Conversation

@shuyingsunshine21
Copy link
Contributor

@shuyingsunshine21 shuyingsunshine21 commented Apr 7, 2021

What does this PR do?

This PR changes the logic so we NO LONGER try to save on KeyboardInterrupt or call callback's on_train_end

Fixes #6842
Fixes #6807
Fixes #5766

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Shuying Sun and others added 30 commits March 23, 2021 12:06
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
…oint_consolidate

Update test_all_gather_grad.py
…1-checkpoint_consolidate"

This reverts commit c5053da, reversing
changes made to 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
@carmocca
Copy link
Contributor

carmocca commented Apr 9, 2021

Labeling as API change since this removes support for saving checkpoints on Ctrl+C/exception and it might not be okay for some users. I'd like to have @williamFalcon's approval first.

@carmocca carmocca added _Will design Includes a design discussion labels Apr 9, 2021
@carmocca
Copy link
Contributor

carmocca commented Apr 9, 2021

Also, given the current changes, this should be updated since callback's on_train_end won't be called on Ctrl+C

https://github.com/PyTorchLightning/pytorch-lightning/blame/master/docs/source/common/trainer.rst#L145-L149

@awaelchli awaelchli linked an issue Apr 13, 2021 that may be closed by this pull request
@Borda Borda requested a review from carmocca April 13, 2021 20:42
Copy link
Collaborator

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a Traner argument that would enable all these emergency savings, I this was quite a useful feature in case you run training for a while and then timeout or any other kill came...

@shuyingsunshine21
Copy link
Contributor Author

can we have a Traner argument that would enable all these emergency savings, I this was quite a useful feature in case you run training for a while and then timeout or any other kill came...

wonder if setting "frequent" checkpointing in training would help.

@shuyingsunshine21
Copy link
Contributor Author

looks like the test fail is related to a recent test: https://github.com/PyTorchLightning/pytorch-lightning/pull/6969/files

because of the current error handling logic, it does not catch this assertion error

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/trainer/test_training_loop.py#L140-L143

@tchaton
Copy link
Contributor

tchaton commented Apr 14, 2021

Dear @shuyingsunshine21,

You have 2 failing test on Azure. Do you need help to solve this ?

Best,
T.C

@tchaton tchaton removed the _Will label Apr 14, 2021
@shuyingsunshine21
Copy link
Contributor Author

Dear @shuyingsunshine21,

You have 2 failing test on Azure. Do you need help to solve this ?

Best,
T.C

Oh, thanks, did not notice that. Let me take a look.

@carmocca
Copy link
Contributor

Oh, thanks, did not notice that. Let me take a look.

@shuyingsunshine21 The issue is on our end. Seems like there was a fairscale update that is breaking our CI. We are taking a look.

@carmocca carmocca merged commit 03a73b3 into Lightning-AI:master Apr 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working design Includes a design discussion distributed Generic distributed-related topic

Projects

None yet

10 participants