
Conversation


@SeanNaren SeanNaren commented Feb 21, 2021

What does this PR do?

Received a few reports that DeepSpeed quickly produces NaNs when using ZeRO. This appears to be related to the default loss-scaling values, which are not currently exposed, so the user has to override the entire config just to set them.

Expose these values so that the user can set them directly. I considered putting them into the precision plugin, but then the training_type_plugin would need to be aware of the precision plugin, and I'm not a fan of that given the longer-term plan of having the training-type plugin handle precision.
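For context, the loss-scaling knobs in question live in DeepSpeed's `fp16` config block. Below is a minimal sketch of how exposed constructor arguments might map onto that block — the field names follow DeepSpeed's fp16 config documentation, but the helper name and defaults here are illustrative, not the plugin's actual API:

```python
def build_fp16_config(
    loss_scale=0,              # 0 selects dynamic loss scaling
    initial_scale_power=32,    # initial scale = 2 ** initial_scale_power
    loss_scale_window=1000,    # steps between scale-increase attempts
    hysteresis=2,              # overflows tolerated before lowering the scale
    min_loss_scale=1,          # floor for the dynamic scale
):
    """Illustrative helper: map keyword arguments onto DeepSpeed's `fp16` block."""
    return {
        "fp16": {
            "enabled": True,
            "loss_scale": loss_scale,
            "initial_scale_power": initial_scale_power,
            "loss_scale_window": loss_scale_window,
            "hysteresis": hysteresis,
            "min_loss_scale": min_loss_scale,
        }
    }

# Lowering initial_scale_power is one way to avoid the early overflow/NaN
# reports mentioned above, since the default initial scale (2**32) is large.
config = build_fp16_config(initial_scale_power=16)
```

Exposing these as plain constructor arguments means users can tune loss scaling without writing a full DeepSpeed JSON config by hand.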

I've also included a few tests for edge cases that needed checking, plus a check of whether the single-GPU tests need to run as special tests (we might have to revert this; we'll see).

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren SeanNaren added the bug label Feb 21, 2021
@SeanNaren SeanNaren added this to the 1.2.x milestone Feb 21, 2021
@SeanNaren SeanNaren self-assigned this Feb 21, 2021
@SeanNaren SeanNaren added the 3rd party and distributed labels Feb 21, 2021

codecov bot commented Feb 21, 2021

Codecov Report

Merging #6115 (50224ee) into master (3b0e4e0) will decrease coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #6115   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         160     160           
  Lines       11405   11405           
======================================
- Hits        10661   10629   -32     
- Misses        744     776   +32     

@Borda Borda merged commit 432e563 into master Feb 21, 2021
@Borda Borda deleted the fix/fp16_enable branch February 21, 2021 20:43
SeanNaren added a commit that referenced this pull request Mar 16, 2021
* Expose deepspeed config parameters to init function due to instability in parameters

* See if tests can run on normal CI, without special tests

* Add changelog

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <[email protected]>

(cherry picked from commit 432e563)
Borda pushed a commit that referenced this pull request Mar 16, 2021
(same commit message as above)

(cherry picked from commit 432e563)

Add missing config
lexierule pushed a commit that referenced this pull request Mar 16, 2021
(same commit message as above)

(cherry picked from commit 432e563)

Add missing config


5 participants