
Conversation


@SeanNaren SeanNaren commented Feb 21, 2021

What does this PR do?

Received a few reports that DeepSpeed quickly produces NaNs when using ZeRO. This appears to be related to the default loss-scaling values, which are not currently exposed, so the user has to override the entire config just to set them.

Expose these values so that the user can set them directly. I considered putting them into the precision plugin, but then the training_type_plugin would need to be aware of the precision plugin, and I'm not a fan of that given the longer-term plan of having the training-type plugin handle precision.
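For context, the loss-scaling knobs in question live in DeepSpeed's `fp16` config block. Below is a minimal sketch of how exposed constructor arguments might map onto that block — the field names follow DeepSpeed's fp16 config documentation, but the helper name and defaults here are illustrative, not the plugin's actual API:

```python
def build_fp16_config(
    loss_scale=0,              # 0 selects dynamic loss scaling
    initial_scale_power=32,    # initial scale = 2 ** initial_scale_power
    loss_scale_window=1000,    # steps between scale-increase attempts
    hysteresis=2,              # overflows tolerated before lowering the scale
    min_loss_scale=1,          # floor for the dynamic scale
):
    """Illustrative helper: map keyword arguments onto DeepSpeed's `fp16` block."""
    return {
        "fp16": {
            "enabled": True,
            "loss_scale": loss_scale,
            "initial_scale_power": initial_scale_power,
            "loss_scale_window": loss_scale_window,
            "hysteresis": hysteresis,
            "min_loss_scale": min_loss_scale,
        }
    }

# Lowering initial_scale_power is one way to avoid the early overflow/NaN
# reports mentioned above, since the default initial scale (2**32) is large.
config = build_fp16_config(initial_scale_power=16)
```

Exposing these as plain constructor arguments means users can tune loss scaling without writing a full DeepSpeed JSON config by hand.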

I've also included a few tests for edge cases that needed checking, plus a check of whether the single-GPU tests need to run as special tests (we might have to revert this; we'll see).

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren SeanNaren added the bug label Feb 21, 2021
@SeanNaren SeanNaren added this to the 1.2.x milestone Feb 21, 2021
@SeanNaren SeanNaren self-assigned this Feb 21, 2021
@SeanNaren SeanNaren added the 3rd party and distributed labels Feb 21, 2021

codecov bot commented Feb 21, 2021

Codecov Report

Merging #6115 (50224ee) into master (3b0e4e0) will decrease coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #6115   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         160     160           
  Lines       11405   11405           
======================================
- Hits        10661   10629   -32     
- Misses        744     776   +32     

@Borda Borda merged commit 432e563 into master Feb 21, 2021
@Borda Borda deleted the fix/fp16_enable branch February 21, 2021 20:43
SeanNaren added a commit that referenced this pull request Mar 16, 2021
* Expose deepspeed config parameters to init function due to instability in parameters

* See if tests can run on normal CI, without special tests

* Add changelog

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <[email protected]>

(cherry picked from commit 432e563)
Borda pushed a commit that referenced this pull request Mar 16, 2021
(same commit message as above)

(cherry picked from commit 432e563)

Add missing config
lexierule pushed a commit that referenced this pull request Mar 16, 2021
(same commit message as above)

(cherry picked from commit 432e563)

Add missing config


5 participants