Allow training type plugin to delay optimizer creation (FSDP 2/n) #6331
Conversation
Codecov Report
Coverage Diff:

| | master | #6331 | +/- |
| --- | --- | --- | --- |
| Coverage | 46% | 91% | +45% |
| Files | 167 | 160 | -7 |
| Lines | 10722 | 11407 | +685 |
| Hits | 4905 | 10377 | +5472 |
| Misses | 5817 | 1030 | -4787 |
| """ | ||
| self.connect_training_type_plugin(self.training_type_plugin, model) | ||
| self.setup_optimizers(trainer) | ||
| if not self.training_type_plugin.setup_optimizers_after_dispatch: |
Hey @SeanNaren

- Wouldn't it be better to move `self.setup_optimizers(trainer)` into dispatch directly, or is there a blocking part?
- If we stick with `setup_optimizers_after_dispatch`, wouldn't it be clearer to name it `setup_optimizers_in_pre_dispatch`?
I never thought about it honestly, that's a really good point lol. Let me just move it and see what happens.
EDIT: The only edge case I see is that Sharded Training (not FSDP) requires re-configuring the optimizers, which might need a small refactor. Since no hooks are exposed to the user at this stage, I think we're fine to do so. Will make the change and see if it works out.
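For context on the hook being discussed, here is a minimal, self-contained sketch (not the actual Lightning classes) of the control flow a `setup_optimizers_in_pre_dispatch` property would describe: the plugin exposes a boolean, and the accelerator consults it both during setup and during pre-dispatch.

```python
class TrainingTypePlugin:
    @property
    def setup_optimizers_in_pre_dispatch(self) -> bool:
        """Return True to delay optimizer creation until pre-dispatch,
        i.e. after the plugin has wrapped/sharded the model."""
        return False


class ShardedLikePlugin(TrainingTypePlugin):
    """Hypothetical plugin that needs the wrapped model before building optimizers."""

    @property
    def setup_optimizers_in_pre_dispatch(self) -> bool:
        return True


class Accelerator:
    def __init__(self, training_type_plugin: TrainingTypePlugin):
        self.training_type_plugin = training_type_plugin

    def setup_optimizers(self, trainer):
        print("building optimizers")

    def setup(self, trainer, model):
        # eager path: create the optimizers up front, as before
        if not self.training_type_plugin.setup_optimizers_in_pre_dispatch:
            self.setup_optimizers(trainer)

    def pre_dispatch(self, trainer):
        # deferred path: the model has been wrapped/sharded by now
        if self.training_type_plugin.setup_optimizers_in_pre_dispatch:
            self.setup_optimizers(trainer)
```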
After #6506 I understand this piece of code a bit better.
Because Apex wraps the optimizers, they have to be set up before the precision plugin is set up.
So I'll update the docstrings to make it clear that enabling `setup_optimizers_in_pre_dispatch` may break Apex.
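For readers unfamiliar with the constraint: Apex's `amp.initialize` receives already-constructed optimizers and returns wrapped ones, so precision setup cannot run before the optimizers exist. A rough illustration (assumes NVIDIA Apex and a CUDA device are available; the surrounding code is illustrative, not Lightning internals):

```python
import torch
from torch import nn
from apex import amp  # NVIDIA Apex must be installed

model = nn.Linear(8, 2).cuda()
# the optimizer must already exist here ...
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# ... because amp.initialize wraps it in place of the original
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

# If optimizer creation were deferred to pre-dispatch, the call above would have
# nothing to wrap, which is why deferring is documented as incompatible with Apex.
```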
total n00b question: is it possible to delay setting up both the optimizers and the precision plugin until pre-dispatch to make Apex work?
tchaton left a comment:
LGTM!
What does this PR do?
Allows the training type plugin to delay optimizer creation. This is useful when the optimizers can only be created after the model has been wrapped; in the case of FSDP, that is after the model has been sharded onto devices.
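As a rough illustration of the motivation (a minimal sketch, not the actual FSDP plugin; `MyShardedPlugin` and `wrap_model` are hypothetical stand-ins), the optimizer is built only after wrapping so that it references the wrapped/sharded parameters rather than the originals:

```python
import torch
from torch import nn


class MyShardedPlugin:
    """Hypothetical plugin that shards/wraps the model before optimizers exist."""

    @property
    def setup_optimizers_in_pre_dispatch(self) -> bool:
        # optimizers must see the wrapped parameters, so defer their creation
        return True

    def wrap_model(self, model: nn.Module) -> nn.Module:
        # stand-in for the real sharding step (e.g. FullyShardedDataParallel)
        return model


def pre_dispatch(plugin: MyShardedPlugin, model: nn.Module):
    wrapped = plugin.wrap_model(model)
    # built only now, so it holds references to the wrapped model's parameters
    optimizer = torch.optim.Adam(wrapped.parameters(), lr=1e-3)
    return wrapped, optimizer
```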