
Conversation

@amogkam (Contributor) commented Mar 15, 2021

What does this PR do?

This PR automatically sets the sync_batchnorm attribute for the training_type_plugin in the accelerator_connector. This is useful for custom plugins when sync_batchnorm is not known during plugin instantiation.
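For illustration, the behaviour amounts to roughly the following (a minimal sketch with assumed names, not the literal change to accelerator_connector.py):

```python
# Sketch of the intended behaviour (names are illustrative, not the actual diff):
# if a user-supplied training type plugin was created without knowing
# sync_batchnorm, copy the value passed to the Trainer onto it.

class MyCustomPlugin:
    """Stand-in for a user-defined TrainingTypePlugin."""

    def __init__(self):
        self.sync_batchnorm = None  # unknown at instantiation time


def set_sync_batchnorm_if_unset(plugin, trainer_sync_batchnorm: bool):
    # Only touch the attribute if the plugin exposes it and it was left unset.
    if hasattr(plugin, "sync_batchnorm") and plugin.sync_batchnorm is None:
        plugin.sync_batchnorm = trainer_sync_batchnorm
    return plugin


plugin = set_sync_batchnorm_if_unset(MyCustomPlugin(), trainer_sync_batchnorm=True)
assert plugin.sync_batchnorm is True
```

The assert shows the intended outcome: a plugin created without the flag ends up with the value passed to the Trainer.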

Fixes #<issue_number>

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren (Contributor)

We may want a small test for this, just to make sure we catch this case! Would you be able to write a small test using the BoringModel with a custom plugin, just to ensure that sync_batchnorm is set correctly if not present?
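For reference, such a test could look roughly like this. This is only a sketch: the CustomPlugin subclass, the tests.helpers import path for BoringModel, and the exact Trainer arguments are assumptions, and the test finally added in this PR may differ.

```python
import pytest
import torch

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin
from tests.helpers import BoringModel  # helper location assumed


class CustomPlugin(DDPPlugin):
    """A user plugin created without passing sync_batchnorm explicitly."""


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
def test_sync_batchnorm_set(tmpdir):
    plugin = CustomPlugin()
    trainer = Trainer(
        default_root_dir=tmpdir,
        gpus=2,
        sync_batchnorm=True,  # requested on the Trainer, not on the plugin
        plugins=[plugin],
        fast_dev_run=True,
    )
    trainer.fit(BoringModel())
    # The accelerator connector should have propagated the flag onto the plugin.
    assert plugin.sync_batchnorm is True
```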

@SeanNaren mentioned this pull request Mar 15, 2021
@pep8speaks commented Mar 15, 2021

Hello @amogkam! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-19 19:49:25 UTC

@justusschock (Member) left a comment

@amogkam I'll approve, but please address the comment from @carmocca

@SeanNaren added and then removed the ready (PRs ready to be merged) label Mar 15, 2021
@carmocca added the bug (Something isn't working) label Mar 15, 2021
@carmocca added this to the 1.2.x milestone Mar 15, 2021
@carmocca (Contributor)

Please do not ignore the "Before submitting" and "PR review" sections at the top.

@amogkam (Contributor, Author) commented Mar 17, 2021

Is this ready to get merged in?

@Borda (Collaborator) commented Mar 17, 2021

> Is this ready to get merged in?

Seems you have a failing GPU test: test_sync_batchnorm_set.

@amogkam (Contributor, Author) commented Mar 17, 2021

@Borda which job is the failing test on? I’m not able to find it.

@kaushikb11 (Contributor)

@amogkam You can see the failing tests here.

@amogkam (Contributor, Author) commented Mar 18, 2021

I skipped the test on GPU, but the CI is still failing. Any suggestions here? It doesn't look like it's related to this PR.

@amogkam (Contributor, Author) commented Mar 18, 2021

All the required jobs are passing, though. Is this enough to merge this in?

@codecov (bot) commented Mar 18, 2021

Codecov Report

Merging #6536 (e2fafe5) into master (ea36ee3) will decrease coverage by 8%.
The diff coverage is 67%.

❗ Current head e2fafe5 differs from the pull request's most recent head e033f51. Consider uploading reports for commit e033f51 to get more accurate results.

@@           Coverage Diff           @@
##           master   #6536    +/-   ##
=======================================
- Coverage      94%     86%    -8%     
=======================================
  Files         166     168     +2     
  Lines       11634   12205   +571     
=======================================
- Hits        10947   10533   -414     
- Misses        687    1672   +985     

@kaushikb11 (Contributor) left a comment

Thanks @amogkam for your contribution! The test was failing on Windows, so I added a skip for that.

Just wanted to know: why does the test need to be skipped on GPU machines?

@amogkam (Contributor, Author) commented Mar 19, 2021

@kaushikb11 It was failing with this error:

pytorch_lightning/trainer/trainer.py:469: in fit
    self.pre_dispatch()
pytorch_lightning/trainer/trainer.py:496: in pre_dispatch
    self.accelerator.pre_dispatch()
pytorch_lightning/accelerators/accelerator.py:91: in pre_dispatch
    self.training_type_plugin.pre_dispatch()
pytorch_lightning/plugins/training_type/ddp.py:253: in pre_dispatch
    self.configure_ddp()
pytorch_lightning/plugins/training_type/ddp.py:198: in configure_ddp
    **self._ddp_kwargs,
/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py:333: in __init__
    self.broadcast_bucket_size)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DistributedDataParallel(
  (module): LightningDistributedModule(
    (module): BoringModel(
      (layer): Linear(in_features=32, out_features=2, bias=True)
    )
  )
)
tensors = [tensor([[-0.0684, -0.0013,  0.0504,  0.1136, -0.1166, -0.1267, -0.0770, -0.1303,
          0.0095, -0.1341, -0.0082, ...0.0698,
          0.0967, -0.1245, -0.1528, -0.0294, -0.0254, -0.0679,  0.0465, -0.1357]]), tensor([ 0.1718, -0.0356])]
buffer_size = 262144000

    def _distributed_broadcast_coalesced(self, tensors, buffer_size):
>       dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
E       RuntimeError: Tensors must be CUDA and dense

So I thought that the BoringModel just didn't work on GPU and skipped it. I'll remove the skip and see if it passes now.
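For context, "Tensors must be CUDA and dense" typically means DistributedDataParallel tried to broadcast parameters or buffers that were still on the CPU over an NCCL process group. The generic fix in plain PyTorch (not necessarily what was wrong in this particular test) is to move the module to its CUDA device before wrapping it:

```python
import torch
from torch.nn.parallel import DistributedDataParallel


def wrap_model(model: torch.nn.Module, local_rank: int) -> DistributedDataParallel:
    # Assumes torch.distributed.init_process_group("nccl") has already been called.
    device = torch.device("cuda", local_rank)
    # Parameters/buffers must be on the GPU before DDP broadcasts them over NCCL.
    model = model.to(device)
    return DistributedDataParallel(model, device_ids=[local_rank])
```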

@kaushikb11 enabled auto-merge (squash) Mar 19, 2021 19:50
@kaushikb11 merged commit 3b72bcc into Lightning-AI:master Mar 19, 2021
Borda pushed commits that referenced this pull request Mar 23, 2021
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Roger Shieh <[email protected]>
Co-authored-by: Kaushik Bokka <[email protected]>
(cherry picked from commit 3b72bcc)
carmocca added a commit that referenced this pull request Mar 29, 2021
Borda pushed a commit that referenced this pull request Mar 30, 2021
lexierule pushed a commit that referenced this pull request Mar 30, 2021

Labels

bug (Something isn't working), ready (PRs ready to be merged)

10 participants