
Conversation

@amogkam (Contributor) commented Mar 15, 2021

What does this PR do?

This PR automatically sets the sync_batchnorm attribute for the training_type_plugin in the accelerator_connector. This is useful for custom plugins when sync_batchnorm is not known during plugin instantiation.
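For illustration, the behaviour amounts to roughly the following (a minimal sketch with assumed names, not the literal change to accelerator_connector.py):

```python
# Sketch of the intended behaviour (names are illustrative, not the actual diff):
# if a user-supplied training type plugin was created without knowing
# sync_batchnorm, copy the value passed to the Trainer onto it.

class MyCustomPlugin:
    """Stand-in for a user-defined TrainingTypePlugin."""

    def __init__(self):
        self.sync_batchnorm = None  # unknown at instantiation time


def set_sync_batchnorm_if_unset(plugin, trainer_sync_batchnorm: bool):
    # Only touch the attribute if the plugin exposes it and it was left unset.
    if hasattr(plugin, "sync_batchnorm") and plugin.sync_batchnorm is None:
        plugin.sync_batchnorm = trainer_sync_batchnorm
    return plugin


plugin = set_sync_batchnorm_if_unset(MyCustomPlugin(), trainer_sync_batchnorm=True)
assert plugin.sync_batchnorm is True
```

The assert shows the intended outcome: a plugin created without the flag ends up with the value passed to the Trainer.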

Fixes #<issue_number>

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren (Contributor)

We may want a small test for this, just to make sure we catch this case! Would you be able to write a small test using the BoringModel with a custom plugin, just to ensure that sync_batchnorm is set correctly if not present?
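For reference, such a test could look roughly like this. This is only a sketch: the CustomPlugin subclass, the tests.helpers import path for BoringModel, and the exact Trainer arguments are assumptions, and the test finally added in this PR may differ.

```python
import pytest
import torch

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin
from tests.helpers import BoringModel  # helper location assumed


class CustomPlugin(DDPPlugin):
    """A user plugin created without passing sync_batchnorm explicitly."""


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
def test_sync_batchnorm_set(tmpdir):
    plugin = CustomPlugin()
    trainer = Trainer(
        default_root_dir=tmpdir,
        gpus=2,
        sync_batchnorm=True,  # requested on the Trainer, not on the plugin
        plugins=[plugin],
        fast_dev_run=True,
    )
    trainer.fit(BoringModel())
    # The accelerator connector should have propagated the flag onto the plugin.
    assert plugin.sync_batchnorm is True
```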

@SeanNaren mentioned this pull request Mar 15, 2021
@pep8speaks commented Mar 15, 2021

Hello @amogkam! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-19 19:49:25 UTC

@justusschock (Member) left a comment

@amogkam I'll approve, but please address the comment from @carmocca

@SeanNaren added and then removed the ready (PRs ready to be merged) label Mar 15, 2021
@carmocca added the bug (Something isn't working) label Mar 15, 2021
@carmocca added this to the 1.2.x milestone Mar 15, 2021
@carmocca (Contributor)

Please do not ignore the "Before submitting" and "PR review" sections at the top.

@amogkam (Contributor, Author) commented Mar 17, 2021

Is this ready to get merged in?

@Borda (Collaborator) commented Mar 17, 2021

> Is this ready to get merged in?

Seems you have a failing GPU test: test_sync_batchnorm_set.

@amogkam (Contributor, Author) commented Mar 17, 2021

@Borda which job is the failing test on? I’m not able to find it.

@kaushikb11 (Contributor)

@amogkam You can see the failing tests here.

@amogkam (Contributor, Author) commented Mar 18, 2021

I skipped the test on GPU, but the CI is still failing. Any suggestions here? It doesn't look like it's related to this PR.

@amogkam (Contributor, Author) commented Mar 18, 2021

All the required jobs are passing, though. Is this enough to merge this in?

@codecov (bot) commented Mar 18, 2021

Codecov Report

Merging #6536 (e2fafe5) into master (ea36ee3) will decrease coverage by 8%.
The diff coverage is 67%.

❗ Current head e2fafe5 differs from the pull request's most recent head e033f51. Consider uploading reports for commit e033f51 to get more accurate results.

@@           Coverage Diff           @@
##           master   #6536    +/-   ##
=======================================
- Coverage      94%     86%    -8%     
=======================================
  Files         166     168     +2     
  Lines       11634   12205   +571     
=======================================
- Hits        10947   10533   -414     
- Misses        687    1672   +985     

@kaushikb11 (Contributor) left a comment

Thanks @amogkam for your contribution! The test was failing on Windows, so I added a skip for that.

Just wanted to know: why does the test need to be skipped on GPU machines?

@amogkam (Contributor, Author) commented Mar 19, 2021

@kaushikb11 It was failing with this error:

pytorch_lightning/trainer/trainer.py:469: in fit
    self.pre_dispatch()
pytorch_lightning/trainer/trainer.py:496: in pre_dispatch
    self.accelerator.pre_dispatch()
pytorch_lightning/accelerators/accelerator.py:91: in pre_dispatch
    self.training_type_plugin.pre_dispatch()
pytorch_lightning/plugins/training_type/ddp.py:253: in pre_dispatch
    self.configure_ddp()
pytorch_lightning/plugins/training_type/ddp.py:198: in configure_ddp
    **self._ddp_kwargs,
/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py:333: in __init__
    self.broadcast_bucket_size)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DistributedDataParallel(
  (module): LightningDistributedModule(
    (module): BoringModel(
      (layer): Linear(in_features=32, out_features=2, bias=True)
    )
  )
)
tensors = [tensor([[-0.0684, -0.0013,  0.0504,  0.1136, -0.1166, -0.1267, -0.0770, -0.1303,
          0.0095, -0.1341, -0.0082, ...0.0698,
          0.0967, -0.1245, -0.1528, -0.0294, -0.0254, -0.0679,  0.0465, -0.1357]]), tensor([ 0.1718, -0.0356])]
buffer_size = 262144000

    def _distributed_broadcast_coalesced(self, tensors, buffer_size):
>       dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
E       RuntimeError: Tensors must be CUDA and dense

So I thought that the BoringModel just didn't work on GPU and skipped it. I'll remove the skip and see if it passes now.
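For context, "Tensors must be CUDA and dense" typically means DistributedDataParallel tried to broadcast parameters or buffers that were still on the CPU over an NCCL process group. The generic fix in plain PyTorch (not necessarily what was wrong in this particular test) is to move the module to its CUDA device before wrapping it:

```python
import torch
from torch.nn.parallel import DistributedDataParallel


def wrap_model(model: torch.nn.Module, local_rank: int) -> DistributedDataParallel:
    # Assumes torch.distributed.init_process_group("nccl") has already been called.
    device = torch.device("cuda", local_rank)
    # Parameters/buffers must be on the GPU before DDP broadcasts them over NCCL.
    model = model.to(device)
    return DistributedDataParallel(model, device_ids=[local_rank])
```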

@kaushikb11 enabled auto-merge (squash) Mar 19, 2021 19:50
@kaushikb11 merged commit 3b72bcc into Lightning-AI:master Mar 19, 2021
Borda pushed commits that referenced this pull request Mar 23, 2021
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Roger Shieh <[email protected]>
Co-authored-by: Kaushik Bokka <[email protected]>
(cherry picked from commit 3b72bcc)
carmocca added a commit that referenced this pull request Mar 29, 2021
Borda pushed a commit that referenced this pull request Mar 30, 2021
lexierule pushed a commit that referenced this pull request Mar 30, 2021

Labels

bug (Something isn't working), ready (PRs ready to be merged)

10 participants