Conversation

@SeanNaren SeanNaren commented Jun 21, 2022

What does this PR do?

Closes #12317.

This fixes `estimated_stepping_batches` for the `DeepSpeedStrategy`.

This solution is optimal; however, it touches internals, which may prove error-prone in the future. The reason we do not use `model_to_device` is that we do not want to move the entire model to the device when using DeepSpeed Stage 3; we rely on DeepSpeed to manage device sharding/assignment.
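To make the idea concrete, here is a minimal sketch (not the actual Lightning source; class and method names are assumptions) of the approach described above: make sure the process group is initialized before `configure_optimizers` runs, so collectives behind `estimated_stepping_batches` are safe, while deliberately *not* moving the model to the device, since DeepSpeed Stage 3 shards parameters across devices itself.

```python
# Hedged sketch only -- illustrates the described fix, not Lightning internals.
import os
import torch
import torch.distributed as dist


class SketchDeepSpeedStrategy:
    """Illustrative stand-in for a strategy class; names are assumptions."""

    def setup(self, model: torch.nn.Module, device_rank: int) -> None:
        # Single-process-friendly defaults so the sketch is self-contained.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29501")

        # Initialize distributed comms early, so anything called from
        # configure_optimizers (e.g. estimated_stepping_batches) can use
        # collectives safely.
        if not dist.is_initialized():
            dist.init_process_group(
                backend="gloo",  # "nccl" on GPU; gloo keeps the sketch CPU-friendly
                rank=device_rank,
                world_size=int(os.environ.get("WORLD_SIZE", "1")),
            )

        # NOTE: no model.to(device) / model_to_device here -- with DeepSpeed
        # Stage 3 the engine manages device placement and sharding itself.
        self.model = model
```

The key design choice mirrored here is that environment/comms setup is decoupled from model placement: only the former is needed to compute stepping-batch estimates.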

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @SeanNaren @awaelchli @rohitgr7 @akihironitta

@SeanNaren SeanNaren added bug Something isn't working strategy: deepspeed labels Jun 21, 2022
@SeanNaren SeanNaren added this to the pl:1.6.x milestone Jun 21, 2022
@SeanNaren SeanNaren self-assigned this Jun 21, 2022
@SeanNaren SeanNaren changed the title Fix/deepspeed [BUG] estimated_stepping_batches requires distributed comms in configure_optimizers for DeepSpeedStrategy Jun 21, 2022
@mergify mergify bot added the ready PRs ready to be merged label Jun 21, 2022
@Borda Borda left a comment

lgtm

@SeanNaren SeanNaren merged commit 89e2e69 into master Jun 21, 2022
@SeanNaren SeanNaren deleted the fix/deepspeed branch June 21, 2022 16:48
@rohitgr7 rohitgr7 mentioned this pull request Jul 1, 2022
rohitgr7 pushed a commit that referenced this pull request Jul 1, 2022
lexierule pushed a commit that referenced this pull request Jul 12, 2022
* update NGC docker (#13136)

* update docker
* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>

* Decouple pulling legacy checkpoints from existing GHA workflows and docker files (#13185)

* Add pull-legacy-checkpoints action
* Replace pulls with the new action and script
* Simplify

* Merge pull request #13250 from PyTorchLightning/ci/rm-base

CI: Remove simple test `ci_test-base.yml`

* Update rich requirement from !=10.15.*,<=12.0.0,>=10.2.2 to >=10.2.2,!=10.15.0.a,<13.0.0 in /requirements (#13047)

* Update rich requirement in /requirements

Updates the requirements on [rich](https://github.com/willmcgugan/rich) to permit the latest version.
- [Release notes](https://github.com/willmcgugan/rich/releases)
- [Changelog](https://github.com/Textualize/rich/blob/master/CHANGELOG.md)
- [Commits](Textualize/rich@v10.2.2...v12.4.1)

---
updated-dependencies:
- dependency-name: rich
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix torch.distributed._sharded_tensor DeprecationWarning (#13261)

* update tutorials (#13268)

* [BUG] `estimated_stepping_batches` requires distributed comms in `configure_optimizers` for `DeepSpeedStrategy` (#13350)

* Update torchmetrics requirement from <=0.7.2,>=0.4.1 to >=0.4.1,<0.9.2 in /requirements (#13275)

Update torchmetrics requirement in /requirements

Updates the requirements on [torchmetrics](https://github.com/PyTorchLightning/metrics) to permit the latest version.
- [Release notes](https://github.com/PyTorchLightning/metrics/releases)
- [Changelog](https://github.com/PyTorchLightning/metrics/blob/master/CHANGELOG.md)
- [Commits](Lightning-AI/torchmetrics@v0.4.1...v0.9.1)

---
updated-dependencies:
- dependency-name: torchmetrics
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix mypy errors for model summary utilities (#13384)

* rename org Lightning AI

* Modified python version check to accommodate for legacy version styles (#13420)

Co-authored-by: Carlos Mocholí <[email protected]>

(cherry picked from commit b332b66)

* Call `set_epoch` for distributed batch samplers (#13396)

Co-authored-by: Jirka <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>

(cherry picked from commit 2dd332f)

* _RICH_AVAILABLE

* _FAIRSCALE_AVAILABLE

* _BAGUA_AVAILABLE

* redefine

* chlog spaces

* CI: Fix `fatal: unsafe repository` (#13515)

* update release date

* CI: azure rename

* Restore log step during restart (#13467)

Co-authored-by: Carlos Mocholí <[email protected]>

* remove redundant test

* Update CI setup (#13291)

* drop mamba
* use legacy GPU machines

* fix schema check

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam J. Stewart <[email protected]>
Co-authored-by: Sean Naren <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Jirka <[email protected]>
Co-authored-by: Martino Sorbaro <[email protected]>

Labels

bug Something isn't working ready PRs ready to be merged strategy: deepspeed

Development

Successfully merging this pull request may close these issues.

Can't use estimated_stepping_batches in configure_optimizers with DDP

4 participants