
Conversation

@SeanNaren
Contributor

@SeanNaren SeanNaren commented May 5, 2022

What does this PR do?

Related: #12334, #12447.

After merging the associated PRs above and trying to use native FSDP, I noticed several things were wrong:

  1. Tests were not being run, because GPU CI uses PyTorch 1.8.x (LTS). This is being resolved separately by potentially introducing an additional GPU run for the latest PyTorch. As a result, the tests were never exercised in CI, and they were all failing.
  2. FSDP in PyTorch 1.11 seems to be broken in various ways, with state_dict saving/loading issues and no mixed precision. So many fixes have landed for 1.12 (nightly) that no user should really be on 1.11 FSDP; they should be using 1.12 FSDP.
  3. The native precision plugin was not selected when using native FSDP.

This PR addresses 2 and 3; 1 remains a separate issue. I've run each test individually to ensure the integration works (and updated it a bit, introducing the mixed-precision support that was missing). I've also moved the requirement to 1.12dev, which will work with PyTorch nightly.
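
For illustration, here is a minimal usage sketch of what the fixed integration is meant to enable on PyTorch >= 1.12 with two GPUs. The strategy alias `"fsdp_native"` and the exact `Trainer` flags are assumptions based on the Lightning 1.7-era API, not a verbatim excerpt from this PR.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # With FSDP the parameters are flattened/sharded after wrapping, so fetch
        # them from the wrapped module held by the trainer.
        return torch.optim.AdamW(self.trainer.model.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    train_loader = DataLoader(dataset, batch_size=8)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,
        strategy="fsdp_native",  # assumed registry alias for the native-FSDP strategy
        precision=16,            # the mixed-precision path this PR wires up for native FSDP
        max_epochs=1,
    )
    trainer.fit(LitClassifier(), train_loader)
```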

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @tchaton @rohitgr7 @otaj

@SeanNaren SeanNaren added bug Something isn't working strategy: fsdp Fully Sharded Data Parallel labels May 5, 2022
@SeanNaren SeanNaren added this to the 1.7 milestone May 5, 2022
@SeanNaren SeanNaren self-assigned this May 5, 2022
@SeanNaren
Contributor Author

These tests were seriously broken 😅

There is currently an error in the test and the integration. Going to push some changes to support the subprocess launcher and make the tests fail.

@akihironitta akihironitta mentioned this pull request May 10, 2022
12 tasks
@SeanNaren
Contributor Author

After some thought, I think I've reached a conclusion:

Let's wait for 1.12 to be released and revisit this PR when tests start failing (if we GPU-test latest). Given past experience with BFloat16, I think it's best to hold off on trying to stabilize the API until the major release is out. If people end up trying to use native FSDP, we can point them to this PR for a fixed version.

@mergify mergify bot added the ready PRs ready to be merged label May 12, 2022
@Borda Borda marked this pull request as draft May 12, 2022 12:17
@sisilmehta2000
Contributor

Most of the code is gated by the PYTORCH_1_12 flag, so should we just land this before PyTorch 1.12 is released? cc @SeanNaren
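
For reference, here is a rough sketch of the kind of 1.12 gate the comment refers to. The constant name below is a hypothetical stand-in for the PYTORCH_1_12 flag (the real flag in the Lightning codebase may be named and located differently); the threshold also accepts 1.12 dev/nightly builds, matching the 1.12dev requirement from the PR description.

```python
import torch
from packaging.version import Version

# Hypothetical stand-in for the PYTORCH_1_12 flag; accepts 1.12 nightlies as well.
_TORCH_GREATER_EQUAL_1_12 = Version(torch.__version__.split("+")[0]) >= Version("1.12.0.dev0")

if _TORCH_GREATER_EQUAL_1_12:
    # Only pull in the native FSDP machinery where the 1.12 fixes are available.
    from torch.distributed.fsdp import FullyShardedDataParallel  # noqa: F401
else:
    raise ModuleNotFoundError("Native FSDP support here requires PyTorch >= 1.12 (or a 1.12 nightly).")
```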

@rohan-varma

@SeanNaren is there any plan to revisit this PR now that 1.12 has landed? It would be great to have the subprocess launcher and the additional features (mixed precision) for FSDP.

@carmocca
Contributor

@rohan-varma Absolutely! This is a high priority for the PL 1.7 release.

@carmocca carmocca added the priority: 0 High priority task label Jul 18, 2022
@SeanNaren SeanNaren marked this pull request as ready for review July 19, 2022 09:42
@SeanNaren
Contributor Author

As discussed with @carmocca offline, we should get this PR in even if we currently do not have 1.12 tests running on the GPU.

I have manually confirmed all tests pass using 1.12.

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Jul 19, 2022
Contributor

@rohitgr7 rohitgr7 left a comment

nit

@SeanNaren SeanNaren enabled auto-merge (squash) July 20, 2022 10:51
@SeanNaren SeanNaren merged commit d786985 into master Jul 20, 2022
@SeanNaren SeanNaren deleted the fix/native_fsdp branch July 20, 2022 11:32
justusschock pushed a commit that referenced this pull request Jul 21, 2022
justusschock added a commit that referenced this pull request Jul 25, 2022
* Rename GPUAccelerator to CUDAAccelerator

* Add back GPUAccelerator and deprecate it

* Remove temporary registration

* accelerator connector reroute

* accelerator_connector tests

* update enums

* lite support + tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move "gpu" support up before actual accelerator flag checks

* Stupid arguments

* fix tests

* change exception type

* fix registry test

* pre-commit

* CI: debug HPU flow (#13419)

* Update the hpu-tests.yml to pull docker from vault
* fire & sudo
* habana-gaudi-hpus
* Check the driver status on gaudi server (#13718)

Co-authored-by: arao <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akarsha Rao <[email protected]>

* Update typing-extensions requirement from <4.2.1,>=4.0.0 to >=4.0.0,<4.3.1 in /requirements (#13529)

Update typing-extensions requirement in /requirements

Updates the requirements on [typing-extensions](https://github.com/python/typing_extensions) to permit the latest version.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.0.0...4.3.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [pre-commit.ci] pre-commit suggestions (#13540)

updates:
- [github.com/psf/black: 22.3.0 → 22.6.0](psf/black@22.3.0...22.6.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [FIX] Native FSDP precision + tests (#12985)

* Simplify fetching's loader types (#13111)

* Include app templates to the lightning and app packages (#13731)

* Include app templates to the package

Co-authored-by: mansy <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>

* Fix mypy typing errors in pytorch_lightning/callbacks/model_checkpoint.py (#13617)

Co-authored-by: Carlos Mocholí <[email protected]>

* Fix typos initialize in docs (#13557)


Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>

* Fix main progress bar counter when `val_check_interval=int` and `check_val_every_n_epoch=None` (#12832)

* Fix mypy errors attributed to `pytorch_lightning.loggers.tensorboard.py` (#13688)

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>

* Fix mypy errors attributed to `pytorch_lightning.loggers.mlflow` (#13691)

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: otaj <[email protected]>

* fix mypy errors for loggers/wandb.py (#13483)


Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
Co-authored-by: Akihiro Nitta <[email protected]>

* Fix gatekeeper minimum check (#13769)

* changelog

* changelog

* fix order

* move up again

* add missing test

Co-authored-by: rohitgr7 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: arao <[email protected]>
Co-authored-by: Akarsha Rao <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sean Naren <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Mansy <[email protected]>
Co-authored-by: mansy <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Lee Jungwon <[email protected]>
Co-authored-by: Nathaniel D'Amours <[email protected]>
Co-authored-by: Justin Goheen <[email protected]>
Co-authored-by: otaj <[email protected]>
Co-authored-by: Gautier Dagan <[email protected]>
Co-authored-by: Akihiro Nitta <[email protected]>

Labels

  • bug (Something isn't working)
  • pl (Generic label for PyTorch Lightning package)
  • priority: 0 (High priority task)
  • ready (PRs ready to be merged)
  • strategy: fsdp (Fully Sharded Data Parallel)
