[FIX] Native FSDP precision + tests #12985
Conversation
These tests were seriously broken 😅 There are errors in both the test and the integration currently. I'm going to push some changes to support the subprocess launcher and make the tests fail.
After some thought, I've reached a conclusion: let's wait for 1.12 to be released and revisit this PR when tests start failing (if we GPU-test latest). Given past experience with BFloat16, I think it's best to hold off on trying to stabilize the API until the major release is out. If people end up trying to use native FSDP, we can point them to this PR for a fixed version.
Most of the code is gated by the PYTORCH_1_12 flag. So should we just land this before PyTorch 1.12 is released? cc @SeanNaren
@SeanNaren is there any plan to revisit this PR now that 1.12 has landed? It would be great to have the subprocess launcher and the additional features (mixed precision) for FSDP.
@rohan-varma Absolutely! This is a high priority for the PL 1.7 release.
As discussed with @carmocca offline, we should get this PR in even though we currently do not have 1.12 tests running on the GPU. I have manually confirmed that all tests pass under 1.12.
rohitgr7
left a comment
nit
src/pytorch_lightning/plugins/precision/fully_sharded_native_amp.py
What does this PR do?
Related #12334 #12447.
After the associated PRs above were merged, I tried to use native FSDP and noticed that several things were wrong.
This PR addresses 2 and 3; 1 remains a separate issue. I've run each test individually to ensure the integration works, updated it a bit, and introduced mixed-precision support, which was missing. I've also moved the requirement to 1.12dev, which will work for PyTorch nightly.
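As a rough sketch of the mixed-precision support this PR adds: a native FSDP precision plugin has to translate the Trainer's precision flag into the `param_dtype`/`reduce_dtype`/`buffer_dtype` trio that `torch.distributed.fsdp.MixedPrecision` expects. The function below is a hypothetical illustration of that mapping (dtype names are given as strings so the sketch runs without `torch`), not Lightning's actual plugin code.

```python
def fsdp_mixed_precision_dtypes(precision):
    """Map a Trainer-style precision flag to the dtype settings that
    torch.distributed.fsdp.MixedPrecision takes (illustrative only)."""
    mapping = {
        16: "torch.float16",
        "bf16": "torch.bfloat16",
        32: "torch.float32",
    }
    if precision not in mapping:
        raise ValueError(f"Unsupported precision for native FSDP: {precision!r}")
    dtype = mapping[precision]
    # FSDP applies one dtype each to parameters, gradient reduction, and buffers;
    # using the same dtype for all three is the simplest configuration.
    return {"param_dtype": dtype, "reduce_dtype": dtype, "buffer_dtype": dtype}


print(fsdp_mixed_precision_dtypes("bf16")["param_dtype"])  # torch.bfloat16
```

Raising on an unsupported flag (rather than silently falling back to full precision) is the kind of behavior the broken tests here are meant to pin down.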
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃
cc @Borda @tchaton @rohitgr7 @otaj