
Conversation

@ananthsub (Contributor) commented Aug 25, 2021

What does this PR do?

https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html
Sets torch.use_deterministic_algorithms based on the Trainer's deterministic flag (a minimal sketch of the mapping is shown below).

Fixes #9107
Fixes #9544
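
A minimal sketch of the mapping described above (illustration only; the helper name is hypothetical and the real change lives inside the Trainer setup):

```python
import torch


def _set_torch_determinism(deterministic: bool) -> None:
    # With deterministic=True, PyTorch raises a RuntimeError at runtime if an
    # operation without a deterministic implementation is executed.
    torch.use_deterministic_algorithms(deterministic)
```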

Does your PR introduce any breaking changes? If yes, please list them.

Yes. With deterministic=True, a RuntimeError will now be raised if an operation without a deterministic implementation is executed. Previously, determinism was best-effort and no error was raised.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub (Contributor, Author)

The parity test failures are related to the new flag; will take a deeper look.

@codecov (bot) commented Aug 26, 2021

Codecov Report

Merging #9121 (9ab9bbd) into master (8c9cb0c) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #9121    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         177     177            
  Lines       15456   15460     +4     
=======================================
- Hits        14317   13698   -619     
- Misses       1139    1762   +623     

@tchaton (Contributor) left a comment

LGTM !

@mergify bot added the "ready" label (PRs ready to be merged) on Aug 30, 2021
@ananthsub closed this on Sep 10, 2021
@ananthsub force-pushed the feat/torch-deterministic-algorithm branch from 454f826 to b294c57 on September 10, 2021 22:07
@ananthsub reopened this on Sep 10, 2021
@justusschock (Member)

Actually, I am not sure we can switch it that easily. Before, deterministic was more like "make it as deterministic as possible but still run it". Changing that to raising errors is quite a breaking change, even though it is closer to what one would expect.

Personally, I have a lot of code using that flag and relying on it not to fail.

@ananthsub (Contributor, Author) commented Sep 11, 2021

@justusschock I agree, this would be a breaking change. Even some of our tests had to be updated to avoid the newly raised runtime error.

I am not sure how to square this with the new expectations PyTorch sets around these deterministic checks. The new API is more comprehensive and offers stricter guarantees around reproducibility.

Options:

  • go with the current approach in this PR, where deterministic follows the strictest interpretation currently offered. This is a breaking change: runtime exceptions are now raised if deterministic implementations aren't available. To opt out, users leave the flag unset on the Trainer and instead configure the cudnn flag independently (as seen in the parity test change, and sketched after this list). The runtime error could be seen as a good thing if we're trying to provide more visibility & guarantees around determinism, but dealing with newly raised exceptions is obviously not great.
  • or ask users to call torch.use_deterministic_algorithms themselves and deal with the runtime errors, while the Trainer makes only safe changes that cannot raise them. This is cleaner from the exception-handling point of view, but it's misleading because the guarantees the Lightning Trainer can make around determinism are inherently weaker, and the framework would not be leveraging PyTorch's capabilities to the fullest.
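
A sketch of the opt-out path from the first option (assumed usage, mirroring the idea behind the parity test change rather than quoting it):

```python
import torch
from pytorch_lightning import Trainer

# Best-effort determinism via cuDNN only: ops without deterministic
# implementations still run, so no RuntimeError is raised.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

trainer = Trainer()  # the deterministic flag is left at its default (False)
```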

Are there other paths you see?

@carmocca (Contributor) commented Sep 11, 2021

Are there other paths you see?

deterministic=False: Not enabled
deterministic=True: Same as today. Prints info message about deterministic="strict" for discoverability
deterministic="strict": Uses torch.use_deterministic_algorithms

IMO torch should've provided a function or flag that doesn't raise a RuntimeError so that cudnn.benchmark can be easily replaced, since the new function also "will make other PyTorch operations [than CUDA] behave deterministically".

For context: pytorch/pytorch#15359

@carmocca (Contributor) commented Sep 11, 2021

@kurtamohler (author of use_deterministic_algorithms) Looks like the initial design included a mechanism to default to a warning, but it was eventually removed: pytorch/pytorch#38683 (comment). Where was the final decision made? Do you have any suggestions for how this PR should proceed?

Thanks :)

@kurtamohler

I'm not sure at the moment where the discussion was; I'll do some searching.

To me, it seems like we would have to add a warn-only option to use_deterministic_algorithms. Would you mind opening an issue in PyTorch and tagging me in it? Otherwise, I can open an issue next time I'm at my computer.
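
For illustration, such a warn-only mode might be used like this (the warn_only keyword is an assumption here, not an existing argument at the time of this discussion):

```python
import torch

# Hypothetical: warn instead of raising when an operation has no deterministic
# implementation, so existing code keeps running.
torch.use_deterministic_algorithms(True, warn_only=True)
```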

@ananthsub added the "breaking change" label (Includes a breaking change) on Sep 17, 2021
@awaelchli (Contributor)

+1 for the current approach. IMO, if our Trainer has a "deterministic" argument, it should do what it says, and the stricter version introduced here makes sense to me.

@tchaton (Contributor) left a comment

LGTM !

@ananthsub force-pushed the feat/torch-deterministic-algorithm branch from 346264a to e04c956 on September 25, 2021 04:32
@mergify bot removed the "has conflicts" label on Sep 25, 2021
@ananthsub force-pushed the feat/torch-deterministic-algorithm branch from 704d2da to 3b2ad55 on September 29, 2021 22:31
@mergify bot removed the "has conflicts" label on Sep 29, 2021
@ananthsub enabled auto-merge (squash) on September 29, 2021 23:23
@ananthsub merged commit 0d3325e into Lightning-AI:master on Sep 30, 2021

Labels

breaking change (Includes a breaking change), feature (Is an improvement or enhancement), ready (PRs ready to be merged)
