Change accelerator in backward to use DDP-wrapped model #4415
What does this PR do?
From #4301
Even when gradient accumulation is enabled with DDP, we still see significant time spent in the backward pass.

#4301 enables `no_sync` when accumulating gradients. However, in the backward pass we use the module inside of DDP for computing the backward. This circumvents the `require_backward_grad_sync=False` set on the wrapped DDP model, so we miss out on the gradient accumulation speedups.

https://github.com/PyTorchLightning/pytorch-lightning/blob/41de4538aa0c187793709a93875e67666c2ddde8/pytorch_lightning/trainer/connectors/model_connector.py#L54-L57

https://github.com/PyTorchLightning/pytorch-lightning/blob/41de4538aa0c187793709a93875e67666c2ddde8/pytorch_lightning/accelerators/accelerator.py#L89-L101
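For context on the fix, here is a minimal sketch (not Lightning's actual accelerator code) of how DDP gradient accumulation is expected to work: `no_sync()` sets `require_backward_grad_sync=False` on the DDP wrapper, so the forward/backward for accumulated micro-batches has to run through the wrapped model for the all-reduce to be skipped. The helper name and arguments below are illustrative only.

```python
from torch.nn.parallel import DistributedDataParallel as DDP


def accumulation_step(ddp_model: DDP, optimizer, batch, loss_fn, is_last_micro_batch: bool):
    """Illustrative micro-batch step for DDP gradient accumulation (not a Lightning API)."""
    x, y = batch
    if not is_last_micro_batch:
        # no_sync() flips require_backward_grad_sync to False on the DDP
        # wrapper, so this backward skips the gradient all-reduce. As noted
        # above, if the forward/backward path uses the inner module
        # (ddp_model.module) instead of the wrapper, the flag is circumvented
        # and the accumulation speedup is lost.
        with ddp_model.no_sync():
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # gradients accumulate locally, no communication
    else:
        # Last micro-batch of the accumulation window: gradients are
        # all-reduced across ranks as usual, then we step the optimizer.
        loss = loss_fn(ddp_model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return loss
```

The relevant detail for this PR is that the flag lives on the wrapper, so the object the accelerator uses in `backward` needs to be the DDP-wrapped model rather than the module it wraps.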
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃