Conversation

@ananthsub (Contributor) commented on Oct 28, 2020

What does this PR do?

From #4301

Even when gradient accumulation is enabled with DDP, we still see significant time spent in the backward pass.
#4301 enables no_sync while accumulating gradients. However, in the backward pass we use the module inside the DDP wrapper to compute the backward. This circumvents the require_backward_grad_sync=False flag set on the wrapped DDP model, so we miss out on the gradient accumulation speedups.

https://github.com/PyTorchLightning/pytorch-lightning/blob/41de4538aa0c187793709a93875e67666c2ddde8/pytorch_lightning/trainer/connectors/model_connector.py#L54-L57

https://github.com/PyTorchLightning/pytorch-lightning/blob/41de4538aa0c187793709a93875e67666c2ddde8/pytorch_lightning/accelerators/accelerator.py#L89-L101
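For reference, this is how plain PyTorch skips gradient synchronization on accumulation steps via DDP's no_sync(). The sketch below is illustrative only and is not Lightning code; the function, batch iterable, and accumulation factor are placeholders.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative only (not Lightning code). Assumes the process group is already
# initialized (e.g. via torchrun) and that `ddp_model` wraps the user's module.
def train(ddp_model: DDP, optimizer: torch.optim.Optimizer, batches, accumulate: int = 4):
    loss_fn = nn.MSELoss()
    for step, (x, y) in enumerate(batches):
        last_of_window = (step + 1) % accumulate == 0
        if not last_of_window:
            # no_sync() sets require_backward_grad_sync=False on the wrapper,
            # so the backward for this forward skips the gradient all-reduce.
            with ddp_model.no_sync():
                loss_fn(ddp_model(x), y).backward()
        else:
            # Gradients accumulated so far are all-reduced during this backward.
            loss_fn(ddp_model(x), y).backward()
            optimizer.step()
            optimizer.zero_grad()
```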

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

   else:
       # do backward pass
-      model = self.trainer.get_model()
+      model = self.trainer.model
A Contributor commented:

Ummm... this will break the other accelerators, no? DP will be wrapped, and so will the DDP one?

@ananthsub (Contributor, Author) replied on Oct 28, 2020:

The motivating factor is skipping parameter syncs in DDP (#4301). While accumulating gradients:

  • training_step_and_backward calls the training loop's backward
  • the training loop's backward calls the accelerator's backward
  • the accelerator's backward reaches into the DP/DDP model, extracts the module inside, and calls backward on that
  • which ignores the flags (e.g. require_backward_grad_sync) set on the wrapper

How should we respect those settings in the backward pass here?
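For illustration only, a minimal sketch of the mismatch described above; the helper name is hypothetical and this is not Lightning's accelerator code.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical helper (not Lightning API): the flag toggled by no_sync() lives
# on the DDP wrapper, not on the module it wraps, so any code path that
# unwraps the model before running the backward pass can no longer tell
# whether it is in an accumulation ("skip sync") phase.
def is_skipping_grad_sync(model: nn.Module) -> bool:
    if isinstance(model, DDP):
        return not model.require_backward_grad_sync
    # An unwrapped module carries no such flag; the information is lost.
    return False
```

In plain PyTorch, DDP consults this flag during the forward pass to decide whether the subsequent backward will all-reduce gradients, which is why state kept on the wrapper needs to remain visible to the code driving the backward.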
