From @ananthsub:
How should Lightning keep its DDP override in sync with the upstream torch `DistributedDataParallel`? The two implementations have now diverged. I think this leads to a performance degradation when Lightning is used with gradient accumulation, since the `require_backward_grad_sync` attribute isn't checked before the backward pass.
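
To make the suspected cost concrete, here is a toy, pure-Python sketch (no torch involved). `ToyDDP`, `check_flag`, and `count_allreduces` are hypothetical names made up for illustration; only `require_backward_grad_sync` and `no_sync()` mirror names from upstream `DistributedDataParallel`. It assumes a wrapper that ignores the flag would all-reduce on every backward pass, which is my reading of the problem rather than something I have profiled.

```python
from contextlib import contextmanager


class ToyDDP:
    """Toy stand-in for a DDP wrapper; it only counts simulated all-reduces."""

    def __init__(self, check_flag: bool) -> None:
        # Hypothetical switch: does this wrapper honor the flag?
        self.check_flag = check_flag
        self.require_backward_grad_sync = True
        self.allreduce_calls = 0

    @contextmanager
    def no_sync(self):
        # Same idea as upstream no_sync(): disable sync, then restore the old value.
        old = self.require_backward_grad_sync
        self.require_backward_grad_sync = False
        try:
            yield
        finally:
            self.require_backward_grad_sync = old

    def backward(self) -> None:
        # A wrapper that ignores the flag syncs on every backward pass.
        if not self.check_flag or self.require_backward_grad_sync:
            self.allreduce_calls += 1


def count_allreduces(check_flag: bool, micro_batches: int = 8, accumulate: int = 4) -> int:
    """Run a gradient-accumulation loop and return the number of simulated syncs."""
    model = ToyDDP(check_flag)
    for step in range(1, micro_batches + 1):
        if step % accumulate == 0:
            model.backward()  # last micro-batch of the window: sync gradients
        else:
            with model.no_sync():
                model.backward()  # intermediate micro-batch: sync should be skipped
    return model.allreduce_calls
```

With 8 micro-batches and an accumulation window of 4, the flag-aware wrapper syncs once per window, while the one that ignores the flag syncs on every micro-batch. In a real job each extra sync would be a cross-process all-reduce over all gradients.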