🚀 Feature
Avoid unnecessary DDP synchronization when gradient_accumulation_steps > 1
Motivation
When training large models, the per-step gradient synchronization is costly, so the actual speedup from 2 GPUs ends up well below the ideal 2x (200%).
Pitch
We can use DDP's no_sync context manager to skip gradient synchronization on accumulation steps that do not call optimizer_step, so gradients are only all-reduced once per optimizer step. A sketch is shown below.
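A minimal sketch of the idea, assuming the model is already wrapped in torch.nn.parallel.DistributedDataParallel; the loop structure, loss function, and names like training_loop and accumulation_steps are illustrative, not the actual trainer code:

```python
import contextlib
import torch
import torch.nn.functional as F

def training_loop(model, dataloader, optimizer, accumulation_steps):
    # model is assumed to be wrapped in torch.nn.parallel.DistributedDataParallel.
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        is_sync_step = (step + 1) % accumulation_steps == 0
        # On pure accumulation steps, enter DDP's no_sync() so backward()
        # skips the gradient all-reduce; on the final step of the window,
        # run backward() normally so gradients are synchronized once.
        context = contextlib.nullcontext() if is_sync_step else model.no_sync()
        with context:
            loss = F.cross_entropy(model(inputs), targets)
            (loss / accumulation_steps).backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
```

Gradients computed inside no_sync() stay local and are folded into the all-reduce of the first backward() performed outside the context, so the result matches synchronizing every step while paying the communication cost only once per optimizer step.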