🚀 Feature
Avoid unnecessary DDP synchronization when gradient_accumulation_steps > 1
Motivation
When training large models, the per-step gradient synchronization is costly, so the actual speedup from 2 GPUs ends up well below the ideal 2x (200%).
Pitch
We can use DDP's no_sync context manager to skip gradient synchronization on accumulation steps that do not call optimizer_step, so gradients are only all-reduced once per optimizer step. A sketch is shown below.
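A minimal sketch of the idea, assuming the model is already wrapped in torch.nn.parallel.DistributedDataParallel; the loop structure, loss function, and names like training_loop and accumulation_steps are illustrative, not the actual trainer code:

```python
import contextlib
import torch
import torch.nn.functional as F

def training_loop(model, dataloader, optimizer, accumulation_steps):
    # model is assumed to be wrapped in torch.nn.parallel.DistributedDataParallel.
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        is_sync_step = (step + 1) % accumulation_steps == 0
        # On pure accumulation steps, enter DDP's no_sync() so backward()
        # skips the gradient all-reduce; on the final step of the window,
        # run backward() normally so gradients are synchronized once.
        context = contextlib.nullcontext() if is_sync_step else model.no_sync()
        with context:
            loss = F.cross_entropy(model(inputs), targets)
            (loss / accumulation_steps).backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
```

Gradients computed inside no_sync() stay local and are folded into the all-reduce of the first backward() performed outside the context, so the result matches synchronizing every step while paying the communication cost only once per optimizer step.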