
DeepSpeed: Multi-GPU training doesn't seem to converge with accumulate_grad_batches > 1 #8058

@tchaton

Description

Discussed in #8019

Originally posted by thomas-happify June 17, 2021

from pytorch_lightning import Trainer, seed_everything

seed_everything(43, workers=True)

# gpus=1 with 16 accumulation steps: effective batch = 16 x per-device batch
Trainer(gpus=1, accumulate_grad_batches=16, accelerator='ddp', plugins='deepspeed', max_epochs=20)
# gpus=2 with 8 accumulation steps: effective batch = 2 x 8 = 16 x per-device batch
Trainer(gpus=2, accumulate_grad_batches=8, accelerator='ddp', plugins='deepspeed', max_epochs=20)

Shouldn't these two configurations give similar training results?
When I use gpus=2 and accumulate_grad_batches=8, the model can't converge.
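
For reference, a minimal sketch of the effective-batch-size arithmetic behind the question; per_device_batch is a hypothetical value for illustration, not taken from the original post:

# Effective optimizer batch = per-device batch size * number of GPUs * accumulation steps.
per_device_batch = 4  # hypothetical value, not from the original post

single_gpu = per_device_batch * 1 * 16  # gpus=1, accumulate_grad_batches=16 -> 64
two_gpus   = per_device_batch * 2 * 8   # gpus=2, accumulate_grad_batches=8  -> 64

assert single_gpu == two_gpus  # both setups step the optimizer on the same effective batch

Under this arithmetic the two configurations are expected to be equivalent, which is why the reported divergence with gpus=2 is surprising.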

Labels

distributed (generic distributed-related topic), waiting on author (waiting on user action, correction, or update), working as intended
