
DeepSpeed: Multi-GPU training doesn't seem to converge with accumulate_grad_batches > 1 #8058

@tchaton

Description

Discussed in #8019

Originally posted by thomas-happify June 17, 2021

from pytorch_lightning import Trainer, seed_everything

seed_everything(43, workers=True)

# gpus=1 with 16 accumulation steps: effective batch = 16 x per-device batch
Trainer(gpus=1, accumulate_grad_batches=16, accelerator='ddp', plugins='deepspeed', max_epochs=20)
# gpus=2 with 8 accumulation steps: effective batch = 2 x 8 = 16 x per-device batch
Trainer(gpus=2, accumulate_grad_batches=8, accelerator='ddp', plugins='deepspeed', max_epochs=20)

Shouldn't these two configurations give similar training results?
When I use gpus=2 and accumulate_grad_batches=8, the model can't converge.
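
For reference, a minimal sketch of the effective-batch-size arithmetic behind the question; per_device_batch is a hypothetical value for illustration, not taken from the original post:

# Effective optimizer batch = per-device batch size * number of GPUs * accumulation steps.
per_device_batch = 4  # hypothetical value, not from the original post

single_gpu = per_device_batch * 1 * 16  # gpus=1, accumulate_grad_batches=16 -> 64
two_gpus   = per_device_batch * 2 * 8   # gpus=2, accumulate_grad_batches=8  -> 64

assert single_gpu == two_gpus  # both setups step the optimizer on the same effective batch

Under this arithmetic the two configurations are expected to be equivalent, which is why the reported divergence with gpus=2 is surprising.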

Labels

distributed (generic distributed-related topic), waiting on author (waiting on user action, correction, or update), working as intended
