Closed
Labels
- bug: Something isn't working
- help wanted: Open to be worked on
- priority: 1: Medium priority task
- waiting on author: Waiting on user action, correction, or update
Description
🐛 Bug
The dp backend runs with manual optimization, but the gradients appear to be incorrect, so the model can't learn anything.
To Reproduce
- Change optimization to manual in the basic GAN bolt, then change the backend to `dp`.
- Set `batch_size = 2` and compare experiments on 1 GPU vs 2 GPUs.
- With 1 GPU everything is fine, but with 2 GPUs training fails.
I haven't actually tested these exact steps yet, but I have run many experiments on my own implementations (which are too heavy to paste here and hard to extract), so I think they should reproduce the problem.
Expected behavior
Training on 2 GPUs with the dp backend should perform identically to training on 1 GPU.
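To make "identical" concrete, here is a minimal sketch in plain Python (not Lightning code) of the gradient property this expectation relies on. It assumes the dp backend splits the batch evenly across devices and that the loss is a per-sample mean. The toy one-parameter model and the names `grad_mse` and `sharded_grad` are hypothetical and only serve the illustration.

```python
# Toy one-parameter linear model y_hat = w * x with a mean squared error loss.
# Everything here is illustrative; it does not use PyTorch or Lightning.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def sharded_grad(w, xs, ys, num_shards):
    """Split the batch evenly, compute one gradient per shard, then average.

    Assumption: this mimics an even batch split across devices followed by
    a mean reduction of the per-device gradients.
    """
    size = len(xs) // num_shards
    grads = [
        grad_mse(w, xs[i * size:(i + 1) * size], ys[i * size:(i + 1) * size])
        for i in range(num_shards)
    ]
    return sum(grads) / num_shards


w = 0.5
xs = [1.0, 3.0]  # batch_size = 2, as in the reproduction steps
ys = [2.0, 0.0]

full = grad_mse(w, xs, ys)                      # "1 GPU": whole batch at once
split = sharded_grad(w, xs, ys, num_shards=2)   # "2 GPUs": one sample each

# With an even split and a mean loss, the two gradients should agree.
assert abs(full - split) < 1e-12
```

If the gradients seen by the optimizer on 2 GPUs differ from the single-GPU ones by more than floating-point noise on the same batch, that would be consistent with the behavior reported here.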
Environment
(Should be) any.
Additional context
This bug surfaced in my experiments on GANs, but it should affect any model that uses manual optimization.