Correct way to use all_gather in DDP #14152
Unanswered
cnut1648
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Hello, I wonder if there is a definitive guide on how to use `all_gather` in DDP with a single dataloader, so that we can gather outputs from the model and write them to a file. I am using the latest PL, 1.7.1.

Say in `training_step` I return a dictionary with a field `generated` that is a list of dictionaries. Then in `training_epoch_end(self, outputs)` I want to collect the dictionaries from every `training_step` in every process and save the whole `List[dict]` to disk. I have the following:

I found that `outputs` without `self.all_gather` (i.e. the input of `training_epoch_end`) is a list of lists of dictionaries. Am I correct that `len(outputs)` equals the number of times `training_step` was called? Moreover, each of the dictionaries returned by `training_step` has a field that is a list of floats, but after `all_gather` this field seems to change (e.g. from `[0.24, 0.38, ...]` to `[[0.24, 0.93], [0.38, 0.84], ...]`), and I have no idea where the second float comes from.

Therefore I wonder what the best practice is for gathering outputs from all previous `training_step` calls in DDP. Is `all_gather` still the way to go, and am I using it correctly? I saw in #11019 that using `PredictionWriter` is the best practice. Is that still considered the best practice as of now?

Thank you!
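For what it's worth, the observed shape change is consistent with `all_gather` collecting the corresponding value from every rank: each scalar in the list becomes a `world_size`-long list. Below is a minimal single-process sketch that mimics this behavior; `world_size=2` and the rank-1 values (`0.93`, `0.84`) are assumptions standing in for whatever the second GPU computed.

```python
# Single-process simulation of elementwise all_gather across 2 ranks.
# rank1's values are hypothetical; in real DDP they come from the other GPU.

def all_gather_scalar(values_per_rank, index):
    """Gather the scalar at `index` from every rank into one list."""
    return [rank_values[index] for rank_values in values_per_rank]

# Per-rank lists of floats, as a training_step might return on each process.
rank0 = [0.24, 0.38]
rank1 = [0.93, 0.84]  # assumed values from the second process

# When the field is a Python list, each float is gathered separately across
# the world, so every original element turns into a world_size-long list.
gathered = [all_gather_scalar([rank0, rank1], i) for i in range(len(rank0))]
print(gathered)  # [[0.24, 0.93], [0.38, 0.84]]
```

So the "second float" in each inner list would be the same field's value on the other rank, not a corruption of your data.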