Correct way to use all_gather in DDP #14152
Unanswered
cnut1648
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Hello, I wonder if there is a definitive guide on how to use `all_gather` in DDP with a single dataloader, so that we can gather outputs from the model and write them to a file. I am using the latest PL, 1.7.1.

Say in `training_step` I return a dictionary with a field `generated` that is a list of dictionaries. Then in `training_epoch_end(self, outputs)` I want to collect the dictionaries from every `training_step` in every process and save the whole `List[dict]` to disk. I have the following:

I found that `outputs` without `self.all_gather` (i.e. the input of `training_epoch_end`) is a list of lists of dictionaries. Am I correct that `len(outputs)` equals the number of times `training_step` was called? Moreover, each of the dictionaries returned by `training_step` has a field that is a list of floats, but after `all_gather` this field seems to change (e.g. from `[0.24, 0.38, ...]` to `[[0.24, 0.93], [0.38, 0.84], ...]`), and I have no idea where the second float comes from.

Therefore I wonder what the best practice is for gathering outputs from all previous `training_step` calls in DDP. Is `all_gather` still the way to go, and am I using it correctly? I saw in #11019 that using `PredictionWriter` is the best practice. Is that still considered the best practice as of now?

Thank you!
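For what it's worth, the observed shape change is consistent with `all_gather` collecting the corresponding value from every rank: each scalar in the list becomes a `world_size`-long list. Below is a minimal single-process sketch that mimics this behavior; `world_size=2` and the rank-1 values (`0.93`, `0.84`) are assumptions standing in for whatever the second GPU computed.

```python
# Single-process simulation of elementwise all_gather across 2 ranks.
# rank1's values are hypothetical; in real DDP they come from the other GPU.

def all_gather_scalar(values_per_rank, index):
    """Gather the scalar at `index` from every rank into one list."""
    return [rank_values[index] for rank_values in values_per_rank]

# Per-rank lists of floats, as a training_step might return on each process.
rank0 = [0.24, 0.38]
rank1 = [0.93, 0.84]  # assumed values from the second process

# When the field is a Python list, each float is gathered separately across
# the world, so every original element turns into a world_size-long list.
gathered = [all_gather_scalar([rank0, rank1], i) for i in range(len(rank0))]
print(gathered)  # [[0.24, 0.93], [0.38, 0.84]]
```

So the "second float" in each inner list would be the same field's value on the other rank, not a corruption of your data.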