Description
When training or validating on 2 nodes (8 GPUs per node), Lightning shows 16 copies of the same progress bar, each reporting a different loss, for example:
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.122, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=9.858, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.225, v_num=193413]
...
In other words, the output contains 16 progress bars when training on 16 GPUs. I assume that each GPU receives different training samples, which would explain why the bars report different losses. The same behavior also shows up during validation, so I am wondering whether different samples are distributed to each GPU during validation as well.
I use a map-style Dataset for training and an IterableDataset for validation.
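As a side note on the duplicated bars: with multi-node launchers, each of the 16 processes writes its own tqdm bar to stdout, which is what produces the repeated lines above. A minimal sketch of restricting console output to the rank-0 process, assuming the launcher exports a RANK or LOCAL_RANK environment variable (as torch.distributed launchers typically do):

```python
import os

def is_rank_zero() -> bool:
    # RANK is the global rank across all nodes; LOCAL_RANK is the rank
    # within one node. Fall back to "0" so single-process runs still print.
    return int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", "0"))) == 0

if is_rank_zero():
    print("only the rank-0 process reports progress")
```

Lightning applies the same idea internally via its rank-zero utilities, but the check above is the underlying mechanism.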
Training: 0it [00:00, ?it/s]
Training: 0%| | 0/35 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/35 [00:00<?, ?it/s]
Epoch 0: 3%|▎ | 1/35 [00:08<04:50, 8.54s/it]
Epoch 0: 3%|▎ | 1/35 [00:08<04:50, 8.54s/it, loss=10.457, v_num=193413]
Epoch 0: 6%|▌ | 2/35 [00:11<03:06, 5.66s/it, loss=10.457, v_num=193413]
Epoch 0: 6%|▌ | 2/35 [00:11<03:06, 5.66s/it, loss=10.122, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.122, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=9.858, v_num=193413]
Epoch 0: 11%|█▏ | 4/35 [00:16<02:09, 4.19s/it, loss=9.858, v_num=193413]
Epoch 0: 11%|█▏ | 4/35 [00:16<02:09, 4.19s/it, loss=9.626, v_num=193413]
Epoch 0: 14%|█▍ | 5/35 [00:19<01:57, 3.90s/it, loss=9.626, v_num=193413]
Epoch 0: 14%|█▍ | 5/35 [00:19<01:57, 3.90s/it, loss=9.432, v_num=193413]
Epoch 0: 17%|█▋ | 6/35 [00:22<01:47, 3.71s/it, loss=9.43
Training: 0it [00:00, ?it/s]
Training: 0%| | 0/35 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/35 [00:00<?, ?it/s]
Epoch 0: 3%|▎ | 1/35 [00:08<04:50, 8.54s/it]
Epoch 0: 3%|▎ | 1/35 [00:08<04:50, 8.54s/it, loss=10.468, v_num=193413]
Epoch 0: 6%|▌ | 2/35 [00:11<03:06, 5.66s/it, loss=10.468, v_num=193413]
Epoch 0: 6%|▌ | 2/35 [00:11<03:06, 5.66s/it, loss=10.122, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.122, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=9.846, v_num=193413]
Epoch 0: 11%|█▏ | 4/35 [00:16<02:09, 4.19s/it, loss=9.846, v_num=193413]
Epoch 0: 11%|█▏ | 4/35 [00:16<02:09, 4.19s/it, loss=9.638, v_num=193413]
Epoch 0: 14%|█▍ | 5/35 [00:19<01:57, 3.90s/it, loss=9.638, v_num=193413]
Epoch 0: 14%|█▍ | 5/35 [00:19<01:57, 3.90s/it, loss=9.465, v_num=193413]
Epoch 0: 17%|█▋ | 6/35 [00:22<01:47, 3.71s/it, loss=9.46
Training: 0it [00:00, ?it/s]
Training: 0%| | 0/35 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/35 [00:00<?, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 1/30738 [00:01<16:11:25, 1.90s/it]
Epoch 0: : 36it [02:15, 3.76s/it, loss=7.437, v_num=193413]
Validating: 0%| | 1/30738 [00:01<16:08:01, 1.89s/it]
Epoch 0: : 36it [02:15, 3.76s/it, loss=7.469, v_num=193413]
Validating: 0%| | 1/30738 [00:01<16:08:42, 1.89s/it]
Epoch 0: : 36it [02:15, 3.76s/it, loss=7.447, v_num=193413]
Validating: 0%| | 1/30738 [00:01<16:08:19, 1.89s/it]
Epoch 0: : 36it [02:15, 3.76s/it, loss=7.470, v_num=193413]
Validating: 0%| | 1/30738 [00:01<16:12:26, 1.90s/it]
Epoch 0: : 36it [02:15, 3.76s/it, loss=7.443, v_num=193413]
Validating: 0%| | 1/30738 [00:01<16:15:56, 1.91s/it]
Epoch 0: : 36it [02:15, 3.76s/it, loss=7.438, v_num=193413]
Validating: 0%| | 1/30738 [00:01<16:14:04, 1.90s/it]
Epoch 0: : 36it [02:15, 3.76s/it, loss=7.433, v_num=193413]
import torch
from torch.utils.data import Dataset, IterableDataset
from tqdm import tqdm


class Train_Dataset(Dataset):
    def __init__(self, filepath):
        # All training examples are loaded into memory at once.
        self.examples = torch.load(filepath)
        self.len = len(self.examples)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        return self.examples[idx]


class Val_Dataset(IterableDataset):
    def __init__(self, filename_list):
        self.filename_list = filename_list
        # Pre-compute the total length by loading each file once.
        self.len = 0
        for file_path in tqdm(self.filename_list):
            self.len += len(torch.load(file_path))

    def __len__(self):
        return self.len

    def __iter__(self):
        # Stream examples file by file. Note there is no per-rank sharding
        # here, so every process iterates the full validation set.
        for file_path in self.filename_list:
            for x in torch.load(file_path):
                yield x
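This likely explains the validation behavior: a DistributedSampler can only be attached to a map-style Dataset, so with an IterableDataset every one of the 16 processes iterates all 30738 validation examples. A minimal sketch of manual sharding by global rank, assuming rank and world_size are obtained from the launcher (e.g. torch.distributed.get_rank()); ShardedIterable is a hypothetical name, not a Lightning API:

```python
from torch.utils.data import IterableDataset

class ShardedIterable(IterableDataset):
    """Round-robin shard: process `rank` yields every `world_size`-th
    example starting at offset `rank`, so shards are disjoint."""

    def __init__(self, examples, rank: int, world_size: int):
        self.examples = examples
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        for i, x in enumerate(self.examples):
            if i % self.world_size == self.rank:
                yield x
```

With world_size=16, each GPU would then validate roughly 30738 / 16 ≈ 1921 examples instead of the full set, and the per-rank progress bars would reflect that.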