Closed
Labels
bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
Description
🐛 Bug
outputs in training_epoch_end contains only the output from the last batch, repeated multiple times. I believe this broke in 1.4.0; in 1.3.x it worked.
To Reproduce
import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        print(f'training_step, {batch_idx=}: {loss=}')
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def training_epoch_end(self, outputs):
        print('training_epoch_end:', outputs)


dl = DataLoader(RandomDataset(32, 100), batch_size=10)
model = BoringModel()
trainer = Trainer(max_epochs=1, progress_bar_refresh_rate=0)
trainer.fit(model, dl)
This prints the same loss (equal to the last batch's loss) repeated 10 times in training_epoch_end:
training_step, batch_idx=0: loss=tensor(0.6952, grad_fn=<SumBackward0>)
training_step, batch_idx=1: loss=tensor(-18.9661, grad_fn=<SumBackward0>)
training_step, batch_idx=2: loss=tensor(-27.7834, grad_fn=<SumBackward0>)
training_step, batch_idx=3: loss=tensor(-84.3158, grad_fn=<SumBackward0>)
training_step, batch_idx=4: loss=tensor(-119.3664, grad_fn=<SumBackward0>)
training_step, batch_idx=5: loss=tensor(-138.1930, grad_fn=<SumBackward0>)
training_step, batch_idx=6: loss=tensor(-126.4004, grad_fn=<SumBackward0>)
training_step, batch_idx=7: loss=tensor(-143.7022, grad_fn=<SumBackward0>)
training_step, batch_idx=8: loss=tensor(-175.9583, grad_fn=<SumBackward0>)
training_step, batch_idx=9: loss=tensor(-161.6977, grad_fn=<SumBackward0>)
training_epoch_end: [{'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}]
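The root cause in Lightning 1.4.0 is not pinned down here, but the symptom is what you get whenever an epoch-level list ends up holding references to one object that is overwritten each step. A minimal pure-Python illustration of that failure mode (not Lightning internals):

```python
# If the per-epoch list stores references to a single shared dict that is
# mutated in place every step, all entries alias the same object and show
# the last step's value. Appending a fresh dict per step preserves each value.
shared = {}
broken = []
fixed = []
for step, loss in enumerate([0.5, 0.3, 0.1]):
    shared["loss"] = loss          # same dict mutated in place each step
    broken.append(shared)          # every entry aliases the one dict
    fixed.append({"loss": loss})   # new dict per step keeps its own value

print(broken)  # [{'loss': 0.1}, {'loss': 0.1}, {'loss': 0.1}]
print(fixed)   # [{'loss': 0.5}, {'loss': 0.3}, {'loss': 0.1}]
```

This matches the log above: ten dicts, all equal to the last step's loss.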
Expected behavior
Outputs from all steps/batches are available in training_epoch_end, not only the output from the last batch.
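Until this is fixed, a possible workaround is to accumulate per-step outputs manually on the module instead of relying on the outputs argument. The sketch below is a pure-Python mock of the hook sequence (ManualCollector and its method names are illustrative, not Lightning API; in real code these would be methods on the LightningModule, appending loss.detach() values):

```python
# Hypothetical workaround sketch: the module keeps its own list of
# per-step values, so epoch-end aggregation does not depend on the
# (buggy) outputs argument.
class ManualCollector:
    def __init__(self):
        self.step_outputs = []

    def training_step(self, loss):
        # store a plain copy so later mutation/reuse cannot alias it
        self.step_outputs.append(float(loss))
        return loss

    def training_epoch_end(self):
        collected = list(self.step_outputs)
        self.step_outputs.clear()  # reset for the next epoch
        return collected


m = ManualCollector()
for loss in [0.7, -18.9, -27.8]:
    m.training_step(loss)
print(m.training_epoch_end())  # [0.7, -18.9, -27.8]
```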
Environment
* CUDA:
- GPU:
- available: False
- version: None
* Packages:
- numpy: 1.18.5
- pyTorch_debug: False
- pyTorch_version: 1.8.0
- pytorch-lightning: 1.4.0
- tqdm: 4.47.0
* System:
- OS: Darwin
- architecture: 64bit
- processor: i386
- python: 3.8.3
- version: Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64