
outputs in training_epoch_end contain only outputs from last batch repeated #8603

Description

@stas-sl

🐛 Bug

The outputs argument passed to training_epoch_end contains only the output from the last batch, repeated once per batch. I believe this broke in 1.4.0; in 1.3.x it worked as expected.

To Reproduce

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        print(f'training_step, {batch_idx=}: {loss=}')
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def training_epoch_end(self, outputs):
        # `outputs` should contain one entry per training_step call
        print('training_epoch_end:', outputs)


dl = DataLoader(RandomDataset(32, 100), batch_size=10)

model = BoringModel()
trainer = Trainer(max_epochs=1, progress_bar_refresh_rate=0)
trainer.fit(model, dl)

This prints the last batch's loss repeated 10 times in training_epoch_end, instead of the 10 distinct per-batch losses:

training_step, batch_idx=0: loss=tensor(0.6952, grad_fn=<SumBackward0>)
training_step, batch_idx=1: loss=tensor(-18.9661, grad_fn=<SumBackward0>)
training_step, batch_idx=2: loss=tensor(-27.7834, grad_fn=<SumBackward0>)
training_step, batch_idx=3: loss=tensor(-84.3158, grad_fn=<SumBackward0>)
training_step, batch_idx=4: loss=tensor(-119.3664, grad_fn=<SumBackward0>)
training_step, batch_idx=5: loss=tensor(-138.1930, grad_fn=<SumBackward0>)
training_step, batch_idx=6: loss=tensor(-126.4004, grad_fn=<SumBackward0>)
training_step, batch_idx=7: loss=tensor(-143.7022, grad_fn=<SumBackward0>)
training_step, batch_idx=8: loss=tensor(-175.9583, grad_fn=<SumBackward0>)
training_step, batch_idx=9: loss=tensor(-161.6977, grad_fn=<SumBackward0>)

training_epoch_end: [{'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}]
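Every element of outputs equals the last step's loss. A quick check (a hypothetical diagnostic, not part of the original script) added to training_epoch_end makes the duplication explicit:

def training_epoch_end(self, outputs):
    # Hypothetical diagnostic: count distinct loss values in `outputs`.
    # With the bug present this reports 1 distinct value; 10 are expected.
    distinct = {out['loss'].item() for out in outputs}
    print(f'{len(outputs)} outputs, {len(distinct)} distinct loss values')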

Expected behavior

The outputs from all steps/batches should be available in training_epoch_end, not only the output from the last batch.
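A possible interim workaround (a sketch only, not an official fix; it assumes training_step itself still computes the correct per-batch loss, which the log above shows) is to accumulate the losses manually on the module instead of relying on the outputs argument. ManualCollectModel is a hypothetical name, building on the BoringModel defined above:

class ManualCollectModel(BoringModel):
    def __init__(self):
        super().__init__()
        self._step_losses = []

    def training_step(self, batch, batch_idx):
        loss = super().training_step(batch, batch_idx)
        # detach + clone so the stored tensor cannot alias anything the
        # training loop mutates later
        self._step_losses.append(loss.detach().clone())
        return loss

    def training_epoch_end(self, outputs):
        # ignore the (buggy) `outputs` and use the manually collected list
        print('manually collected:', self._step_losses)
        self._step_losses.clear()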

Environment

* CUDA:
	- GPU:
	- available:         False
	- version:           None
* Packages:
	- numpy:             1.18.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.8.0
	- pytorch-lightning: 1.4.0
	- tqdm:              4.47.0
* System:
	- OS:                Darwin
	- architecture:
		- 64bit
		-
	- processor:         i386
	- python:            3.8.3
	- version:           Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64

Labels

bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)
