
Training fails at the end of the epoch when returning None in the training step #7544

@TommasoBendinelli

🐛 Bug

Sometimes my training loss in a batch is NaN. Hence, I return None as the loss so that the model will not backpropagate through that batch, as suggested in #4956. This works fine during the epoch; however, the code fails at the end of the epoch in the function reduce_across_time (line 532):

    if isinstance(value, list):
        value = torch.tensor(value)

When the loss was None, value equals [None], and torch cannot build a tensor out of it (*** RuntimeError: Could not infer dtype of NoneType).
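
The failure reproduces in isolation with a plain torch call; the snippet below is a minimal sketch of what the quoted lines end up doing once a batch has been skipped (not the actual Lightning code path, just the same operation):

    import torch

    # After a skipped batch, the collected epoch outputs contain None,
    # so the reduction effectively does the following:
    value = [None]
    if isinstance(value, list):
        value = torch.tensor(value)  # RuntimeError: Could not infer dtype of NoneType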

Am I doing something wrong, or is this a bug in Lightning? Is there any workaround?
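
For context, my training_step follows roughly this pattern (a simplified sketch; the model, data shapes, and loss computation are placeholders, not my actual code):

    import torch
    import pytorch_lightning as pl

    class MyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.mse_loss(self(x), y)
            # Skip backpropagation for this batch when the loss is NaN,
            # as suggested in #4956.
            if torch.isnan(loss):
                return None
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())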

Environment
pytorch-lightning 1.3.1
torch 1.8.1+cu11
python 3.7.9

Labels

bug (Something isn't working), help wanted (Open to be worked on)
