default EarlyStopping callback should not fail on missing val_loss data #524

@colllin

Describe the bug
My training script failed overnight. This is the last thing I see in the logs before the instance shut down:

python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: avg_val_loss_total,avg_val_jacc10,avg_val_ce
  RuntimeWarning)

It seems like this was intended to be a warning, but it appears to have interrupted my training script. Is that possible, or could it be something else? I had 2 training scripts running on 2 different instances last night, and both shut down this way, with this RuntimeWarning as the last line in the logs. Could the default EarlyStopping callback have killed my script because I didn't log a `val_loss` tensor somewhere it could find it?

To be clear, I don't intend to use EarlyStopping at all, so I was quite surprised to wake up today and find my instance shut down and training interrupted, with no clear sign of a bug on my end. Did you intend this to interrupt the trainer? If so, how do we feel about changing that so the default EarlyStopping callback has no effect when it can't find a `val_loss` metric?
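To make the proposed behavior concrete, here is a rough sketch of what I have in mind. It is a standalone stand-in, not the actual `pt_callbacks.py` code: the class name `LenientEarlyStopping`, its `on_epoch_end(epoch, logs)` signature, and the `patience` / `min_delta` parameters are my own assumptions for illustration. The point is only that a missing metric triggers the warning and never requests a stop.

```python
import warnings


class LenientEarlyStopping:
    """Hypothetical stand-in for an early-stopping callback (not the library class)."""

    def __init__(self, monitor="val_loss", patience=3, min_delta=0.0):
        self.monitor = monitor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")  # assumes a lower metric value is better
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        """Return True if training should stop, False otherwise."""
        logs = logs or {}
        current = logs.get(self.monitor)
        if current is None:
            # Missing metric: warn, but never request a stop.
            warnings.warn(
                f"Early stopping conditioned on metric `{self.monitor}` "
                f"which is not available. Available metrics are: "
                f"{','.join(logs.keys())}",
                RuntimeWarning,
            )
            return False
        if current < self.best - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best = current
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


# Metrics named as in my logs above; the values are made up.
my_logs = {"avg_val_loss_total": 0.9, "avg_val_jacc10": 0.4, "avg_val_ce": 0.5}

callback = LenientEarlyStopping(patience=2)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    stops_without_metric = [callback.on_epoch_end(e, my_logs) for e in range(5)]

# With `val_loss` present and not improving, stopping still works as usual.
callback = LenientEarlyStopping(patience=2)
stops_with_metric = [callback.on_epoch_end(e, {"val_loss": 1.0}) for e in range(3)]
```

With this sketch, a run that never logs `val_loss` keeps training for every epoch, while a run that does log it still stops once the metric has failed to improve for `patience` epochs.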

Labels

bug
