Support ModelCheckpoint saving at step intervals and fractional epoch intervals #6333

@timothybrooks

🚀 Feature

Currently, ModelCheckpoint supports the period option, which specifies the epoch interval for saving checkpoints, and must be an integer. In general and especially for extremely large datasets, it would be useful to support finer control over when to save checkpoints.

Motivation

I am training models on huge datasets, where a single epoch takes so long that saving checkpoints only at the end of each epoch does not meet my needs.

Pitch

I propose that, similar to the val_check_interval training flag, ModelCheckpoint should support fractional epoch intervals, e.g. period=0.25 would indicate that a model checkpoint should be saved at each quarter of an epoch. It is sometimes preferable to specify intervals in terms of batch steps rather than epochs, so I also propose adding a parameter to support this, such as step_period, where the caller can specify the number of steps between checkpoints.
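
To make the proposed semantics concrete, here is a minimal sketch of how a fractional period could be turned into a step interval. The helper names (period_to_steps, should_save) and the batches_per_epoch argument are hypothetical and are not part of ModelCheckpoint; the rounding rule is also just one possible choice.

```python
def period_to_steps(period, batches_per_epoch):
    """Convert an epoch period (possibly fractional) into a number of batch steps.

    Hypothetical helper: period=0.25 means "every quarter of an epoch".
    The result is clamped to at least one step.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if batches_per_epoch < 1:
        raise ValueError("batches_per_epoch must be at least 1")
    return max(1, round(period * batches_per_epoch))


def should_save(completed_batches, period, batches_per_epoch):
    """Return True when a checkpoint is due.

    completed_batches counts training batches finished so far across all
    epochs, so periods larger than one epoch work as well.
    """
    interval = period_to_steps(period, batches_per_epoch)
    return completed_batches > 0 and completed_batches % interval == 0
```

With 1000 batches per epoch, period=0.25 would map to a checkpoint every 250 batches. The proposed step_period would skip the conversion and use the caller's step count as the interval directly.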

Alternatives

Users can implement custom callbacks that save checkpoints at the end of a batch step. However, it would be great to leverage all the smarts of ModelCheckpoint (such as top k logic); reimplementing that logic in a custom callback quickly becomes redundant and complex. I also believe this feature would be used commonly enough that it is worth exposing to users without requiring custom callbacks.
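
For reference, the custom-callback workaround could look roughly like the sketch below. The class name StepCheckpoint and its arguments are hypothetical. It assumes the on_train_batch_end hook and the trainer's global_step and save_checkpoint members; the exact hook signature and the point at which global_step is incremented have varied between Lightning versions, so the extra arguments are absorbed with *args and **kwargs.

```python
import os

try:
    from pytorch_lightning import Callback
except ImportError:
    # Fallback so the sketch can be read and exercised without Lightning installed.
    Callback = object


class StepCheckpoint(Callback):
    """Save a checkpoint every `every_n_steps` training steps (hypothetical name)."""

    def __init__(self, dirpath, every_n_steps):
        super().__init__()
        if every_n_steps < 1:
            raise ValueError("every_n_steps must be at least 1")
        self.dirpath = dirpath
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # global_step is assumed to count optimizer steps completed so far.
        step = trainer.global_step
        if step > 0 and step % self.every_n_steps == 0:
            path = os.path.join(self.dirpath, f"step={step}.ckpt")
            trainer.save_checkpoint(path)
```

This saves unconditionally and keeps every file. It has none of the monitoring, top k selection, or cleanup that ModelCheckpoint provides, which is the gap this request is about.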
