Replace/deprecate SLURM specific checkpointing #5373

@tarepan

🚀 Feature

Replace the HPC/SLURM-specific checkpointing in CheckpointConnector with general checkpointing, then deprecate the HPC-specific path.

Motivation

CheckpointConnector has an HPC/SLURM-specific checkpointing system (save/load) for auto-resubmit (doc).
Now that auto-resubmit is supported by the normal checkpointing process (#4402), HPC auto-resubmit can also be handled by this general process.
In my opinion, the HPC-specific checkpointing has served its historical purpose.
Deprecating it would let CheckpointConnector be refactored into something simpler and easier to maintain.

Pitch

Deprecate hpc_save and hpc_load, which use the hpc_ckpt_{ckpt_number}.ckpt naming convention for auto-resume/resubmit.
Use general checkpointing, which automatically attempts to use last.ckpt, for SLURM auto-resubmit.
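
To make the difference between the two naming conventions concrete, here is a minimal sketch of how a resume path could be located under each one. The helper names `find_latest_hpc_checkpoint` and `find_last_checkpoint` are hypothetical and only for illustration; this is not the actual CheckpointConnector code, and it assumes all checkpoints sit in a single folder.

```python
import re
from pathlib import Path
from typing import Optional

# Old convention named in this issue: hpc_ckpt_{ckpt_number}.ckpt
HPC_CKPT_PATTERN = re.compile(r"^hpc_ckpt_(\d+)\.ckpt$")


def find_latest_hpc_checkpoint(folder: Path) -> Optional[Path]:
    """Hypothetical helper: return the hpc_ckpt_{n}.ckpt with the largest n, if any."""
    best_number = -1
    best_path = None
    for path in folder.glob("hpc_ckpt_*.ckpt"):
        match = HPC_CKPT_PATTERN.match(path.name)
        # Compare numerically so that hpc_ckpt_10 ranks above hpc_ckpt_2.
        if match and int(match.group(1)) > best_number:
            best_number = int(match.group(1))
            best_path = path
    return best_path


def find_last_checkpoint(folder: Path) -> Optional[Path]:
    """Hypothetical helper: return last.ckpt if it exists, else None."""
    candidate = folder / "last.ckpt"
    return candidate if candidate.is_file() else None
```

The second lookup needs no number parsing or directory scan, which is part of why relying on last.ckpt alone would simplify the connector.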

Backward compatibility

The deprecation breaks previously generated auto-resubmit checkpoints.
But auto-resubmit checkpoints are generally used only over a short term.
In other words, they are ephemeral.

Also, hpc_save and hpc_load are internal methods (not part of the public API in the docs).

From this point of view, in my opinion, we can deprecate the (internal) old checkpointing without a deprecation warning or deprecation period.

Labels: feature, help wanted, refactor
