Description
🚀 Feature
Replace the HPC/SLURM-specific checkpointing in CheckpointConnector with general checkpointing, then deprecate the HPC-specific code path.
Motivation
CheckpointConnector has an HPC/SLURM-specific checkpointing system (save/load) for auto-resubmit (doc).
Now that auto-resubmit is supported in the normal checkpointing process (#4402), HPC auto-resubmit can also be handled by this general process.
In my opinion, the HPC-specific checkpointing has served its historical purpose.
Deprecating it would let CheckpointConnector be refactored into something simpler and easier to maintain.
Pitch
Deprecate hpc_save and hpc_load, which use the hpc_ckpt_{ckpt_number}.ckpt naming convention for auto-resume/resubmit.
Use general checkpointing, which automatically attempts to resume from last.ckpt, for SLURM auto-resubmit.
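To make the two naming conventions concrete, here is a minimal standalone sketch. It is not Lightning's actual implementation: the helper names find_legacy_hpc_checkpoint and find_resume_checkpoint are hypothetical, and the only things taken from this issue are the file names hpc_ckpt_{ckpt_number}.ckpt and last.ckpt.

```python
import re
from pathlib import Path
from typing import Optional

# Old convention: hpc_ckpt_{ckpt_number}.ckpt, where the highest number is the newest.
_HPC_PATTERN = re.compile(r"^hpc_ckpt_(\d+)\.ckpt$")


def find_legacy_hpc_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Hypothetical helper: return the highest-numbered hpc_ckpt_N.ckpt, if any."""
    if not ckpt_dir.is_dir():
        return None
    best_number, best_path = -1, None
    for path in ckpt_dir.iterdir():
        match = _HPC_PATTERN.match(path.name)
        # Compare numerically so that hpc_ckpt_10 beats hpc_ckpt_2.
        if match and int(match.group(1)) > best_number:
            best_number, best_path = int(match.group(1)), path
    return best_path


def find_resume_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Hypothetical helper: general convention, resume from last.ckpt if present."""
    last = ckpt_dir / "last.ckpt"
    return last if last.is_file() else None
```

With the general convention there is a single fixed file name, so the directory scan and the number parsing go away; that is the simplification this pitch is after.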
Backward compatibility
The deprecation breaks previously generated auto-resubmit checkpoints.
However, an auto-resubmit checkpoint is generally used only over a short term.
In other words, the checkpoint is ephemeral.
In addition, hpc_save and hpc_load are internal methods (they are not part of the public API in the docs).
From this point of view, in my opinion, we can remove the old (internal) checkpointing without a deprecation warning or deprecation period.