
DDPSpawnPlugin generates a file based on the "best model path" #10933

@awaelchli

🐛 Bug

In the DDPSpawn / TPUSpawn plugin we transfer the weights from rank 0 back to the main process. To do this, we save a checkpoint of the latest model weights and then load it in the main process. The file name is determined based on the checkpoint callback's best_model_path:

https://github.com/PyTorchLightning/pytorch-lightning/blob/a28b4cd0c0bba30c21cae571e650877f66cf5588/pytorch_lightning/plugins/training_type/ddp_spawn.py#L259-L261
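To make the naming problem concrete, here is a minimal sketch of how such a file name can be derived. It is inferred from the file name shown in this report, not copied from the linked lines, and `tmp_end_path` is a hypothetical helper that is not part of Lightning.

```python
import re


def tmp_end_path(best_model_path: str) -> str:
    """Hypothetical helper: derive the temp file name from the best model path.

    Assumption: the ".ckpt" suffix is swapped for ".tmp_end.ckpt", which
    matches the file name reported in this issue.
    """
    return re.sub(r"\.ckpt$", ".tmp_end.ckpt", best_model_path)


# The resulting name advertises "epoch=0-step=7", although the file would
# hold whatever weights the model had when training ended.
print(tmp_end_path("checkpoints/epoch=0-step=7.ckpt"))
```

Because the name is borrowed from the best checkpoint, nothing in it tells a reader that the contents are the final weights.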

This does not affect users directly as long as they ignore the saved file. However, the file name does not reflect the file's contents: the name is derived from the best model path, while the file holds the latest weights, which may not be the best!

Furthermore, the temp file never gets deleted.

To Reproduce

Run the boring model with Trainer(strategy="ddp_spawn", devices=2). The checkpoint directory will contain a file like:
epoch=0-step=7.tmp_end.ckpt

Expected behavior

The filename is not based on the "best model path", and the file gets deleted after the state has been loaded in the main process.
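One possible shape for this behavior, sketched with the standard library only. This is not the actual fix: `save_last_weights` and `load_and_cleanup` are hypothetical names, `pickle` stands in for the real checkpoint I/O, and both steps run in one process here, whereas in the plugin the path would be handed from rank 0 to the main process.

```python
import os
import pickle
import tempfile


def save_last_weights(state: dict) -> str:
    """Worker side (hypothetical): write the state to a neutral temp file."""
    # mkstemp picks a unique name that has nothing to do with best_model_path.
    fd, path = tempfile.mkstemp(suffix=".ckpt")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    return path


def load_and_cleanup(path: str) -> dict:
    """Main-process side (hypothetical): load the state, then delete the file."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        # Remove the temp file even if loading fails.
        os.remove(path)


path = save_last_weights({"layer.weight": [0.1, 0.2]})
state = load_and_cleanup(path)
```

After `load_and_cleanup` returns, no temporary file is left behind in the checkpoint directory or anywhere else.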

Additional context

Found during debugging in #10896.
A PR for this fix is in the works.

cc @awaelchli @ananthsub @ninginthecloud @justusschock @kaushikb11

Labels

bug (Something isn't working), checkpointing (Related to checkpointing), strategy: ddp (DistributedDataParallel)
