Description
🐛 Bug
In the DDPSpawn / TPUSpawn plugin we transfer the weights from rank 0 back to the main process. To do this, we save a checkpoint of the latest model weights and then load it in the main process. The file name is derived from the checkpoint callback's best_model_path.
This does not affect users directly as long as they ignore the saved file. However, the file name does not reflect the file's contents, because the latest weights are not always the best!
Furthermore, the temp file never gets deleted.
To Reproduce
Run the boring model with Trainer(strategy="ddp_spawn", devices=2). The checkpoint directory will contain the file:
epoch=0-step=7.tmp_end.ckpt
Expected behavior
The filename is not based on the "best model path", and the file gets deleted after the state has been loaded in the main process.
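The expected behavior could look roughly like the sketch below. It is not the plugin's actual code: the helper names save_state_for_main_process and load_state_and_cleanup and the spawn_weights_ prefix are hypothetical, and pickle stands in for torch.save / torch.load so the example runs with the standard library only. The point is that the temp file gets a neutral, unique name unrelated to best_model_path, and that the main process removes it once the state has been read.

```python
import os
import pickle
import tempfile


def save_state_for_main_process(state, dirpath):
    """Worker side (hypothetical helper): dump `state` to a uniquely named temp file.

    The name comes from tempfile.mkstemp, not from the checkpoint
    callback's best_model_path.
    """
    fd, path = tempfile.mkstemp(prefix="spawn_weights_", suffix=".ckpt", dir=dirpath)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)  # stand-in for torch.save (assumption)
    return path


def load_state_and_cleanup(path):
    """Main-process side (hypothetical helper): load the state, then delete the file."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)  # stand-in for torch.load (assumption)
    finally:
        # Remove the temp file even if loading fails, so nothing is left behind.
        os.remove(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        tmp_path = save_state_for_main_process({"layer.weight": [0.1, 0.2]}, ckpt_dir)
        restored = load_state_and_cleanup(tmp_path)
        print(restored, os.listdir(ckpt_dir))
```

In a real spawn setup the path would have to be passed from the rank 0 worker back to the main process (for example through a multiprocessing queue); that part is left out here.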
Additional context
Found during debugging in #10896.
A PR for this fix is in the works.
cc @awaelchli @ananthsub @ninginthecloud @justusschock @kaushikb11