Description
🐛 Bug
In #7928 the trainer logic was modified to restore the model state from the checkpoint connector instead of from the training type plugin, and restore_model_from_ckpt_path was split into three new modular APIs. For our use case we had overridden restore_model_from_ckpt_path in the FSDP plugin to prevent CPU OOMs; now that responsibility for restoring the model state has moved to the checkpoint connector, we run into OOMs again.
In #7509 it was proposed to solve this problem at the trainer level; a comment there suggests offloading the responsibility to the training_type_plugin, since this is not widely required outside of DDP and its derivatives, but restoring the model state no longer belongs to the plugin. Could we add some more memory-friendly logic to the checkpoint connector for the multi-worker case? A rough sketch of the kind of logic we mean follows.
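For illustration only, here is a minimal sketch of a memory-friendly restore. It assumes torch.distributed is already initialized; load_state_sequentially is a hypothetical helper, not an existing Lightning hook, and in practice the logic would live in the checkpoint connector or the plugin. The idea is simply that ranks take turns loading, so at most one full CPU copy of the checkpoint exists per node at any moment.

```python
import torch
import torch.distributed as dist


def load_state_sequentially(model, checkpoint_path):
    # Hypothetical helper (not part of Lightning): ranks take turns loading
    # the full checkpoint on CPU, copy the weights into their model, and
    # free the CPU copy before the next rank starts, so at most one full
    # state dict is resident in host memory at any moment.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    for turn in range(world_size):
        if rank == turn:
            checkpoint = torch.load(checkpoint_path, map_location="cpu")
            model.load_state_dict(checkpoint["state_dict"])
            del checkpoint  # release the full CPU copy before the next rank loads
        dist.barrier()
```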
Please reproduce using the BoringModel
To Reproduce
Use the following BoringModel and post here
Expected behavior
Environment
Note: Bugs with code are solved faster! Colab Notebook should be made public!
- IDE: Please use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: