Skip to content

OOM issues with loading large model checkpoints w/ FSDP after checkpoint refactor #8043

@mleshen

Description

@mleshen

🐛 Bug

In #7928 the trainer logic was modified to restore the model state from the checkpoint connector instead of from the training type plugin and restore_model_from_ckpt_path was split into three new modular APIs. For our use case we overrode restore_model_from_ckpt_path in the FSDP plugin to prevent CPU OOMs, and now that the functionality for restoring the model state has been offloaded to the checkpoint, we run into OOMs again.

In #7509 it was proposed to solve this problem on the level of trainer — comment suggests offloading responsibility to training_type_plugin since this is not widely required outside of DDP and its derivatives, but restoring model state functionality no longer belongs to the plugin. Could we add some more memory-friendly logic to the checkpoint connector in case of multiple workers?

Please reproduce using the BoringModel

To Reproduce

Use following BoringModel and post here

Expected behavior

Environment

Note: Bugs with code are solved faster ! Colab Notebook should be made public !

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Metadata

Metadata

Assignees

Labels

3rd partyRelated to a 3rd-partybugSomething isn't workingcheckpointingRelated to checkpointinghelp wantedOpen to be worked on

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions