Unable to load Checkpoint  #20192

@JaLnYn

Bug description

I am training a model with a custom checkpoint save path. Checkpoints are saved correctly; however, trainer.fit(model, ckpt_path="last") is unable to find any checkpoint.

    checkpoint_callback = ModelCheckpoint(
        monitor='total_loss',
        dirpath=config["training"]["model_path"]+"/"+config["training"]["name"],
        filename='portrait-{epoch:02d}-{total_loss:.2f}',
        save_top_k=2,
        mode='min',
    )

    # latest_checkpoint = find_latest_checkpoint(config["training"]["model_path"])

    trainer = pl.Trainer(
        default_root_dir=config["training"]["model_path"]+"/"+config["training"]["name"],
        max_epochs=config["training"]["num_epochs"],
        devices=-1 if torch.cuda.is_available() else 0,
        accelerator="gpu" if torch.cuda.is_available() else None,
        strategy='ddp_find_unused_parameters_true',
        callbacks=[checkpoint_callback],
    )

    model = PortraitTrainer(config)

    trainer.fit(model, ckpt_path="last")

I have tried the following, with default_root_dir set as above:

  1. Specifying the exact path starting from os.getcwd(), i.e. config["training"]["model_path"]+"/"+config["training"]["name"] + "some checkpoint", works.
  2. Copying the checkpoint into os.getcwd() and calling trainer.fit(model, ckpt_path="some checkpoint") works.
  3. Calling trainer.fit(model, ckpt_path="some checkpoint") without copying the checkpoint into the working directory does not work.

I believe either default_root_dir is not being set correctly, or fit is not prepending the root directory to ckpt_path.
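Since passing an explicit path does work, one possible workaround is to resolve the checkpoint path by hand before calling fit. The snippet above references a commented-out find_latest_checkpoint whose body is not shown; the following is only a minimal sketch of what such a helper could look like, assuming "latest" means the .ckpt file with the most recent modification time anywhere under the given directory. It is not part of Lightning.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the absolute path of the newest .ckpt file under ckpt_dir.

    Hypothetical helper (not a Lightning API). Returns None when the
    directory does not exist or contains no .ckpt files.
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None
    # Search recursively so checkpoints in nested run folders are found too.
    candidates = list(root.rglob("*.ckpt"))
    if not candidates:
        return None
    # Assumption: the most recently modified file is the one to resume from.
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    return str(newest.resolve())


# Possible usage with the names from the report:
# ckpt_dir = config["training"]["model_path"] + "/" + config["training"]["name"]
# trainer.fit(model, ckpt_path=find_latest_checkpoint(ckpt_dir))
```

Because the helper returns an absolute path, the result does not depend on the current working directory. When it returns None, ckpt_path=None makes fit start training from scratch.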

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

No error is raised; however, ckpt_path="last" does not resume from the checkpoint.

Environment

  • CUDA:
    - GPU: NVIDIA GeForce RTX 3090
    - available: True
    - version: 12.1
  • Lightning:
    - facenet-pytorch: 2.5.3
    - lightning: 2.4.0
    - lightning-utilities: 0.11.6
    - pytorch-lightning: 2.4.0
    - torch: 2.3.1
    - torchaudio: 2.3.1
    - torchmetrics: 1.4.1
    - torchvision: 0.18.1
  • Packages:
    - aiohappyeyeballs: 2.3.5
    - aiohttp: 3.10.3
    - aiosignal: 1.3.1
    - async-timeout: 4.0.3
    - attrs: 24.2.0
    - certifi: 2024.6.2
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - contourpy: 1.2.1
    - cycler: 0.12.1
    - decord: 0.6.0
    - dlib: 19.24.4
    - docker-pycreds: 0.4.0
    - facenet-pytorch: 2.5.3
    - filelock: 3.15.4
    - fonttools: 4.53.0
    - frozenlist: 1.4.1
    - fsspec: 2024.6.1
    - gitdb: 4.0.11
    - gitpython: 3.1.43
    - idna: 3.7
    - jinja2: 3.1.4
    - kiwisolver: 1.4.5
    - lightning: 2.4.0
    - lightning-utilities: 0.11.6
    - lpips: 0.1.4
    - markupsafe: 2.1.5
    - matplotlib: 3.8.4
    - mpmath: 1.3.0
    - multidict: 6.0.5
    - networkx: 3.3
    - numpy: 1.26.4
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.5.82
    - nvidia-nvtx-cu12: 12.1.105
    - opencv-python: 4.9.0.80
    - packaging: 24.1
    - pillow: 10.3.0
    - pip: 22.0.2
    - platformdirs: 4.2.2
    - protobuf: 5.27.2
    - psutil: 6.0.0
    - pyparsing: 3.1.2
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.4.0
    - pyyaml: 6.0.1
    - requests: 2.32.3
    - scipy: 1.13.1
    - sentry-sdk: 2.7.1
    - setproctitle: 1.3.3
    - setuptools: 59.6.0
    - six: 1.16.0
    - smmap: 5.0.1
    - sympy: 1.12.1
    - talking-head: 1.0
    - torch: 2.3.1
    - torchaudio: 2.3.1
    - torchmetrics: 1.4.1
    - torchvision: 0.18.1
    - tqdm: 4.66.2
    - trimesh: 4.3.2
    - triton: 2.3.1
    - typing-extensions: 4.12.2
    - urllib3: 2.2.2
    - wandb: 0.17.4
    - yarl: 1.9.4
  • System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.10.12
    - release: 6.5.0-44-generic
    - version: #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2

More info

No response
