Closed
Labels
bug, needs triage, ver: 2.4.x
Description
Bug description
I am currently training a model with a custom save path. The model saves properly; however, trainer.fit(model, ckpt_path="last") is unable to find any checkpoint.
checkpoint_callback = ModelCheckpoint(
    monitor='total_loss',
    dirpath=config["training"]["model_path"]+"/"+config["training"]["name"],
    filename='portrait-{epoch:02d}-{total_loss:.2f}',
    save_top_k=2,
    mode='min',
)
# latest_checkpoint = find_latest_checkpoint(config["training"]["model_path"])
trainer = pl.Trainer(
    default_root_dir=config["training"]["model_path"]+"/"+config["training"]["name"],
    max_epochs=config["training"]["num_epochs"],
    devices=-1 if torch.cuda.is_available() else 0,
    accelerator="gpu" if torch.cuda.is_available() else None,
    strategy='ddp_find_unused_parameters_true',
    callbacks=[checkpoint_callback],
)
model = PortraitTrainer(config)
trainer.fit(model, ckpt_path="last")
I have tried the following with default_root_dir set as above:
- Specifying the exact path relative to os.getcwd(), i.e. config["training"]["model_path"]+"/"+config["training"]["name"] + "some checkpoint": works.
- Copying the checkpoint to os.getcwd() and changing the call to trainer.fit(model, ckpt_path="some checkpoint"): works.
- Not copying the checkpoint to the workspace while calling trainer.fit(model, ckpt_path="some checkpoint"): does not work.
I believe default_root_dir is not being set correctly, or fit is not prepending the root dir to ckpt_path.
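Since passing the exact path is the case that works above, one possible workaround is to resolve the newest checkpoint in the checkpoint directory and pass it explicitly. The following is only a minimal sketch: `find_latest_checkpoint` is a hypothetical helper (named after the commented-out call in the snippet above, whose real implementation is not shown), and it assumes checkpoints are `.ckpt` files written directly into the `dirpath` given to ModelCheckpoint.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the absolute path of the most recently modified .ckpt file.

    Hypothetical helper, not part of Lightning. Returns None when the
    directory does not exist or contains no .ckpt files.
    """
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None
    checkpoints = list(directory.glob("*.ckpt"))
    if not checkpoints:
        return None
    # Pick the newest file by modification time.
    latest = max(checkpoints, key=lambda p: p.stat().st_mtime)
    return str(latest.resolve())


# Assumed usage with the names from the snippet above:
# ckpt_dir = config["training"]["model_path"] + "/" + config["training"]["name"]
# trainer.fit(model, ckpt_path=find_latest_checkpoint(ckpt_dir))
```

When the helper returns None, ckpt_path=None is passed and training starts from scratch, so the same call covers both the first run and later resumes.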
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
No error is raised; however, ckpt_path="last" does not resume from the checkpoint.
Environment
- CUDA:
- GPU:
- NVIDIA GeForce RTX 3090
- available: True
- version: 12.1
- Lightning:
- facenet-pytorch: 2.5.3
- lightning: 2.4.0
- lightning-utilities: 0.11.6
- pytorch-lightning: 2.4.0
- torch: 2.3.1
- torchaudio: 2.3.1
- torchmetrics: 1.4.1
- torchvision: 0.18.1
- Packages:
- aiohappyeyeballs: 2.3.5
- aiohttp: 3.10.3
- aiosignal: 1.3.1
- async-timeout: 4.0.3
- attrs: 24.2.0
- certifi: 2024.6.2
- charset-normalizer: 3.3.2
- click: 8.1.7
- contourpy: 1.2.1
- cycler: 0.12.1
- decord: 0.6.0
- dlib: 19.24.4
- docker-pycreds: 0.4.0
- facenet-pytorch: 2.5.3
- filelock: 3.15.4
- fonttools: 4.53.0
- frozenlist: 1.4.1
- fsspec: 2024.6.1
- gitdb: 4.0.11
- gitpython: 3.1.43
- idna: 3.7
- jinja2: 3.1.4
- kiwisolver: 1.4.5
- lightning: 2.4.0
- lightning-utilities: 0.11.6
- lpips: 0.1.4
- markupsafe: 2.1.5
- matplotlib: 3.8.4
- mpmath: 1.3.0
- multidict: 6.0.5
- networkx: 3.3
- numpy: 1.26.4
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 8.9.2.26
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.5.82
- nvidia-nvtx-cu12: 12.1.105
- opencv-python: 4.9.0.80
- packaging: 24.1
- pillow: 10.3.0
- pip: 22.0.2
- platformdirs: 4.2.2
- protobuf: 5.27.2
- psutil: 6.0.0
- pyparsing: 3.1.2
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.4.0
- pyyaml: 6.0.1
- requests: 2.32.3
- scipy: 1.13.1
- sentry-sdk: 2.7.1
- setproctitle: 1.3.3
- setuptools: 59.6.0
- six: 1.16.0
- smmap: 5.0.1
- sympy: 1.12.1
- talking-head: 1.0
- torch: 2.3.1
- torchaudio: 2.3.1
- torchmetrics: 1.4.1
- torchvision: 0.18.1
- tqdm: 4.66.2
- trimesh: 4.3.2
- triton: 2.3.1
- typing-extensions: 4.12.2
- urllib3: 2.2.2
- wandb: 0.17.4
- yarl: 1.9.4
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.12
- release: 6.5.0-44-generic
- version: #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2
More info
No response