Description
🐛 Bug
After updating to PL 1.2, LightningModule.load_from_checkpoint(checkpoint) fails for a checkpoint from a model trained with PL 1.1.6, raising the following AttributeError:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 134, in load_from_checkpoint
checkpoint = pl_load(checkpoint_path, map_location=lambda storage, loc: storage)
File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 32, in load
return torch.load(f, map_location=map_location)
File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
AttributeError: Can't get attribute '_gpus_arg_default' on <module 'pytorch_lightning.utilities.argparse_utils' from '/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse_utils.py'>
The model used is torchvision.models.detection.fasterrcnn_resnet50_fpn and can be found here. However, the problem does not seem to be related to the type of model (see the BoringModel Colab).
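For reference, the failing call is just the standard checkpoint-loading API. A minimal sketch, with a hypothetical module name and checkpoint path standing in for the real ones:
# Minimal sketch of the failing call (module name and checkpoint path are
# hypothetical; the real module wraps fasterrcnn_resnet50_fpn).
from my_project.model import DetectionModule

# Checkpoint written by PL 1.1.6; calling this under PL 1.2 raises the
# AttributeError shown in the traceback above.
model = DetectionModule.load_from_checkpoint("epoch=0-step=99.ckpt")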
To Reproduce
See Colab: https://colab.research.google.com/drive/1JbDHiipjx7zBYQYTPUzWtEatUB1AIfq4?usp=sharing
Reproducing requires training the model and saving a checkpoint with PL version 1.1.6, then loading the model with PL version 1.2. To do this in the Colab:
- Run all cells down to (and including) cell that installs/updates PL 1.2
- Reset the Colab runtime and re-run the "Deps" and "Model" steps
- Run the test steps
Expected behavior
LightningModule.load_from_checkpoint(checkpoint) successfully loads the model.
Environment
* CUDA:
- GPU:
- Tesla K80
- available: True
- version: 10.1
* Packages:
- numpy: 1.19.5
- pyTorch_debug: False
- pyTorch_version: 1.7.1+cu101
- pytorch-lightning: 1.2.1
- tqdm: 4.41.1
* System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.10
- version: #1 SMP Thu Jul 23 08:00:38 PDT 2020
Additional context
When reproducing the error I noticed that it does not fail if one omits self.save_hyperparameters() in the __init__ method of the LightningModule being trained. I guess this saves the hyperparameters to the LightningModule, and thus to the checkpoint. Printing the saved hparams from the checkpoint generated in the Colab:
{'accelerator': 'ddp',
'accumulate_grad_batches': 1,
'amp_backend': 'native',
'amp_level': 'O2',
'auto_lr_find': False,
'auto_scale_batch_size': False,
'auto_select_gpus': False,
'automatic_optimization': None,
'benchmark': False,
'check_val_every_n_epoch': 1,
'checkpoint_callback': True,
'default_root_dir': None,
'deterministic': False,
'distributed_backend': None,
'enable_pl_optimizer': None,
'fast_dev_run': False,
'flush_logs_every_n_steps': 100,
'gpus': 1,
'gradient_clip_val': 0,
'limit_test_batches': 1.0,
'limit_train_batches': 1.0,
'limit_val_batches': 1.0,
'log_every_n_steps': 50,
'log_gpu_memory': None,
'logger': True,
'max_epochs': 1,
'max_steps': None,
'min_epochs': 1,
'min_steps': None,
'move_metrics_to_cpu': False,
'num_nodes': 1,
'num_processes': 1,
'num_sanity_val_steps': 2,
'overfit_batches': 0.0,
'plugins': None,
'precision': 32,
'prepare_data_per_node': True,
'process_position': 0,
'profiler': None,
'progress_bar_refresh_rate': 1,
'reload_dataloaders_every_epoch': False,
'replace_sampler_ddp': True,
'resume_from_checkpoint': None,
'sync_batchnorm': False,
'terminate_on_nan': False,
'tpu_cores': <function pytorch_lightning.utilities.argparse_utils._gpus_arg_default>,
'track_grad_norm': -1,
'truncated_bptt_steps': None,
'val_check_interval': 1.0,
'weights_save_path': None,
'weights_summary': 'top'}
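For context, here is a minimal sketch of how such Trainer defaults can end up in a checkpoint's hparams; it assumes the module is constructed from the parsed Trainer arguments and calls self.save_hyperparameters() (names are illustrative, not taken from the Colab):
from argparse import ArgumentParser

import pytorch_lightning as pl


class BoringModelWithHparams(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        # Records all __init__ arguments (here: every Trainer argument,
        # including callable defaults such as _gpus_arg_default under
        # PL 1.1.x) and pickles them into every checkpoint.
        self.save_hyperparameters()


parser = pl.Trainer.add_argparse_args(ArgumentParser())
args = parser.parse_args([])
model = BoringModelWithHparams(**vars(args))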
My guess is that the problem occurs because the saved hparams contain 'tpu_cores': <function pytorch_lightning.utilities.argparse_utils._gpus_arg_default>, and this module path changed to pytorch_lightning.utilities.argparse in PL 1.2.
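If that is the cause, a possible workaround until old checkpoints load cleanly again might be to restore the missing attribute on the old module path before loading, so unpickling can resolve the reference. A hedged sketch, not an official fix (the stub function and module name are placeholders):
import pytorch_lightning.utilities.argparse_utils as argparse_utils


def _gpus_arg_default(x):
    # Stub for the function that existed in PL 1.1.x; it is only needed so
    # the pickled reference in the old checkpoint can be resolved.
    return x


argparse_utils._gpus_arg_default = _gpus_arg_default

# Hypothetical module name, as above; load_from_checkpoint should now get
# past the AttributeError (the restored hparams value is just this stub).
from my_project.model import DetectionModule

model = DetectionModule.load_from_checkpoint("pl_1_1_6.ckpt")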