
Error in PL 1.2 when loading models that call save_hyperparameters and were trained using PL <1.2 #6263

@larshbj

Description


🐛 Bug

After updating to PL 1.2, LightningModule.load_from_checkpoint(checkpoint), using a checkpoint from a model trained with PL 1.1.6, fails with the following AttributeError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 134, in load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=lambda storage, loc: storage)
  File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 32, in load
    return torch.load(f, map_location=map_location)
  File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
    result = unpickler.load()
AttributeError: Can't get attribute '_gpus_arg_default' on <module 'pytorch_lightning.utilities.argparse_utils' from '/Users/larshbj/Library/Caches/pypoetry/virtualenvs/vake-TBRjjU-l-py3.8/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse_utils.py'>

The model used is torchvision.models.detection.fasterrcnn_resnet50_fpn. However, the problem does not seem to be related to the type of model (see the BoringModel Colab).

To Reproduce

See Colab: https://colab.research.google.com/drive/1JbDHiipjx7zBYQYTPUzWtEatUB1AIfq4?usp=sharing

Reproducing requires training the model and saving a checkpoint with PL version 1.1.6, then loading the model with PL version 1.2; a minimal sketch of this scenario follows the list below. To do this in the Colab:

  • Run all cells down to (and including) cell that installs/updates PL 1.2
  • Reset colab runtime and re-run "Deps" and "Model" steps
  • Run test steps
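
For reference, here is a rough sketch of the failing scenario outside the Colab. It assumes the Colab's setup of passing the Trainer argparse defaults into the module; the module, dataloader, and checkpoint path below are illustrative, not the exact Colab code. Under PL 1.1.x the tpu_cores argparse default is a function reference, which save_hyperparameters() then pickles into the checkpoint.

# Minimal sketch of the failing scenario (assumptions: run the training/saving
# part under PL 1.1.6 and the loading part under PL 1.2; module, dataloader and
# checkpoint path are illustrative, not the exact Colab code).
from argparse import ArgumentParser

import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        # Under PL 1.1.x the argparse default for `tpu_cores` is the function
        # `argparse_utils._gpus_arg_default`, so it ends up pickled in hparams.
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# Collect the Trainer argparse defaults and hand them to the module,
# mirroring what the Colab does.
parser = pl.Trainer.add_argparse_args(ArgumentParser())
args = parser.parse_args([])
model = LitModel(**vars(args))

trainer = pl.Trainer(max_epochs=1, limit_train_batches=2, limit_val_batches=0)
trainer.fit(model, DataLoader(torch.randn(8, 32), batch_size=4))
trainer.save_checkpoint("pl_1_1_6.ckpt")

# --- later, after upgrading to PL 1.2 ---
# LitModel.load_from_checkpoint("pl_1_1_6.ckpt")  # raises the AttributeError above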

Expected behavior

LightningModule.load_from_checkpoint(checkpoint) successfully loads model.

Environment

* CUDA:
	- GPU:
		- Tesla K80
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.7.1+cu101
	- pytorch-lightning: 1.2.1
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.7.10
	- version:           #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

When reproducing the error I noticed that it does not fail if one omits self.save_hyperparameters() in the __init__ method of the LightningModule being trained. I assume this call saves the hyperparameters to the LightningModule, and thus to the checkpoint. Printing the saved hparams from the checkpoint generated in the Colab:

{'accelerator': 'ddp',
 'accumulate_grad_batches': 1,
 'amp_backend': 'native',
 'amp_level': 'O2',
 'auto_lr_find': False,
 'auto_scale_batch_size': False,
 'auto_select_gpus': False,
 'automatic_optimization': None,
 'benchmark': False,
 'check_val_every_n_epoch': 1,
 'checkpoint_callback': True,
 'default_root_dir': None,
 'deterministic': False,
 'distributed_backend': None,
 'enable_pl_optimizer': None,
 'fast_dev_run': False,
 'flush_logs_every_n_steps': 100,
 'gpus': 1,
 'gradient_clip_val': 0,
 'limit_test_batches': 1.0,
 'limit_train_batches': 1.0,
 'limit_val_batches': 1.0,
 'log_every_n_steps': 50,
 'log_gpu_memory': None,
 'logger': True,
 'max_epochs': 1,
 'max_steps': None,
 'min_epochs': 1,
 'min_steps': None,
 'move_metrics_to_cpu': False,
 'num_nodes': 1,
 'num_processes': 1,
 'num_sanity_val_steps': 2,
 'overfit_batches': 0.0,
 'plugins': None,
 'precision': 32,
 'prepare_data_per_node': True,
 'process_position': 0,
 'profiler': None,
 'progress_bar_refresh_rate': 1,
 'reload_dataloaders_every_epoch': False,
 'replace_sampler_ddp': True,
 'resume_from_checkpoint': None,
 'sync_batchnorm': False,
 'terminate_on_nan': False,
 'tpu_cores': <function pytorch_lightning.utilities.argparse_utils._gpus_arg_default>,
 'track_grad_norm': -1,
 'truncated_bptt_steps': None,
 'val_check_interval': 1.0,
 'weights_save_path': None,
 'weights_summary': 'top'}

My guess is that the problem occurs because 'tpu_cores' is stored as a reference to the function pytorch_lightning.utilities.argparse_utils._gpus_arg_default, and that module path changed to pytorch_lightning.utilities.argparse in PL 1.2, so unpickling the old checkpoint can no longer resolve the attribute.
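
As a possible stopgap until this is fixed, re-exposing some attribute under the old name before loading lets unpickling finish. This is only a sketch, not an official fix; the stub below is my assumption, since the stored value looks like an unused argparse default and should never actually be called during loading.

# Workaround sketch (assumption, not the official fix): give the old module
# path back an attribute named `_gpus_arg_default` before loading, so that
# getattr(argparse_utils, "_gpus_arg_default") succeeds during unpickling.
import pytorch_lightning.utilities.argparse_utils as argparse_utils


def _gpus_arg_default(x):
    # Stub standing in for the helper removed in PL 1.2; it only needs to exist.
    return x


argparse_utils._gpus_arg_default = _gpus_arg_default

# Loading the old checkpoint should now get past the AttributeError:
# LitModel.load_from_checkpoint("pl_1_1_6.ckpt")  # hypothetical module/path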


Labels

bug (Something isn't working) · checkpointing (Related to checkpointing) · help wanted (Open to be worked on) · priority: 0 (High priority task)
