
ValueError: ctypes objects containing pointers cannot be pickled #14198

@MooreManor

🐛 Bug

When training on 1 GPU, my code runs without problems. When I use 2 GPUs, my Trainer raises the following error:
ValueError: ctypes objects containing pointers cannot be pickled
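
For context, this message comes from Python's `ctypes` module itself, not from Lightning: any ctypes object that is or contains a pointer refuses to be pickled. Below is a minimal sketch that triggers the same message in isolation. The `NativeHandle` structure and the `pickle_error` helper are hypothetical stand-ins; which object in the actual training code holds such a pointer is not known from this report.

```python
import ctypes
import pickle


class NativeHandle(ctypes.Structure):
    # Hypothetical stand-in for a handle owned by a native library.
    # The c_void_p field makes the whole structure "contain a pointer".
    _fields_ = [("size", ctypes.c_int), ("data", ctypes.c_void_p)]


def pickle_error(obj):
    """Return None if obj can be pickled, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except ValueError as exc:
        return str(exc)


# A bare pointer and a structure holding a pointer both fail the same way.
print(pickle_error(ctypes.c_void_p(0)))
print(pickle_error(NativeHandle(4, None)))
```

Anything reachable from the object being pickled (an attribute, a list element, a nested object) is enough to raise this error.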

To Reproduce

trainer = pl.Trainer(
    gpus=2,
    logger=experiment_loggers,
    max_epochs=hparams.TRAINING.MAX_EPOCHS,
    callbacks=[ckpt_callback],
    log_every_n_steps=50,
    terminate_on_nan=True,
    default_root_dir=log_dir,
    progress_bar_refresh_rate=50,
    check_val_every_n_epoch=hparams.TRAINING.CHECK_VAL_EVERY_N_EPOCH,
    # checkpoint_callback=ckpt_callback,
    reload_dataloaders_every_epoch=hparams.TRAINING.RELOAD_DATALOADERS_EVERY_EPOCH,
    resume_from_checkpoint=hparams.TRAINING.RESUME,
    num_sanity_val_steps=0,
    fast_dev_run=fast_dev_run,
    **amp_params,
)

Expected behavior

I expect my trainer to run without errors, as it does on 1 CPU or 1 GPU.

Environment

* CUDA:
	- GPU:
		- NVIDIA GeForce RTX 3080 Ti
		- NVIDIA GeForce RTX 3080 Ti
		- NVIDIA GeForce RTX 3080 Ti
		- NVIDIA GeForce RTX 3080 Ti
		- NVIDIA GeForce RTX 3080 Ti
		- NVIDIA GeForce RTX 3080 Ti
		- NVIDIA GeForce RTX 3080 Ti
		- NVIDIA GeForce RTX 3080 Ti
	- available:         True
	- version:           11.1
* Lightning:
	- neural-renderer-pytorch: 1.1.3
	- pytorch-lightning: 1.1.8
	- pytorch3d:         0.7.0
	- torch:             1.9.0+cu111
	- torchaudio:        0.9.0
	- torchgeometry:     0.1.2
	- torchmetrics:      0.9.3
	- torchvision:       0.10.0+cu111
* Packages:
	- absl-py:           1.2.0
	- aiohttp:           3.8.1
	- aiosignal:         1.2.0
	- albumentations:    1.2.1
	- anyio:             3.6.1
	- async-timeout:     4.0.2
	- asyncer:           0.0.1
	- attrs:             22.1.0
	- beautifulsoup4:    4.11.1
	- cachetools:        5.2.0
	- certifi:           2021.5.30
	- cffi:              1.14.3
	- chardet:           3.0.4
	- charset-normalizer: 2.1.0
	- chumpy:            0.71
	- click:             8.0.3
	- colorama:          0.4.5
	- cycler:            0.11.0
	- fastapi:           0.72.0
	- filelock:          3.8.0
	- filetype:          1.0.9
	- flatbuffers:       2.0
	- flatten-dict:      0.4.2
	- fonttools:         4.34.4
	- freetype-py:       2.3.0
	- frozenlist:        1.3.1
	- fsspec:            2022.7.1
	- future:            0.18.2
	- fvcore:            0.1.5.post20210915
	- gdown:             4.4.0
	- google-auth:       2.10.0
	- google-auth-oauthlib: 0.4.6
	- grpcio:            1.42.0
	- human-det:         0.0.2
	- idna:              2.10
	- imageio:           2.21.1
	- importlib-metadata: 4.11.4
	- iopath:            0.1.9
	- joblib:            1.1.0
	- jpeg4py:           0.1.4
	- kaolin:            0.12.0
	- kiwisolver:        1.4.4
	- llvmlite:          0.39.0
	- loguru:            0.6.0
	- lxml:              4.9.1
	- markdown:          3.4.1
	- markupsafe:        2.1.1
	- matplotlib:        3.5.3
	- mkl-fft:           1.3.0
	- mkl-random:        1.1.1
	- mkl-service:       2.3.0
	- multidict:         6.0.2
	- networkx:          2.8.5
	- neural-renderer-pytorch: 1.1.3
	- numba:             0.56.0
	- numpy:             1.22.4
	- oauthlib:          3.2.0
	- olefile:           0.46
	- onnxruntime:       1.10.0
	- opencv-python:     4.6.0.66
	- opencv-python-headless: 4.6.0.66
	- packaging:         21.3
	- pare:              0.1
	- pillow:            9.2.0
	- pip:               22.1.2
	- portalocker:       2.5.1
	- protobuf:          3.15.8
	- pyasn1:            0.4.8
	- pyasn1-modules:    0.2.8
	- pycparser:         2.21
	- pydantic:          1.9.2
	- pydeprecate:       0.3.2
	- pyglet:            1.5.26
	- pymatting:         1.1.5
	- pymcubes:          0.1.2
	- pyopengl:          3.1.0
	- pyopenssl:         19.1.0
	- pyparsing:         3.0.9
	- pyrender:          0.1.45
	- pysocks:           1.7.1
	- python-dateutil:   2.8.2
	- python-multipart:  0.0.5
	- pytorch-lightning: 1.1.8
	- pytorch3d:         0.7.0
	- pywavelets:        1.3.0
	- pyyaml:            6.0
	- qudida:            0.0.4
	- rembg:             2.0.13+1.geb110d2
	- requests:          2.24.0
	- requests-oauthlib: 1.3.1
	- requests-toolbelt: 0.9.1
	- rsa:               4.9
	- scikit-image:      0.19.3
	- scikit-learn:      1.1.2
	- scipy:             1.9.0
	- setuptools:        61.2.0
	- six:               1.16.0
	- smplx:             0.1.26
	- sniffio:           1.2.0
	- soupsieve:         2.3.2.post1
	- starlette:         0.17.1
	- tabulate:          0.8.10
	- tensorboard:       2.10.0
	- tensorboard-data-server: 0.6.1
	- tensorboard-plugin-wit: 1.8.1
	- termcolor:         1.1.0
	- threadpoolctl:     3.1.0
	- tifffile:          2022.8.8
	- torch:             1.9.0+cu111
	- torchaudio:        0.9.0
	- torchgeometry:     0.1.2
	- torchmetrics:      0.9.3
	- torchvision:       0.10.0+cu111
	- tqdm:              4.64.0
	- trimesh:           3.9.35
	- typing-extensions: 4.3.0
	- urllib3:           1.25.11
	- usd-core:          22.5.post1
	- voxelize-cuda:     0.0.0
	- watchdog:          2.1.7
	- werkzeug:          2.2.2
	- wheel:             0.37.1
	- yacs:              0.1.8
	- yarl:              1.8.1
	- zipp:              3.8.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.8.13
	- version:           #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022

Additional context

Bug traceback

Traceback (most recent call last):
  File "scripts/train.py", line 194, in <module>
    main(hparams, disable_comet=args.disable_comet, fast_dev_run=args.fdr)
  File "scripts/train.py", line 140, in main
    trainer.fit(model)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 83, in train
    mp.spawn(self.ddp_train, nprocs=self.nprocs, args=(self.mp_queue, model,))
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled
[W CudaIPCTypes.cpp:99] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch
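
The traceback shows `mp.spawn` being called with `args=(self.mp_queue, model,)`, and the failure occurs while the spawn start method pickles the process object, which includes those arguments. One way to narrow this down is to try pickling each attribute of the model separately. The following is only a sketch under assumptions: `find_unpicklable` and `FakeModel` are hypothetical names, the check covers top-level attributes only, and the real culprit may sit deeper (for example inside a submodule or inside another argument).

```python
import ctypes
import pickle


def find_unpicklable(obj):
    """Map attribute name -> error text for each attribute that fails to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickling can fail with several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


class FakeModel:
    """Hypothetical stand-in for the LightningModule passed to trainer.fit()."""

    def __init__(self):
        self.learning_rate = 1e-3                    # plain data, pickles fine
        self.native_handle = ctypes.c_void_p(0)      # ctypes pointer, does not


report = find_unpicklable(FakeModel())
print(report)
```

Running the same helper on the real model before calling `trainer.fit(model)` should list the offending attribute names, if the problem is at the top level. Applying it recursively to a reported attribute narrows the search further.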
