Closed
Labels: question (Further information is requested)
Description
🐛 Bug
Training on a single GPU works fine, but when I switch to 2 GPUs the Trainer throws the following error:
ValueError: ctypes objects containing pointers cannot be pickled
To Reproduce
```python
trainer = pl.Trainer(
    gpus=2,
    logger=experiment_loggers,
    max_epochs=hparams.TRAINING.MAX_EPOCHS,
    callbacks=[ckpt_callback],
    log_every_n_steps=50,
    terminate_on_nan=True,
    default_root_dir=log_dir,
    progress_bar_refresh_rate=50,
    check_val_every_n_epoch=hparams.TRAINING.CHECK_VAL_EVERY_N_EPOCH,
    # checkpoint_callback=ckpt_callback,
    reload_dataloaders_every_epoch=hparams.TRAINING.RELOAD_DATALOADERS_EVERY_EPOCH,
    resume_from_checkpoint=hparams.TRAINING.RESUME,
    num_sanity_val_steps=0,
    fast_dev_run=fast_dev_run,
    **amp_params,
)
```
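For context: as the traceback below shows, with `gpus=2` this Lightning version goes through the `ddp_spawn` accelerator, which sends the `LightningModule` to worker processes by pickling it. Any attribute holding a ctypes object with pointers reproduces the exact error in isolation:

```python
import ctypes
import pickle

# A ctypes object containing a pointer, standing in for whatever handle
# (e.g. a renderer context) ends up stored on the module.
ptr = ctypes.pointer(ctypes.c_int(42))

try:
    pickle.dumps(ptr)
except ValueError as e:
    print(e)  # ctypes objects containing pointers cannot be pickled
```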
Expected behavior
I expect my trainer to run without errors, as it does on 1 CPU or 1 GPU.
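One way to keep the module picklable, assuming the culprit is an attribute holding a ctypes-backed handle (an assumption on my side; I have not pinned down which attribute it is), is to create that handle lazily per worker rather than in `__init__`. A toy sketch with a stand-in class, not my actual module:

```python
import ctypes
import pickle

class Model:
    """Stand-in for a LightningModule that owns a ctypes-backed resource."""

    def __init__(self):
        # Defer creation of the unpicklable handle so the object can be
        # pickled and sent to spawned workers.
        self.renderer = None

    def setup(self):
        # Construct the handle inside each worker process instead,
        # after multiprocessing has already spawned it.
        self.renderer = ctypes.pointer(ctypes.c_int(0))

model = Model()
pickle.dumps(model)  # succeeds: no ctypes pointer in the state yet
model.setup()        # per-process construction of the unpicklable handle
```

In Lightning, `setup()` is a hook that runs on every process, so it is a natural place for this kind of deferred construction.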
Environment
* CUDA:
- GPU:
- NVIDIA GeForce RTX 3080 Ti
- NVIDIA GeForce RTX 3080 Ti
- NVIDIA GeForce RTX 3080 Ti
- NVIDIA GeForce RTX 3080 Ti
- NVIDIA GeForce RTX 3080 Ti
- NVIDIA GeForce RTX 3080 Ti
- NVIDIA GeForce RTX 3080 Ti
- NVIDIA GeForce RTX 3080 Ti
- available: True
- version: 11.1
* Lightning:
- neural-renderer-pytorch: 1.1.3
- pytorch-lightning: 1.1.8
- pytorch3d: 0.7.0
- torch: 1.9.0+cu111
- torchaudio: 0.9.0
- torchgeometry: 0.1.2
- torchmetrics: 0.9.3
- torchvision: 0.10.0+cu111
* Packages:
- absl-py: 1.2.0
- aiohttp: 3.8.1
- aiosignal: 1.2.0
- albumentations: 1.2.1
- anyio: 3.6.1
- async-timeout: 4.0.2
- asyncer: 0.0.1
- attrs: 22.1.0
- beautifulsoup4: 4.11.1
- cachetools: 5.2.0
- certifi: 2021.5.30
- cffi: 1.14.3
- chardet: 3.0.4
- charset-normalizer: 2.1.0
- chumpy: 0.71
- click: 8.0.3
- colorama: 0.4.5
- cycler: 0.11.0
- fastapi: 0.72.0
- filelock: 3.8.0
- filetype: 1.0.9
- flatbuffers: 2.0
- flatten-dict: 0.4.2
- fonttools: 4.34.4
- freetype-py: 2.3.0
- frozenlist: 1.3.1
- fsspec: 2022.7.1
- future: 0.18.2
- fvcore: 0.1.5.post20210915
- gdown: 4.4.0
- google-auth: 2.10.0
- google-auth-oauthlib: 0.4.6
- grpcio: 1.42.0
- human-det: 0.0.2
- idna: 2.10
- imageio: 2.21.1
- importlib-metadata: 4.11.4
- iopath: 0.1.9
- joblib: 1.1.0
- jpeg4py: 0.1.4
- kaolin: 0.12.0
- kiwisolver: 1.4.4
- llvmlite: 0.39.0
- loguru: 0.6.0
- lxml: 4.9.1
- markdown: 3.4.1
- markupsafe: 2.1.1
- matplotlib: 3.5.3
- mkl-fft: 1.3.0
- mkl-random: 1.1.1
- mkl-service: 2.3.0
- multidict: 6.0.2
- networkx: 2.8.5
- neural-renderer-pytorch: 1.1.3
- numba: 0.56.0
- numpy: 1.22.4
- oauthlib: 3.2.0
- olefile: 0.46
- onnxruntime: 1.10.0
- opencv-python: 4.6.0.66
- opencv-python-headless: 4.6.0.66
- packaging: 21.3
- pare: 0.1
- pillow: 9.2.0
- pip: 22.1.2
- portalocker: 2.5.1
- protobuf: 3.15.8
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.9.2
- pydeprecate: 0.3.2
- pyglet: 1.5.26
- pymatting: 1.1.5
- pymcubes: 0.1.2
- pyopengl: 3.1.0
- pyopenssl: 19.1.0
- pyparsing: 3.0.9
- pyrender: 0.1.45
- pysocks: 1.7.1
- python-dateutil: 2.8.2
- python-multipart: 0.0.5
- pytorch-lightning: 1.1.8
- pytorch3d: 0.7.0
- pywavelets: 1.3.0
- pyyaml: 6.0
- qudida: 0.0.4
- rembg: 2.0.13+1.geb110d2
- requests: 2.24.0
- requests-oauthlib: 1.3.1
- requests-toolbelt: 0.9.1
- rsa: 4.9
- scikit-image: 0.19.3
- scikit-learn: 1.1.2
- scipy: 1.9.0
- setuptools: 61.2.0
- six: 1.16.0
- smplx: 0.1.26
- sniffio: 1.2.0
- soupsieve: 2.3.2.post1
- starlette: 0.17.1
- tabulate: 0.8.10
- tensorboard: 2.10.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- termcolor: 1.1.0
- threadpoolctl: 3.1.0
- tifffile: 2022.8.8
- torch: 1.9.0+cu111
- torchaudio: 0.9.0
- torchgeometry: 0.1.2
- torchmetrics: 0.9.3
- torchvision: 0.10.0+cu111
- tqdm: 4.64.0
- trimesh: 3.9.35
- typing-extensions: 4.3.0
- urllib3: 1.25.11
- usd-core: 22.5.post1
- voxelize-cuda: 0.0.0
- watchdog: 2.1.7
- werkzeug: 2.2.2
- wheel: 0.37.1
- yacs: 0.1.8
- yarl: 1.8.1
- zipp: 3.8.1
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.13
- version: #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022
Additional context
Bug traceback
```
Traceback (most recent call last):
  File "scripts/train.py", line 194, in <module>
    main(hparams, disable_comet=args.disable_comet, fast_dev_run=args.fdr)
  File "scripts/train.py", line 140, in main
    trainer.fit(model)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 83, in train
    mp.spawn(self.ddp_train, nprocs=self.nprocs, args=(self.mp_queue, model,))
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/zqli/miniconda3/envs/icon/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled
```
[W CudaIPCTypes.cpp:99] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch
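Since the `mp.spawn` call in the traceback is what pickles the module, one possible workaround on Lightning 1.1.x, assuming your launch setup allows relaunching the script per process, is to select the subprocess-launch DDP backend, which does not go through `mp.spawn` pickling. A sketch, keeping the other Trainer arguments from the snippet above unchanged:

```python
trainer = pl.Trainer(
    gpus=2,
    accelerator="ddp",  # script-relaunch DDP; avoids the mp.spawn pickling path
    # ... remaining arguments as in the original Trainer call ...
)
```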