Description
This came up in discussions around #11818, #11797, #11798, #11799
Currently, device-availability checks happen inside the accelerator connector during Trainer initialization:
https://github.com/PyTorchLightning/pytorch-lightning/blob/a2d8c4f6a6080234e47ccc5ad593912303d29bf9/pytorch_lightning/trainer/connectors/accelerator_connector.py#L197-L229
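For reference, the connector-side check is roughly of the following shape (a simplified sketch, not the exact linked code, which handles more flags and device counts):

```python
import torch

from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _check_device_availability(accelerator_flag: str) -> None:
    # Simplified: hardware-specific branching lives in the connector,
    # not in the accelerator classes themselves.
    if accelerator_flag == "gpu" and not torch.cuda.is_available():
        raise MisconfigurationException(
            "GPUs were requested, but CUDA is not available on this machine."
        )
```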
Discussion 1:
Should the per-device check move out of the Trainer to each Accelerator?
Pros for Trainer:
???
Pros for Accelerator:
- Logic is better encapsulated
- More extensible: new hardware can be supported without changes to the Trainer (see the sketch below)
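For illustration, moving the check into each Accelerator could look roughly like this (hypothetical hook names, a sketch rather than the current Lightning API):

```python
from abc import ABC, abstractmethod

import torch


class Accelerator(ABC):
    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Return True if this accelerator's hardware is usable on the current host."""


class GPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return torch.cuda.is_available()
```

The Trainer would then only need a generic `accelerator.is_available()` call, and supporting new hardware would not require touching the Trainer.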
Discussion 2:
Assuming device checks happen inside the Accelerator class, should runtime checks happen by default, or should each accelerator determine when to assert availability?
Pros for default checks:
- Additional safety independent of strategy logic
Pros for each accelerator determining this on their own:
- More flexibility over when (or whether) the check runs (see the sketch after this list)
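A sketch of the default-check option, extending the hypothetical base class above: the base class asserts availability in a shared code path, so every subclass gets the safety net unless it deliberately opts out.

```python
from abc import ABC, abstractmethod

from pytorch_lightning.utilities.exceptions import MisconfigurationException


class Accelerator(ABC):  # extending the sketch above
    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        ...

    def assert_device_available(self) -> None:
        # Called by the Trainer in a shared code path, so every accelerator
        # is checked by default. The alternative design would leave it to
        # each subclass to decide if and when to call (or override) this.
        if not self.is_available():
            raise MisconfigurationException(
                f"{type(self).__name__} was requested but is not available on this host."
            )
```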
Discussion 3:
Assuming device checks happen automatically inside the Accelerator class, should they run at initialization, or during `setup_environment` as the first thing the Trainer does at runtime?
Pros for `setup_environment`:
- Mimics the `torch.device` experience (demonstrated below). One can create `torch.device("cuda")` on a host without GPUs; moving a tensor to that device, however, fails because CUDA is unavailable. The corollary would be the ability to create a `GPUAccelerator` on a host without GPUs, while calling `GPUAccelerator.setup_environment` would fail.
- Easier for testing other parts of these classes (doesn't require mocking device availability)
- Instantiating the Accelerator class doesn't imply that the model is actually on the device; it simply expresses an intent of what hardware the model should be trained with
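The `torch.device` behavior referenced above, on a CPU-only host:

```python
import torch

device = torch.device("cuda")  # succeeds: merely records intent
x = torch.zeros(1)
x = x.to(device)               # raises here: CUDA is unavailable on this host
```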
Pros for constructor:
- Fails faster: a misconfiguration surfaces as soon as the accelerator object is created (see the sketch below)
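For contrast, the constructor-time variant (same hypothetical names as the sketches above) would surface the error at `Trainer(...)` construction rather than at fit time:

```python
import torch

from pytorch_lightning.utilities.exceptions import MisconfigurationException


class GPUAccelerator:  # illustrative, not the current Lightning class
    def __init__(self) -> None:
        # Checking in __init__ fails fast, but also makes the class
        # impossible to even instantiate (e.g. in tests) on a host
        # without GPUs.
        if not torch.cuda.is_available():
            raise MisconfigurationException(
                "GPUAccelerator requires CUDA, which is unavailable on this host."
            )
```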
Originally asked by @rohitgr7 in #11797 (comment)
cc @tchaton @justusschock @awaelchli @Borda @akihironitta @rohitgr7 @four4fish