Where and when should device availability checks happen? #11831

@ananthsub

This came up in discussions around #11818, #11797, #11798, and #11799.

Currently, device checks happen inside of the accelerator connector during trainer initialization.
https://github.com/PyTorchLightning/pytorch-lightning/blob/a2d8c4f6a6080234e47ccc5ad593912303d29bf9/pytorch_lightning/trainer/connectors/accelerator_connector.py#L197-L229

Discussion 1:

Should the per-device check move out of the Trainer to each Accelerator?

Pros for Trainer:
???

Pros for Accelerator:

  • Logic is better encapsulated
  • More extensible for new hardware without requiring changes to the Trainer
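To make the encapsulation argument concrete, here is a minimal sketch of what delegating the check to each accelerator could look like. All class and function names below (`Accelerator`, `FakeHardwareAccelerator`, `select_accelerator`) are hypothetical stand-ins for illustration and are not taken from the Lightning codebase.

```python
from abc import ABC, abstractmethod


class Accelerator(ABC):
    """Hypothetical base class; not the actual Lightning API."""

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Return True if the hardware this accelerator targets is usable."""


class CPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return True  # a CPU is always present


class FakeHardwareAccelerator(Accelerator):
    """Stand-in for new hardware added without touching the trainer."""

    @staticmethod
    def is_available() -> bool:
        return False  # pretend the device is missing on this host


def select_accelerator(accelerator: Accelerator) -> Accelerator:
    # Trainer-side code only delegates; it holds no per-device branches.
    if not accelerator.is_available():
        raise RuntimeError(
            f"{type(accelerator).__name__} is not available on this host."
        )
    return accelerator
```

With this shape, supporting a new device would mean adding a subclass, while `select_accelerator` stays unchanged.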

Discussion 2:

Assuming device checks happen inside of the Accelerator class, should runtime checks happen by default? Or should each accelerator determine when to assert availability?

Pros for default checks:

  • Additional safety independent of strategy logic

Pros for each accelerator determining this on their own:

  • More flexibility around when this is called?
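The two options can be sketched side by side. This is only an illustration under assumed names: `_setup`, `use_device`, `AlwaysCheckedAccelerator`, and `SelfManagedAccelerator` are hypothetical, and availability is injected as a boolean so the example runs anywhere.

```python
class Accelerator:
    """Hypothetical base class; names are illustrative only."""

    def __init__(self, available: bool) -> None:
        # Injected flag standing in for a real hardware query.
        self._available = available
        self.ready = False

    def is_available(self) -> bool:
        return self._available

    def assert_available(self) -> None:
        if not self.is_available():
            raise RuntimeError(f"{type(self).__name__} is not available.")

    def setup_environment(self) -> None:
        # Default check: always runs before device-specific setup,
        # independent of whatever the strategy does.
        self.assert_available()
        self._setup()

    def _setup(self) -> None:
        self.ready = True


class AlwaysCheckedAccelerator(Accelerator):
    """Inherits the default check unchanged."""


class SelfManagedAccelerator(Accelerator):
    """Decides on its own when to assert availability."""

    def setup_environment(self) -> None:
        # Skips the default check and defers it to first device use.
        self._setup()

    def use_device(self) -> None:
        self.assert_available()
```

The trade-off shows up in the override: the self-managed variant gains control over timing, but nothing forces it to call `assert_available` at all.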

Discussion 3:

Assuming device checks happen automatically inside of the Accelerator class, should they happen at initialization, or in setup_environment as the first step of the Trainer's runtime?

Pros for setup_environment:

  • Mimics torch device experience
    One can create torch.device("cuda") on a host without GPUs. However, moving a tensor to this device fails because CUDA is unavailable. By analogy, one could create a GPUAccelerator on a host without GPUs, but calling GPUAccelerator.setup_environment would fail.
  • Easier for testing other parts of these classes (doesn't require mocking device availability)
  • Instantiating the Accelerator class doesn't imply that the model is actually on the device. It simply declares the intent of which hardware the model should be trained on

Pros for constructor:

  • Fails faster
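The difference between the two timings can be sketched as follows. `SketchGPUAccelerator`, `DeviceUnavailableError`, and the `check_on_init` flag are hypothetical names invented for this example; `device_available` is an injected stand-in for a real query such as `torch.cuda.is_available()`.

```python
class DeviceUnavailableError(RuntimeError):
    """Raised when the requested hardware is missing (hypothetical)."""


class SketchGPUAccelerator:
    """Illustrative only; not the Lightning GPUAccelerator."""

    def __init__(self, device_available: bool, check_on_init: bool = False) -> None:
        self._device_available = device_available
        if check_on_init:
            # Constructor check: fails as early as possible.
            self._assert_available()

    def _assert_available(self) -> None:
        if not self._device_available:
            raise DeviceUnavailableError("GPU requested but not available.")

    def setup_environment(self) -> None:
        # Deferred check: the object only expressed intent until now.
        self._assert_available()


# Deferred: construction succeeds on a host without the device,
# so other methods can be tested without mocking availability.
deferred = SketchGPUAccelerator(device_available=False)

# Eager: the same host fails immediately at construction.
try:
    SketchGPUAccelerator(device_available=False, check_on_init=True)
except DeviceUnavailableError as err:
    print(f"constructor check failed early: {err}")
```

In the deferred variant the error surfaces only when `deferred.setup_environment()` is called, which mirrors the `torch.device("cuda")` behavior described above.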

Originally asked by @rohitgr7 in #11797 (comment)

cc @tchaton @justusschock @awaelchli @Borda @akihironitta @rohitgr7 @four4fish

Labels: accelerator, design, won't fix
