
[RFC] Add Accelerator.is_available() interface requirement #11818

@ananthsub

Description

🚀 Feature

Add an is_available() interface requirement to the Accelerator API, so that every accelerator implementation can report whether its corresponding hardware can be used.

Motivation

Such functionality on the Accelerator abstraction would:

  • Enable automatic hardware selection without duplicating code across Trainer & individual accelerator implementations.
  • Simplify the accelerator connector logic and its ongoing rewrite: Rewrite accelerator_connector #11448
  • Enable automatic runtime checking of hardware availability during execution
  • Provide consistency with how the Trainer auto-detects the cluster environments natively supported by the framework. The corollary here is ClusterEnvironment.detect (see the permalinks and sketch below):

https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/plugins/environments/cluster_environment.py#L43-L46

https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/trainer/connectors/accelerator_connector.py#L810-L812
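For reference, a paraphrased sketch of how these two linked pieces fit together, assuming the environment classes and the static detect() hook as they exist at the linked commit (this is not a verbatim copy of the source):

from pytorch_lightning.plugins.environments import (
    ClusterEnvironment,
    KubeflowEnvironment,
    LightningEnvironment,
    LSFEnvironment,
    SLURMEnvironment,
    TorchElasticEnvironment,
)

def select_cluster_environment() -> ClusterEnvironment:
    # Each environment exposes a static detect() -> bool hook (first permalink);
    # the connector picks the first one that matches (second permalink) and
    # falls back to the default LightningEnvironment.
    for env_cls in (SLURMEnvironment, TorchElasticEnvironment, KubeflowEnvironment, LSFEnvironment):
        if env_cls.detect():
            return env_cls()
    return LightningEnvironment()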

Pitch

from abc import ABC, abstractmethod

import torch


class Accelerator(ABC):

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Detect if the hardware is available."""

    def setup_environment(self, root_device: torch.device) -> None:
        """Set up any processes or distributed connections.

        This is called before the LightningModule/DataModule setup hook, which allows
        the user to access the accelerator environment before setup is complete.

        Raises:
            RuntimeError:
                If the corresponding hardware is not found.
        """
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")


class CPUAccelerator(Accelerator):

    @staticmethod
    def is_available() -> bool:
        """CPU is always available for execution."""
        return True


class GPUAccelerator(Accelerator):

    @staticmethod
    def is_available() -> bool:
        return torch.cuda.is_available() and torch.cuda.device_count() > 0

and so on for the remaining accelerators (TPUAccelerator, IPUAccelerator, etc.).
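For example, a TPU check could follow the same pattern. The package-presence test below is only an illustrative stand-in; a real implementation would query the XLA runtime for usable devices:

import importlib.util

class TPUAccelerator(Accelerator):

    @staticmethod
    def is_available() -> bool:
        # Illustrative proxy: the torch_xla package being importable.
        # The real check would ask the XLA runtime for actual TPU devices.
        return importlib.util.find_spec("torch_xla") is not None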

See #11797 for a more detailed implementation of what this looks like in practice.

To support Trainer(accelerator="auto"), this is what the selection logic simplifies to:

for acc_cls in (GPUAccelerator, TPUAccelerator, IPUAccelerator, CPUAccelerator):
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator()  # fallback to CPU
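From the user's side, the proposed auto option would then be exercised as follows (shown for context):

from pytorch_lightning import Trainer

# "auto" picks the first accelerator whose is_available() returns True.
trainer = Trainer(accelerator="auto")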

This could be simplified even further if we offered an AcceleratorRegistry, so that the Trainer/AcceleratorConnector didn't need to hardcode the list of accelerators to detect:

for acc_cls in AcceleratorRegistry.impls:
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator()  # fallback to CPU
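A minimal sketch of what such a registry could look like; AcceleratorRegistry, register, and impls are hypothetical names here, not existing API:

from typing import List, Type

class AcceleratorRegistry:
    """Hypothetical registry of accelerator implementations, in detection-priority order."""

    impls: List[Type[Accelerator]] = []

    @classmethod
    def register(cls, acc_cls: Type[Accelerator]) -> Type[Accelerator]:
        """Append an implementation; usable as a class decorator."""
        cls.impls.append(acc_cls)
        return acc_cls

# Registration order doubles as auto-detection priority; TPU/IPU would register likewise.
AcceleratorRegistry.register(GPUAccelerator)
AcceleratorRegistry.register(CPUAccelerator)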

Alternatives

Some other alternatives exist here:
#11799
#11798

Issues with these approaches:

  • They are also breaking changes: simply instantiating the accelerator could raise a runtime error if the device isn't available (see the sketch after this list).
  • The bigger issue to me is that they do not ease support for Trainer(accelerator="auto"): the accelerator connector still needs to hardcode and re-implement each of the device checks just to determine which Accelerator to instantiate.
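To illustrate the first point, a sketch of the eager-validation alternative; this constructor check is illustrative, not the actual code from those PRs:

import torch

class EagerGPUAccelerator:
    def __init__(self) -> None:
        # Validating in the constructor means that on a CPU-only machine even
        # constructing the object raises, so the connector cannot instantiate
        # it just to ask whether its hardware exists.
        if not torch.cuda.is_available():
            raise RuntimeError("GPU accelerator requires a CUDA-capable device.")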

Additional context



cc @Borda @tchaton @justusschock @awaelchli @akihironitta @rohitgr7

Labels: accelerator, breaking change, design, feature
