🚀 Feature
Add an abstract static is_available() method to the Accelerator interface so that each accelerator implementation can report whether its hardware can be used on the current machine.
Motivation
Such functionality on the Accelerator abstraction would:
- Enable automatic hardware selection without duplicating code across the Trainer & individual accelerator implementations.
- Simplify the accelerator connector logic and the rewrite effort in Rewrite accelerator_connector #11448.
- Enable automatic runtime checking of hardware availability during execution.
- Provide consistency with how the Trainer auto-detects the cluster environments natively supported by the framework. The corollary here is ClusterEnvironment.detect (sketched below).
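For reference, a rough sketch of that existing detection pattern (a simplified illustration assuming the detect() static methods on the environment plugins referenced above; not the actual connector code):
```python
# Simplified illustration of the ClusterEnvironment.detect pattern mentioned above;
# not the actual AcceleratorConnector implementation.
from pytorch_lightning.plugins.environments import (
    LightningEnvironment,
    SLURMEnvironment,
    TorchElasticEnvironment,
)


def _select_cluster_environment():
    for env_cls in (SLURMEnvironment, TorchElasticEnvironment):
        if env_cls.detect():
            return env_cls()
    return LightningEnvironment()  # fallback when no cluster is detected
```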
Pitch
```python
from abc import ABC, abstractmethod

import torch


class Accelerator(ABC):
    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Detect if the hardware is available."""

    def setup_environment(self, root_device: torch.device) -> None:
        """Set up any processes or distributed connections.

        This is called before the LightningModule/DataModule setup hook, which allows the user to access the
        accelerator environment before setup is complete.

        Raises:
            RuntimeError:
                If the corresponding hardware is not found.
        """
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")


class CPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        """CPU is always available for execution."""
        return True


class GPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return torch.cuda.is_available() and torch.cuda.device_count() > 0
```
...and so on for the other accelerator implementations.
See #11797 for a more detailed implementation of what this looks like in practice.
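As a quick illustration of how these hooks would behave, here is a minimal usage sketch (hypothetical, assuming the classes above; output shown for a machine without CUDA devices):
```python
# Minimal usage sketch, assuming the Accelerator classes defined above.
accelerator = GPUAccelerator()

print(CPUAccelerator.is_available())  # always True
print(GPUAccelerator.is_available())  # False on a machine without CUDA devices

try:
    # setup_environment() enforces the availability check at runtime
    accelerator.setup_environment(root_device=torch.device("cuda", 0))
except RuntimeError as err:
    print(err)  # GPUAccelerator is not configured to run on this hardware.
```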
To support Trainer(accelerator="auto"), this is what the logic simplifies to:
```python
for acc_cls in (GPUAccelerator, TPUAccelerator, IPUAccelerator, CPUAccelerator):
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator()  # fallback to CPU
```
This could be even further simplified if we offered an AcceleratorRegistry, such that the Trainer/AcceleratorConnector didn't need to hardcode the list of accelerators to detect:
```python
for acc_cls in AcceleratorRegistry.impls:
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator()  # fallback to CPU
```
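A minimal sketch of what such a registry could look like (AcceleratorRegistry, register, and auto_select_accelerator are hypothetical names used for illustration, not an existing API):
```python
from typing import List, Type


class AcceleratorRegistry:
    """Hypothetical registry sketch: implementations register themselves so the
    AcceleratorConnector can iterate over them without a hardcoded list."""

    # Kept in priority order: more specific hardware first, CPU as the fallback.
    impls: List[Type[Accelerator]] = []

    @classmethod
    def register(cls, acc_cls: Type[Accelerator]) -> Type[Accelerator]:
        cls.impls.append(acc_cls)
        return acc_cls


# Each implementation would register itself, e.g. via a decorator or at import time:
AcceleratorRegistry.register(GPUAccelerator)
AcceleratorRegistry.register(CPUAccelerator)


def auto_select_accelerator() -> Accelerator:
    """Return an instance of the first registered accelerator whose hardware is available."""
    for acc_cls in AcceleratorRegistry.impls:
        if acc_cls.is_available():
            return acc_cls()
    return CPUAccelerator()  # fallback to CPU
```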
Alternatives
Some other alternatives exist here:
#11799
#11798
Issues with these approaches:
- They are also breaking changes: simply instantiating the accelerator could raise a runtime error if the device isn't available.
- The bigger issue to me is that they do not ease support for Trainer(accelerator="auto"): the accelerator connector still needs to hardcode & re-implement each of the device checks just to determine which Accelerator to instantiate (roughly sketched below).
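For contrast, this is roughly the kind of hardcoded checking the connector has to keep re-implementing under those alternatives (a simplified illustration, not the actual AcceleratorConnector code):
```python
import torch


def _choose_accelerator_without_is_available() -> Accelerator:
    # Simplified illustration of the duplication described above; not the actual
    # AcceleratorConnector code. A real connector would repeat similar checks
    # for TPU, IPU, and any future accelerator.
    if torch.cuda.is_available() and torch.cuda.device_count() > 0:
        return GPUAccelerator()
    return CPUAccelerator()
```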
Additional context
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
- Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @Borda @tchaton @justusschock @awaelchli @akihironitta @rohitgr7