Proposed refactoring or deprecation
Item 5 of #10416 Accelerator and Plugin refactor, and part of #10417 Core Trainer Connectors.
Related to #10410 [RFC] Future of `gpus`/`ipus`/`tpu_cores` with respect to `devices`.
Motivation
The current flags and accelerator logic are confusing. Multiple accelerator flags partially overlap and interfere with each other.
There are 30 MisconfigurationExceptions in the accelerator connector; half of them are caused by duplicated flags interfering with each other.
Multiple flags with the same meaning don't add much value, but they cause confusion and make the accelerator_connector logic unnecessarily complicated.
For example:
- The `devices` flag mentioned in #10410 [RFC] Future of `gpus`/`ipus`/`tpu_cores` with respect to `devices`: with `gpus=2` and `devices=3`, `devices` will be ignored (see the sketch after this list).
- `accelerator` can have multiple meanings: it can be a device type string, or an `Accelerator()` instance that wraps the precision plugin and the training type plugin (TTP).
- `plugins` and `strategy` are duplicated. If the user specifies both, it is a misconfiguration, and we have to keep logic that handles both the `strategy` flag and the `plugins` flag. (There is `distributed_backend` too, and it is deprecated.)
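A minimal sketch of the kind of overlap described above, using the current (pre-refactor) Trainer flags; the concrete values are only illustrative:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# `gpus` and `devices` both describe the device count, so one of them silently
# wins: with `gpus=2` and `devices=3`, `devices` is ignored (as described above).
trainer = Trainer(gpus=2, devices=3, accelerator="gpu")

# `strategy` and `plugins` can both carry a training type plugin, so the
# accelerator connector has to detect and reject the duplicated configuration:
# Trainer(strategy="ddp", plugins=[DDPPlugin()])  # -> MisconfigurationException
```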
Also, with the increasing use of custom plugins, it's critical to have a more scalable solution. For example, the current enum for distributed modes does not scale to custom distributed plugins:
https://github.com/PyTorchLightning/pytorch-lightning/blob/db4e7700047519ff6e6365517d7e592c8ef023cb/pytorch_lightning/utilities/enums.py
Pitch
Every flag should have one and only one meaning, with no overlap between flags. This reduces the chance of user misconfiguration.
Deprecate the `num_processes`, `tpu_cores`, `ipus`, `gpus`, and `plugins` flags.
Keep these options:
- device count (`devices`): how many devices the user wants to use
- device type (`accelerator`): cpu/gpu/tpu etc.; we use this to choose the Accelerator
- `strategy`: which training type plugin (TTP) to use
Stricter typing:
- `devices`: `Optional[int]`; `None` means auto
- `accelerator`: `Optional[str]`; `None` means auto; remove the `Accelerator()` type
- `strategy`: `Optional[Union[str, TrainingTypePlugin]]`; RFC: should we support both `DDPPlugin()` and `"ddp"`, or just one?
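A rough sketch of what the slimmed-down, non-overlapping signature could look like under this proposal; the flag names follow the `devices` / `accelerator` / `strategy` reading of the lists above, so this is illustrative rather than the final API:

```python
from typing import Optional, Union

from pytorch_lightning.plugins import TrainingTypePlugin


class Trainer:
    def __init__(
        self,
        devices: Optional[int] = None,      # None -> auto-select the device count
        accelerator: Optional[str] = None,  # "cpu" / "gpu" / "tpu" / ...; None -> auto
        strategy: Optional[Union[str, TrainingTypePlugin]] = None,  # e.g. "ddp" or DDPPlugin()
    ) -> None:
        ...
```

With only these three flags, each Trainer argument maps to exactly one concept, which is the property the pitch asks for.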
Reduce unnecessary internal wrappers.
Remove the dependency on enums; use `TrainingTypePluginsRegistry` names instead, which work for both built-in plugins and custom plugins (see the sketch below).
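A hypothetical, minimal registry sketch (not the actual `TrainingTypePluginsRegistry` implementation) showing the idea: strategy names map to plugin classes, so custom plugins register the same way as built-in ones and no enum needs to be extended.

```python
from typing import Callable, Dict, Type

# Hypothetical minimal registry: strategy names map to plugin classes.
_STRATEGY_REGISTRY: Dict[str, Type] = {}


def register_strategy(name: str) -> Callable[[Type], Type]:
    """Register a training type plugin class under a string name."""
    def wrapper(cls: Type) -> Type:
        _STRATEGY_REGISTRY[name] = cls
        return cls
    return wrapper


@register_strategy("ddp")
class DDPPluginStub:
    """Stands in for the built-in DDPPlugin."""


@register_strategy("my_custom_ddp")
class MyCustomDDPPlugin(DDPPluginStub):
    """A user-defined plugin registered under its own name; no enum change needed."""


def resolve_strategy(name: str):
    # The accelerator connector can look up any registered name instead of
    # checking membership in a DistributedType enum.
    return _STRATEGY_REGISTRY[name]()


assert isinstance(resolve_strategy("my_custom_ddp"), DDPPluginStub)
```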
Additional context
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
- Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.