
[RFC] Simplifying the Accelerator Connector logic and flags  #10422

@four4fish

Proposed refactoring or deprecation

Item 5 of #10416 (Accelerator and Plugin refactor) and part of #10417 (Core Trainer Connectors).
Related to #10410 ([RFC] Future of gpus/ipus/tpu_cores with respect to devices).

Motivation

The current flags and accelerator logic are confusing. Multiple accelerator flags partially overlap and interfere with each other.

There are 30 MisconfigurationExceptions in the accelerator connector, and half of them are caused by duplicated flags interfering with each other.

Multiple flags with the same meaning don't add much value, but they cause confusion and make the accelerator_connector logic unnecessarily complicated.

For example (a combined sketch of these conflicts follows the list):

  1. The device-count flags mentioned in #10410 ([RFC] Future of gpus/ipus/tpu_cores with respect to devices): with gpus=2, devices=3, the devices flag is ignored, or:

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L867-L869

  2. The accelerator flag can have multiple meanings: it can be a device-type string, or an Accelerator() instance that wraps the precision and training type plugins.

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L784-L791

  3. The plugins and strategy flags are duplicated. If the user specifies both, a MisconfigurationException is raised, and we have to keep logic that handles both the strategy flag and the plugins flag. (There is distributed_backend too, and it is already deprecated.)

    https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L317-L322
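
A combined sketch of the overlapping combinations above (illustrative only; the exact warnings and error messages depend on the Lightning version, and the calls are not meant to be run as-is):

from pytorch_lightning import Trainer

# 1) gpus and devices overlap: when both are set, devices is ignored.
Trainer(gpus=2, devices=3)

# 2) accelerator is overloaded: it may be a device-type string ...
Trainer(accelerator="gpu", gpus=2)
# ... or a full Accelerator() instance wrapping the precision and training type plugins.

# 3) strategy and plugins can express the same training type plugin;
#    passing both raises a MisconfigurationException.
Trainer(strategy="ddp_spawn", plugins="ddp_spawn")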

Also, with the increasing number of custom plugin use cases, it is critical to have a more scalable solution. For example, the current enum for distributed types does not scale to custom distributed plugins:
https://github.com/PyTorchLightning/pytorch-lightning/blob/db4e7700047519ff6e6365517d7e592c8ef023cb/pytorch_lightning/utilities/enums.py
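
To illustrate why an enum does not scale: any strategy name that is not a predefined member raises a ValueError, so custom plugins cannot be selected by name. The class below is a simplified stand-in, not the actual Lightning enum:

from enum import Enum

class DistributedType(str, Enum):  # simplified stand-in for the linked enum
    DDP = "ddp"
    DDP_SPAWN = "ddp_spawn"

DistributedType("ddp")                # built-in strategies resolve fine
try:
    DistributedType("my_custom_ddp")  # custom strategies are not members
except ValueError as err:
    print(err)                        # 'my_custom_ddp' is not a valid DistributedType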

Pitch

Every flag should have one and only one meaning, with no overlap between flags, to reduce the possibility of user misconfiguration.
Deprecate the num_processes, tpu_cores, ipus, gpus, and plugins flags.
Keep the following options:

devices_numbers (devices):     # how many devices the user wants to use
devices_type (accelerator):    # cpu/gpu/tpu/etc.; used to choose the Accelerator
strategy:                      # which training type plugin (TTP) to use
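
With these options, a Trainer call would look roughly like the sketch below (the flag names follow the pitch above and are still up for discussion):

from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator="gpu",  # device type: cpu / gpu / tpu / ... (None means auto)
    devices=4,          # how many devices to use (None means auto)
    strategy="ddp",     # which training type plugin to use
)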

Stricter typing:

devices_numbers: Optional[int]                        # None means auto
devices_type (accelerator): Optional[str]             # None means auto; remove the Accelerator() type
strategy: Optional[Union[str, TrainingTypePlugin]]    # RFC: should we support both DDPPlugin() and 'ddp', or just one?
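
A hypothetical signature sketch with these types (the function name and defaults are illustrative only):

from typing import Optional, Union

from pytorch_lightning.plugins import TrainingTypePlugin

def _configure_sketch(
    devices: Optional[int] = None,                              # None means auto
    accelerator: Optional[str] = None,                          # None means auto; no Accelerator() instances
    strategy: Optional[Union[str, TrainingTypePlugin]] = None,  # e.g. "ddp" or DDPPlugin()
) -> None:
    ...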

Reduce unnecessary internal wrappers.
Remove the dependence on enums and use TrainingTypePluginsRegistry names instead, which work for both built-in plugins and custom plugins.
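
A minimal sketch of the registry idea (this is not Lightning's actual TrainingTypePluginsRegistry API, only an illustration of why name-based registration scales to custom plugins while an enum does not):

_REGISTRY = {}

def register(name):
    def wrapper(cls):
        _REGISTRY[name] = cls   # built-in and custom plugins register the same way
        return cls
    return wrapper

@register("my_custom_ddp")
class MyCustomDDP:
    ...

plugin_cls = _REGISTRY["my_custom_ddp"]   # resolved by name, no enum member required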

Additional context


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.
