
Rewrite Accelerator_connector and follow up tasks #11449

@four4fish

Proposed refactor

We have been discussing this for a while, and there are several existing issues related to this topic.

Motivation

  • Move towards a stable Strategy version
  • The current logic is unclear and hard to maintain
  • The rewrite enables many simplifications

Pitch

The new logic can be divided into three parts (details in the PR):

Part 1: Check for misconfigurations set by the user (conflicting flags, duplicated flags) and set the final flags.

Part 2: Choose the Strategy, Accelerator, Precision and cluster_environment, and set up the parallel devices.

Part 3: Initialize the Strategy and set up the Strategy's Accelerator, Precision, Checkpoint_IO, cluster environment and parallel devices (all require lazy initialization).
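
A rough sketch of how the three parts could be sequenced in the connector (method names are illustrative, not the final API):

    # Hypothetical skeleton of the three-part flow described above.
    class AcceleratorConnector:
        def __init__(self, accelerator=None, strategy=None, precision=32, plugins=None):
            # Part 1: validate user flags (conflicts, duplicates) and settle the final values.
            self._check_config_and_set_final_flags(accelerator, strategy, precision, plugins)
            # Part 2: choose strategy, accelerator, precision and cluster environment,
            # and set up the parallel devices.
            self._choose_strategy_accelerator_precision_and_cluster_env()
            # Part 3: lazily initialize the strategy and attach its components.
            self._lazy_init_strategy()

        def _check_config_and_set_final_flags(self, accelerator, strategy, precision, plugins):
            ...  # raise MisconfigurationException on conflicting or duplicated flags

        def _choose_strategy_accelerator_precision_and_cluster_env(self):
            ...  # resolve the final Strategy/Accelerator/Precision/ClusterEnvironment

        def _lazy_init_strategy(self):
            ...  # instantiate the Strategy and attach accelerator, precision, checkpoint IO, etc.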

Follow-up items from #11448

  1. Move error messages into the PrecisionPlugin, Strategy and Accelerator init methods where possible.
    e.g. move this check to the IPUPrecisionPlugin (from @carmocca):

    if self._precision_flag not in (16, 32):
        raise MisconfigurationException(
            f"`Trainer(accelerator='ipu', precision={self._precision_flag!r})` is not supported."
        )
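
A sketch of what that check could look like once it lives in the plugin's constructor (the exact constructor signature is an assumption, not the final API):

    # Sketch: validate precision when the plugin is constructed, instead of in the connector.
    from pytorch_lightning.plugins import PrecisionPlugin
    from pytorch_lightning.utilities.exceptions import MisconfigurationException

    class IPUPrecisionPlugin(PrecisionPlugin):
        def __init__(self, precision: int) -> None:  # signature is an assumption
            if precision not in (16, 32):
                raise MisconfigurationException(
                    f"`Trainer(accelerator='ipu', precision={precision!r})` is not supported."
                )
            super().__init__()
            self.precision = precision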

and move this check to the Strategy (from @ananthsub):

    if self._precision_flag in (16, "bf16") and self._amp_type_flag == AMPType.APEX:
        if isinstance(self.strategy, (DDPShardedStrategy, DDPSpawnShardedStrategy, DDPFullyShardedStrategy)):
            raise MisconfigurationException(
                "Sharded plugins are not supported with apex, please switch to `amp_backend='native'`."
            )
  2. Add typing to accelerator_connector. Can we do this as a separate PR after the unused-properties deprecation? (from @kaushikb11 @awaelchli @ananthsub)

  3. Reduce duplicated strategy registry code: classmethod inheritance doesn't work with the current strategy registry logic, because `cls` is the base class rather than the subclass. To remove the duplicated register_strategies methods, we need to redo the strategy registry logic. (@kaushikb11 @awaelchli @tchaton)
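
One possible direction: drive registration from subclass creation, where `cls` is always the concrete class. A minimal sketch with a plain-dict registry (the real StrategyRegistry carries more metadata):

    # Sketch: __init_subclass__ sees the concrete subclass, so every strategy
    # registers itself without re-implementing a register_strategies classmethod.
    StrategyRegistry: dict = {}

    class Strategy:
        strategy_name: str = ""

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            if cls.strategy_name:
                StrategyRegistry[cls.strategy_name] = cls

    class DDPStrategy(Strategy):
        strategy_name = "ddp"

    assert StrategyRegistry["ddp"] is DDPStrategy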

  4. Revisit the flag-conflict and fallback logic:
    - different flags set to the same value: should be an error (from @tchaton)
    - dp/ddp2 on CPU falling back to ddp: should be an error instead of a silent fallback (from @ananthsub)
    - [RFC] handle cluster_env and checkpoint_io being set both in the strategy and in plugins, e.g. strategy=DDPPlugin(cluster_env=LightningEnv()), plugins=[TorchelasticEnv()]
    - check that the plugins flag contains at most one instance of each type (from @tchaton); see the sketch after this list
    - DDP is now the default for 1 GPU multi-node; why not fall back to ddp_spawn for all? (from @tchaton)
    - add/revisit warnings for the fallback logic
    - Is Apex supported with sharded methods? Should we remove `self._precision_flag in (16, "bf16")` from the "Sharded plugins are not supported with apex, please switch to `amp_backend='native'`." check? (from @tchaton)
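
A minimal sketch of the at-most-one-per-type validation, grouping by the plugin base classes so that e.g. two different ClusterEnvironment subclasses still count as duplicates:

    # Sketch: validate that the `plugins` flag carries at most one instance per category.
    from pytorch_lightning.plugins import CheckpointIO, PrecisionPlugin
    from pytorch_lightning.plugins.environments import ClusterEnvironment
    from pytorch_lightning.utilities.exceptions import MisconfigurationException

    def _validate_unique_plugin_types(plugins):
        for base in (PrecisionPlugin, ClusterEnvironment, CheckpointIO):
            found = [p for p in plugins if isinstance(p, base)]
            if len(found) > 1:
                raise MisconfigurationException(
                    f"Received multiple instances of `{base.__name__}` in `plugins`: {found}."
                    " Expected at most one."
                )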

  5. Move the _IS_INTERACTIVE check into the Strategy.
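
For example, each strategy could advertise its own interactive compatibility instead of the connector hard-coding a list (the property name here is hypothetical):

    # Sketch: strategies declare whether they can run in an interactive environment
    # such as a notebook; `is_interactive_compatible` is a hypothetical property name.
    import sys

    _IS_INTERACTIVE = hasattr(sys, "ps1")  # rough interactive-session detection

    class Strategy:
        is_interactive_compatible: bool = False  # spawn-free strategies would override this

    def _check_interactive_compatibility(strategy: Strategy) -> None:
        if _IS_INTERACTIVE and not strategy.is_interactive_compatible:
            raise RuntimeError(
                f"`{type(strategy).__name__}` is not compatible with an interactive environment."
            )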

  6. Loosen the check for "The TPUAccelerator can only be used with a SingleTPUStrategy or TPUSpawnStrategy" (from @ananthsub; not required, nice to have).

  7. Improve error messages:

    • Replace "You can only specify one strategy to the Trainer. You have passed Trainer(strategy={strategy}) but you have also passed {accelerator} in Trainer(accelerator={accelerator})" with something like "accelerator set through both the strategy class and the accelerator flag, choose one" (from @ananthsub)
    • "You passed Trainer(accelerator='cpu', precision=16, amp_type='apex') but apex AMP is not supported on CPU." It is worth mentioning that this works with bfloat16 and the native backend. (from @tchaton)
  8. Enable the accelerator.is_available() check.
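
A sketch of the enabled check, assuming each Accelerator exposes an is_available() method:

    # Sketch: fail fast when the requested accelerator is unusable on this machine.
    from pytorch_lightning.utilities.exceptions import MisconfigurationException

    def _check_accelerator_availability(accelerator) -> None:
        if not accelerator.is_available():
            raise MisconfigurationException(
                f"`{type(accelerator).__name__}` is not available on this machine."
            )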

  9. Address all the TODOs in accelerator_connector:

    • deprecate unused properties
  10. (HIGH PRIORITY) Re-introduce the _init_deterministic method on the AcceleratorConnector and set the value for deterministic.
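
A minimal sketch of the re-introduced method, assuming the flag maps onto torch's global deterministic-algorithms switch as before:

    import os
    import torch

    def _init_deterministic(self, deterministic: bool) -> None:
        # Record the flag and toggle deterministic algorithms globally.
        self.deterministic = deterministic
        torch.use_deterministic_algorithms(deterministic)  # torch >= 1.8
        if deterministic:
            # Needed so cuBLAS behaves deterministically on CUDA >= 10.2.
            os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"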

Additional context

Improvements and potential improvements:

  • Enums could be deprecated: _StrategyType, _AcceleratorType, _distrib_type, _device_type and distributed_backend are not needed in the new version
  • Revisit the strategy registry logic: currently half of the string names are registered and the other half live in _StrategyType; we could consolidate them
  • Further lazy initialization of the parallel Strategy classes: parallel devices need to be lazily initialized
  • Revisit flag priorities (part 1), the choosing logic (part 2) and the associated tests
  • Consolidate and revisit the device-parsing logic in utilities/devices, the Trainer and the individual Accelerator classes
  • Improve tests, increase coverage and remove unnecessary tests
  • Deprecate unused functions from accelerator_connector (kept for now for backward compatibility)

If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @justusschock @awaelchli @akihironitta @rohitgr7 @kaushikb11 @ninginthecloud @carmocca @ananthsub @tchaton
