Proposed refactor
We have been discussing this for a while, and there are several existing issues related to this topic.
Motivation
- Moving towards a stable strategy version
- The current logic is unclear and hard to maintain
- There are a lot of simplifications we can do after the rewrite
Pitch
The new logic can be divided into 3 parts (details in the PR):
- Part 1: Check the configuration set by the user for misconfiguration (conflicts between flags, duplication between flags) and set the final flags.
- Part 2: Choose the Strategy, Accelerator, Precision and cluster environment, and set up the parallel devices.
- Part 3: Initialize the Strategy and set up the Strategy's Accelerator, Precision, CheckpointIO, cluster environment and parallel devices (all require lazy initialization).
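A minimal sketch of how these three parts could be laid out in the connector (method and argument names here are illustrative only, not the actual implementation in the PR):

```python
# Illustrative sketch only: names are hypothetical, not the PR's actual implementation.
class AcceleratorConnector:
    def __init__(self, accelerator=None, strategy=None, precision=32, plugins=None, devices=None, num_nodes=1):
        # Part 1: validate the user-facing flags (conflicts, duplication) and settle the final values
        self._check_config_and_set_final_flags(accelerator, strategy, precision, plugins)
        # Part 2: choose the accelerator, strategy, precision and cluster environment; set parallel devices
        self._choose_and_set_components(devices, num_nodes)
        # Part 3: lazily initialize the strategy and attach its components
        self._lazy_init_strategy()

    def _check_config_and_set_final_flags(self, accelerator, strategy, precision, plugins):
        ...

    def _choose_and_set_components(self, devices, num_nodes):
        ...

    def _lazy_init_strategy(self):
        ...
```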
Follow-up items from #11448
- Move error messages to the PrecisionPlugin, Strategy and Accelerator init methods where possible.
  E.g. move this check to the IPUPrecisionPlugin (from @carmocca):

  ```python
  if self._precision_flag not in (16, 32):
      raise MisconfigurationException(
          f"`Trainer(accelerator='ipu', precision={self._precision_flag!r})` is not supported."
      )
  ```

  and move this check to the strategy (from @ananthsub):

  ```python
  if self._precision_flag in (16, "bf16") and self._amp_type_flag == AMPType.APEX:
      if isinstance(self.strategy, (DDPShardedStrategy, DDPSpawnShardedStrategy, DDPFullyShardedStrategy)):
          raise MisconfigurationException(
              "Sharded plugins are not supported with apex, please switch to `amp_backend='native'`."
          )
  ```
- Add typing to the accelerator_connector. Can we do this as a separate PR after the unused properties are deprecated? (from @kaushikb11 @awaelchli @ananthsub)
- Reduce duplicated strategy registry code: classmethod inheritance doesn't work with the current strategy registry logic, because cls resolves to the base class rather than the subclass. To reduce the duplicated register_strategies methods, we need to redo the strategy registry logic (see the sketch below). @kaushikb11 @awaelchli @tchaton
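  One possible direction, purely illustrative and not an agreed design: let each subclass register itself via __init_subclass__, so the registered cls is always the concrete subclass.

  ```python
  from typing import Any, Callable, Dict

  # Hypothetical sketch of a registry that cooperates with inheritance.
  _STRATEGY_REGISTRY: Dict[str, Callable[..., Any]] = {}


  class Strategy:
      def __init_subclass__(cls, **kwargs: Any) -> None:
          super().__init_subclass__(**kwargs)
          # `cls` is the concrete subclass here, so every subclass that defines
          # a `strategy_name` registers itself automatically.
          name = getattr(cls, "strategy_name", None)
          if name is not None:
              _STRATEGY_REGISTRY[name] = cls


  class DDPStrategy(Strategy):
      strategy_name = "ddp"


  class DDPSpawnStrategy(DDPStrategy):
      strategy_name = "ddp_spawn"


  assert _STRATEGY_REGISTRY["ddp_spawn"] is DDPSpawnStrategy
  ```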
- Flag conflict and fallback logic revisit:
  - Different flags set to the same thing should be an error (from @tchaton)
  - dp/ddp2 on CPU falling back to ddp should be an error instead of a silent fallback (from @ananthsub)
  - [RFC] Handle cluster_env and checkpoint_io set in both the strategy and the plugins flag, e.g. strategy=DDPPlugin(cluster_env=LightningEnv()), plugins=[TorchelasticEnv()]
  - Check that there is at most one instance of each plugin type in the plugins flag (from @tchaton); see the sketch after this list
  - Now that DDP is the default with 1 GPU and multi-node, why not fall back to ddp_spawn for all? (from @tchaton)
  - Add/revisit warnings for the fallback logic
  - Is Apex supported with the sharded strategies? Should we remove self._precision_flag in (16, "bf16") from the "Sharded plugins are not supported with apex, please switch to `amp_backend='native'`." check? (from @tchaton)
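  A rough sketch of the "at most one plugin of each type" validation; the helper name and error text below are assumptions, not the actual implementation:

  ```python
  from collections import Counter

  from pytorch_lightning.plugins.environments import ClusterEnvironment
  from pytorch_lightning.plugins.io import CheckpointIO
  from pytorch_lightning.plugins.precision import PrecisionPlugin
  from pytorch_lightning.utilities.exceptions import MisconfigurationException


  def _validate_plugin_flag(plugins: list) -> None:
      # Count how many plugins of each supported base type were passed.
      counts = Counter()
      for plugin in plugins:
          for base in (PrecisionPlugin, ClusterEnvironment, CheckpointIO):
              if isinstance(plugin, base):
                  counts[base.__name__] += 1
      duplicated = sorted(name for name, count in counts.items() if count > 1)
      if duplicated:
          raise MisconfigurationException(
              f"Received multiple values for {duplicated} in the `plugins` flag. Expected at most one of each type."
          )
  ```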
- Move the _IS_INTERACTIVE check to the strategy
- Revisit the check "The TPUAccelerator can only be used with a SingleTPUStrategy or TPUSpawnStrategy" (from @ananthsub, not required, nice to have)
- Improve error messages:
  - "You can only specify one strategy to the Trainer." f"You have passed `Trainer(strategy={strategy})`" f" but you have also passed {accelerator} in `Trainer(accelerator={accelerator})`": say something like "accelerator set through both strategy class and accelerator flag, choose one" instead (from @ananthsub)
  - "You passed `Trainer(accelerator='cpu', precision=16, amp_type='apex')` but apex AMP not supported on CPU.": worth mentioning that this works with bfloat16 and the native backend (from @tchaton)
- "You can only specify one strategy to the Trainer." f"You have passed
- Enable the accelerator.is_available() check
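  For context, a rough idea of what such a check could look like; the helper name, message and call site below are assumptions:

  ```python
  from pytorch_lightning.utilities.exceptions import MisconfigurationException


  def _check_accelerator_availability(accelerator_cls, flag: str) -> None:
      # `is_available()` reports whether the hardware backing the accelerator
      # can actually be used in the current environment.
      if not accelerator_cls.is_available():
          raise MisconfigurationException(
              f"You requested accelerator={flag!r}, but it is not available on this machine."
          )
  ```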
- All the TODOs in the accelerator_connector:
  - Deprecate unused properties
- (HIGH PRIORITY) Re-introduce the _init_deterministic method on the AcceleratorConnector and set the value for deterministic.
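  As a reminder of roughly what that method did, a sketch based on PyTorch's deterministic-mode API (the surrounding attribute name is an assumption):

  ```python
  import os

  import torch


  def _init_deterministic(self, deterministic: bool) -> None:
      # Store the flag and forward it to PyTorch's deterministic-algorithms switch.
      self.deterministic = deterministic
      torch.use_deterministic_algorithms(deterministic)
      if deterministic:
          # Required so that cuBLAS picks deterministic kernels on CUDA >= 10.2.
          os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
  ```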
Additional context
Improvements and potential improvements:
- Enums could be deprecated: _StrategyType, _AcceleratorType, _distrib_type, _device_type and distributed_backend are not needed in the new version
- Strategy registry logic revisit: currently half of the string names are registered in the registry and the other half live in _StrategyType; we could consolidate them
- Further lazy initialization of the parallel Strategy classes: parallel devices need to be lazily initialized (see the sketch after this list)
- Revisit flag priorities (Part 1), the choosing logic (Part 2) and the associated tests
- Consolidate and revisit the device-parsing logic in utilities/devices, the Trainer and the XAccelerator classes
- Improve tests, increase coverage and remove unnecessary tests
- Deprecate unused functions from the accelerator_connector (kept for now for backward compatibility)
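On the lazy initialization point above, one illustrative shape for it (a standalone, hypothetical sketch, not the actual ParallelStrategy code): the strategy starts without parallel devices and the connector assigns them later, once the accelerator and device flags are resolved.

```python
from typing import List, Optional

import torch


class ParallelStrategySketch:
    """Hypothetical sketch of a strategy whose parallel devices are set lazily."""

    def __init__(self, parallel_devices: Optional[List[torch.device]] = None) -> None:
        self._parallel_devices = parallel_devices

    @property
    def parallel_devices(self) -> Optional[List[torch.device]]:
        return self._parallel_devices

    @parallel_devices.setter
    def parallel_devices(self, devices: Optional[List[torch.device]]) -> None:
        # The connector assigns the devices after flag resolution instead of
        # requiring them in the constructor.
        self._parallel_devices = devices
```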
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Lite: Enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
- Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @justusschock @awaelchli @akihironitta @rohitgr7 @kaushikb11 @ninginthecloud @carmocca @ananthsub @tchaton