🚀 Feature
Lightning Strategy stable version
Motivation
Once the refactoring issue is done, the main code structure will be stable. However, there are still correctness- and stability-related issues to address, some APIs to simplify, and some P0 features to support.
Pitch
Remaining tasks
- Finish the accelerator refactor; two steps remain:
  - Accelerator connector rewrite (in review)
    - Follow-ups: strategy selection and fallback logic, device/accelerator defaults, misconfiguration handling
  - Flatten the strategies inheritance (not started)
- Correctness and stability related issues
  - Revisit and unify device-related logic (Consolidate get_gpu_id functions to device utils #11427)
  - Stable lazy initialization in strategies (Make lazy initialization in plugins more robust #7650)
  - Generalize internal checks for precision plugin type, training type, accelerator type #10821
  - Fix Horovod bugs
  - (Other bug fixes)
- API simplifications and improvements
  - Deprecations: unused properties in the accelerator connector, enums, `trainer.X_method`
  - Align DDP/DDPSpawn process creation (Interface for Process Creation (DDPSpawn vs. DDP) #10985)
  - Precision API revisit: better AMP support, move misconfiguration/availability checks from the accelerator connector into the precision plugin's `__init__`
  - Strategies API revisit: add `is_distributed()`, move misconfiguration/availability checks into the strategy's `__init__`
  - Accelerator API revisit: add `teardown()` and device availability checks (in review)
  - Finish the collective refactor
- Feature support
  - No-op Strategy and Accelerator to support the open-source project TorchRec (a rough sketch of this direction follows this list)
  - Support Fairring; support initializing the process group outside Lightning
- Better engineering tasks
  - Improve typing
  - Improve unit tests
  - Docs and docstrings
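
The items above mention several API additions: `is_distributed()` on strategies, `teardown()` and availability checks on accelerators, and a no-op strategy for TorchRec or an externally initialized process group. As a rough illustration of that direction only, here is a minimal sketch; the class and method names (`Accelerator.is_available`, `Strategy.setup_environment`, `NoOpStrategy`) are hypothetical and not the finalized Lightning API.

```python
from abc import ABC, abstractmethod


class Accelerator(ABC):
    """Hardware abstraction: owns its own availability check and cleanup."""

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Device availability check, moved here from the accelerator connector."""

    def teardown(self) -> None:
        """Release hardware resources (e.g. clear caches); no-op by default."""


class CPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return True  # CPU is always present


class Strategy:
    """Owns process creation and collectives; validates its own configuration in __init__."""

    def __init__(self, accelerator: Accelerator) -> None:
        # Misconfiguration/availability checks live in the strategy and accelerator,
        # not in the accelerator connector.
        if not accelerator.is_available():
            raise RuntimeError(f"{type(accelerator).__name__} is not available on this machine")
        self.accelerator = accelerator

    @property
    def is_distributed(self) -> bool:
        return False

    def setup_environment(self) -> None:
        """Hook where a distributed strategy would create (or reuse) the process group."""

    def teardown(self) -> None:
        self.accelerator.teardown()


class NoOpStrategy(Strategy):
    """'Do nothing' strategy: process-group setup is owned by an external framework
    (e.g. TorchRec) or by a process group the user initialized before the Trainer."""

    @property
    def is_distributed(self) -> bool:
        return True  # processes may exist, but Lightning does not create or manage them

    def setup_environment(self) -> None:
        pass  # intentionally a no-op: the process group is initialized outside Lightning
```

The design choice this sketch tries to capture is that moving the checks into the strategy/accelerator constructors makes a misconfiguration fail at construction time, and a no-op strategy simply skips process-group setup so an external framework can own it.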
Alternatives
Additional context
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Lite: Enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
- Bolts: Pretrained SOTA deep learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.
cc @Borda @justusschock @kaushikb11 @awaelchli @ninginthecloud @akihironitta @rohitgr7 @carmocca @tchaton @ananthsub