
Commit 7ac8660

Merge branch 'master' into refactor/legacy-ddp
2 parents: 9c04996 + 7da931d

82 files changed (+1821 / -431 lines)

.azure-pipelines/gpu-tests.yml

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ jobs:
 python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
 pip install fairscale==0.4.0
 pip install deepspeed==0.5.7
+pip install bagua-cuda102==0.9.0
 pip install . --requirement requirements/devel.txt
 pip list
 displayName: 'Install dependencies'
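
As a quick, hypothetical sanity check for this CI change (not part of the commit), the pinned distributed-training backends can be imported right after the install step above; the assumption here is that ``bagua-cuda102`` provides the ``bagua`` package, as the docs below suggest.

.. code-block:: python

    # Hypothetical post-install check, assuming the CI step above succeeded.
    import bagua.torch_api  # noqa: F401  (installed via `pip install bagua-cuda102==0.9.0`)
    import deepspeed
    import fairscale

    print("deepspeed", deepspeed.__version__)
    print("fairscale", fairscale.__version__)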

CHANGELOG.md

Lines changed: 76 additions & 7 deletions
@@ -60,12 +60,21 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `console_kwargs` for `RichProgressBar` to initialize inner Console ([#10875](https://github.com/PyTorchLightning/pytorch-lightning/pull/10875))


+- Added support for shorthand notation to instantiate loggers with the `LightningCLI` ([#11533](https://github.com/PyTorchLightning/pytorch-lightning/pull/11533))
+
+
+- Added a `LOGGER_REGISTRY` instance to register custom loggers to the `LightningCLI` ([#11533](https://github.com/PyTorchLightning/pytorch-lightning/pull/11533))
+
+
 - Added a `PrecisionPlugin.teardown` method ([#10990](https://github.com/PyTorchLightning/pytorch-lightning/pull/10990))


 - Added `LightningModule.lr_scheduler_step` ([#10249](https://github.com/PyTorchLightning/pytorch-lightning/pull/10249))


+- Added support for no pre-fetching to `DataFetcher` ([#11606](https://github.com/PyTorchLightning/pytorch-lightning/pull/11606))
+
+
 - Added `opt_idx` to scheduler config if not assigned by user ([#11247](https://github.com/PyTorchLightning/pytorch-lightning/pull/11247))


@@ -74,9 +83,20 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Added a `MisconfigurationException` if user provided `opt_idx` in scheduler config doesn't match with actual optimizer index of its respective optimizer ([#11247](https://github.com/PyTorchLightning/pytorch-lightning/pull/11247))

+- Added support for DDP when using a `CombinedLoader` for the training data ([#11648](https://github.com/PyTorchLightning/pytorch-lightning/pull/11648))
+
+
+- Added a warning when using `DistributedSampler` during validation/testing ([#11479](https://github.com/PyTorchLightning/pytorch-lightning/pull/11479))
+
+
+- Added support for `Bagua` training strategy ([#11146](https://github.com/PyTorchLightning/pytorch-lightning/pull/11146))
+

 ### Changed

+- Implemented a new native and rich format in `_print_results` method of the `EvaluationLoop` ([#11332](https://github.com/PyTorchLightning/pytorch-lightning/pull/11332))
+
+
 - Set the `prog_bar` flag to False in `LightningModule.log_grad_norm` ([#11472](https://github.com/PyTorchLightning/pytorch-lightning/pull/11472))


@@ -162,9 +182,6 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed duplicated file extension when uploading model checkpoints with `NeptuneLogger` ([#11015](https://github.com/PyTorchLightning/pytorch-lightning/pull/11015))


-- Changed `LSFEnvironment` to use `LSB_DJOB_RANKFILE` environment variable instead of `LSB_HOSTS` for determining node rank and main address ([#10825](https://github.com/PyTorchLightning/pytorch-lightning/pull/10825))
-
-
 - Removed `__getstate__` and `__setstate__` of `RichProgressBar` ([#11100](https://github.com/PyTorchLightning/pytorch-lightning/pull/11100))


@@ -195,6 +212,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Marked the `ResultCollection`, `ResultMetric`, and `ResultMetricCollection` classes as protected ([#11130](https://github.com/PyTorchLightning/pytorch-lightning/pull/11130))


+- Marked `trainer.checkpoint_connector` as protected ([#11550](https://github.com/PyTorchLightning/pytorch-lightning/pull/11550))
+
+
 - The epoch start/end hooks are now called by the `FitLoop` instead of the `TrainingEpochLoop` ([#11201](https://github.com/PyTorchLightning/pytorch-lightning/pull/11201))


@@ -224,10 +244,18 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Changed `MisconfigurationException` to `ModuleNotFoundError` when `rich` isn't available ([#11360](https://github.com/PyTorchLightning/pytorch-lightning/pull/11360))

+- Inherit from `ABC` for `Accelerator`: Users need to implement `auto_device_count` ([#11521](https://github.com/PyTorchLightning/pytorch-lightning/pull/11521))
+
+
+- Changed `parallel_devices` property in `ParallelStrategy` to be lazy initialized ([#11572](https://github.com/PyTorchLightning/pytorch-lightning/pull/11572))
+

 - Sorted `SimpleProfiler(extended=False)` summary based on mean duration for each hook ([#11671](https://github.com/PyTorchLightning/pytorch-lightning/pull/11671))


+- Avoid enforcing `shuffle=False` for eval dataloaders ([#11575](https://github.com/PyTorchLightning/pytorch-lightning/pull/11575))
+
+
 ### Deprecated

 - Deprecated `ClusterEnvironment.master_{address,port}` in favor of `ClusterEnvironment.main_{address,port}` ([#10103](https://github.com/PyTorchLightning/pytorch-lightning/pull/10103))
@@ -287,6 +315,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Deprecated function `pytorch_lightning.callbacks.device_stats_monitor.prefix_metric_keys` ([#11254](https://github.com/PyTorchLightning/pytorch-lightning/pull/11254))


+- Deprecated `on_batch_start` and `on_batch_end` callback hooks in favor of `on_train_batch_start` and `on_train_batch_end` ([#11577](https://github.com/PyTorchLightning/pytorch-lightning/pull/11577))
+
+
+- Deprecated `on_configure_sharded_model` callback hook in favor of `setup` ([#11627](https://github.com/PyTorchLightning/pytorch-lightning/pull/11627))
+
+
 ### Removed

 - Removed deprecated parameter `method` in `pytorch_lightning.utilities.model_helpers.is_overridden` ([#10507](https://github.com/PyTorchLightning/pytorch-lightning/pull/10507))
@@ -454,6 +488,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed wrong typehint for `Trainer.lightning_optimizers` ([#11155](https://github.com/PyTorchLightning/pytorch-lightning/pull/11155))


+- Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy ([#11307](https://github.com/PyTorchLightning/pytorch-lightning/pull/11307))
+
+
 - Fixed bug where the path for "last" checkpoints was not getting saved correctly which caused newer runs to not remove the previous "last" checkpoint ([#11481](https://github.com/PyTorchLightning/pytorch-lightning/pull/11481))


@@ -469,21 +506,53 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed type promotion when tensors of higher category than float are logged ([#11401](https://github.com/PyTorchLightning/pytorch-lightning/pull/11401))


-- Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy ([#11307](https://github.com/PyTorchLightning/pytorch-lightning/pull/11307))
+- Fixed `SimpleProfiler` summary ([#11414](https://github.com/PyTorchLightning/pytorch-lightning/pull/11414))


-- Fixed `SimpleProfiler` summary ([#11414](https://github.com/PyTorchLightning/pytorch-lightning/pull/11414))
+- Fixed bug where progress bar was not being disabled when not in rank zero during predict ([#11377](https://github.com/PyTorchLightning/pytorch-lightning/pull/11377))
+
+
+- Fixed an issue to avoid val bar disappear after `trainer.validate()` ([#11700](https://github.com/PyTorchLightning/pytorch-lightning/pull/11700))
+
+
+- Fixed the mid-epoch warning call while resuming training ([#11556](https://github.com/PyTorchLightning/pytorch-lightning/pull/11556))
+
+
+- Fixed an issue in `RichProgressbar` to display the metrics logged only on main progress bar ([#11690](https://github.com/PyTorchLightning/pytorch-lightning/pull/11690))
+
+
+- Fixed `RichProgressBar` progress when refresh rate does not evenly divide the total counter ([#11668](https://github.com/PyTorchLightning/pytorch-lightning/pull/11668))
+

+- Fixed `RichProgressBar` progress validation bar total when using multiple validation runs within a single training epoch ([#11668](https://github.com/PyTorchLightning/pytorch-lightning/pull/11668))

-- Disbled sampler replacement when using `IterableDataset` ([#11507](https://github.com/PyTorchLightning/pytorch-lightning/pull/11507))

+- The `RichProgressBar` now correctly shows the `on_epoch` logged values on train epoch end ([#11689](https://github.com/PyTorchLightning/pytorch-lightning/pull/11689))

-- The Rich progress bar now correctly shows the `on_epoch` logged values on train epoch end ([#11689](https://github.com/PyTorchLightning/pytorch-lightning/pull/11689))
+
+- Fixed check for available modules ([#11526](https://github.com/PyTorchLightning/pytorch-lightning/pull/11526))


 - Fixed an issue to avoid validation loop run on restart ([#11552](https://github.com/PyTorchLightning/pytorch-lightning/pull/11552))


+- Fixed an issue to make the `step` argument in `WandbLogger.log_image` work ([#11716](https://github.com/PyTorchLightning/pytorch-lightning/pull/11716))
+
+
+## [1.5.9] - 2022-01-20
+
+### Fixed
+
+- Pinned sphinx-autodoc-typehints with <v1.15 ([#11400](https://github.com/PyTorchLightning/pytorch-lightning/pull/11400))
+- Skipped testing with PyTorch 1.7 and Python 3.9 on Ubuntu ([#11217](https://github.com/PyTorchLightning/pytorch-lightning/pull/11217))
+- Fixed type promotion when tensors of higher category than float are logged ([#11401](https://github.com/PyTorchLightning/pytorch-lightning/pull/11401))
+- Fixed the format of the configuration saved automatically by the CLI's `SaveConfigCallback` ([#11532](https://github.com/PyTorchLightning/pytorch-lightning/pull/11532))
+
+### Changed
+- Changed `LSFEnvironment` to use `LSB_DJOB_RANKFILE` environment variable instead of `LSB_HOSTS` for determining node rank and main address ([#10825](https://github.com/PyTorchLightning/pytorch-lightning/pull/10825))
+- Disabled sampler replacement when using `IterableDataset` ([#11507](https://github.com/PyTorchLightning/pytorch-lightning/pull/11507))
+
+
 ## [1.5.8] - 2022-01-05

 ### Fixed
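
For context on the deprecation entry for `on_batch_start`/`on_batch_end` ([#11577]), a minimal migration sketch follows. It is illustrative only and not part of the commit; the hook signatures are simplified, and extra positional arguments are absorbed defensively since they vary slightly across Lightning versions.

.. code-block:: python

    # Illustrative migration off the deprecated `on_batch_start`/`on_batch_end`
    # callback hooks onto the train-specific hooks (simplified signatures).
    import time

    import pytorch_lightning as pl


    class BatchTimer(pl.Callback):
        def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, *args):
            self._start = time.monotonic()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, *args):
            pl_module.log("batch_time_sec", time.monotonic() - self._start)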

docs/source/accelerators/gpu.rst

Lines changed: 115 additions & 0 deletions
@@ -282,6 +282,7 @@ Lightning allows multiple ways of training
 - DistributedDataParallel (``strategy='ddp_spawn'``) (multiple-gpus across many machines (spawn based)).
 - DistributedDataParallel 2 (``strategy='ddp2'``) (DP in a machine, DDP across machines).
 - Horovod (``strategy='horovod'``) (multi-machine, multi-gpu, configured at runtime)
+- Bagua (``strategy='bagua'``) (multiple-gpus across many machines with advanced training algorithms)
 - TPUs (``tpu_cores=8|x``) (tpu or TPU pod)

 .. note::
@@ -489,6 +490,120 @@ number of worker processes:
 See the official `Horovod documentation <https://horovod.readthedocs.io/en/stable>`_ for details
 on installation and performance tuning.

+
+Bagua
+^^^^^
+`Bagua <https://github.com/BaguaSys/bagua>`_ is a deep learning training acceleration framework which supports
+multiple advanced distributed training algorithms, including:
+
+- `Gradient AllReduce <https://tutorials.baguasys.com/algorithms/gradient-allreduce>`_ for centralized synchronous communication, where gradients are averaged among all workers.
+- `Decentralized SGD <https://tutorials.baguasys.com/algorithms/decentralized>`_ for decentralized synchronous communication, where each worker exchanges data with one or a few specific workers.
+- `ByteGrad <https://tutorials.baguasys.com/algorithms/bytegrad>`_ and `QAdam <https://tutorials.baguasys.com/algorithms/q-adam>`_ for low precision communication, where data is compressed into low precision before communication.
+- `Asynchronous Model Average <https://tutorials.baguasys.com/algorithms/async-model-average>`_ for asynchronous communication, where workers are not required to be synchronized in the same iteration in a lock-step style.
+
+By default, Bagua uses the *Gradient AllReduce* algorithm, which is also the algorithm implemented in Distributed Data Parallel and Horovod,
+but Bagua can usually produce a higher training throughput due to its backend written in Rust.
+
+.. code-block:: python
+
+    # train on 4 GPUs (using Bagua mode)
+    trainer = Trainer(strategy="bagua", accelerator="gpu", devices=4)
+
+
+By specifying the ``algorithm`` in the ``BaguaStrategy``, you can select more advanced training algorithms featured by Bagua:
+
+
+.. code-block:: python
+
+    # train on 4 GPUs, using the Bagua Gradient AllReduce algorithm
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="gradient_allreduce"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using the Bagua ByteGrad algorithm
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="bytegrad"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using Bagua Decentralized SGD
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="decentralized"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using Bagua Low Precision Decentralized SGD
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="low_precision_decentralized"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using the Asynchronous Model Average algorithm, with a synchronization interval of 100ms
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="async", sync_interval_ms=100),
+        accelerator="gpu",
+        devices=4,
+    )
+
+To use *QAdam*, we need to initialize
+`QAdamOptimizer <https://bagua.readthedocs.io/en/latest/autoapi/bagua/torch_api/algorithms/q_adam/index.html#bagua.torch_api.algorithms.q_adam.QAdamOptimizer>`_ first:
+
+.. code-block:: python
+
+    from pytorch_lightning.strategies import BaguaStrategy
+    from bagua.torch_api.algorithms.q_adam import QAdamOptimizer
+
+
+    class MyModel(pl.LightningModule):
+        ...
+
+        def configure_optimizers(self):
+            # initialize QAdam Optimizer
+            return QAdamOptimizer(self.parameters(), lr=0.05, warmup_steps=100)
+
+
+    model = MyModel()
+    trainer = Trainer(
+        accelerator="gpu",
+        devices=4,
+        strategy=BaguaStrategy(algorithm="qadam"),
+    )
+    trainer.fit(model)
+
+Bagua relies on its own `launcher <https://tutorials.baguasys.com/getting-started/#launch-job>`_ to schedule jobs.
+Below, find examples using ``bagua.distributed.launch``, which follows the ``torch.distributed.launch`` API:
+
+.. code-block:: bash
+
+    # start training with 8 GPUs on a single node
+    python -m bagua.distributed.launch --nproc_per_node=8 train.py
+
+If an SSH service with passwordless login is available on each node, you can launch the distributed job from a
+single node with ``baguarun``, which has a similar syntax to ``mpirun``. When starting the job, ``baguarun`` will
+automatically spawn new processes on each of the training nodes provided by the ``--host_list`` option, where each node
+is described as an IP address followed by an SSH port.
+
+.. code-block:: bash
+
+    # Run on node1 (or node2) to start training on two nodes (node1 and node2), 8 GPUs per node
+    baguarun --host_list hostname1:ssh_port1,hostname2:ssh_port2 --nproc_per_node=8 --master_port=port1 train.py
+
+
+.. note:: You can also start training in the same way as Distributed Data Parallel. However, system optimizations like
+    `Bagua-Net <https://tutorials.baguasys.com/more-optimizations/bagua-net>`_ and
+    `Performance autotuning <https://tutorials.baguasys.com/performance-autotuning/>`_ can only be enabled through the Bagua
+    launcher. It is worth noting that with ``Bagua-Net``, Distributed Data Parallel can also achieve
+    better performance without modifying the training script.
+
+
+See `Bagua Tutorials <https://tutorials.baguasys.com/>`_ for more details on installation and advanced features.
+

 DP/DDP2 caveats
 ^^^^^^^^^^^^^^^
 In DP and DDP2 each GPU within a machine sees a portion of a batch.
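
The documentation added above assumes Bagua is installed. As a small usage sketch (not part of the commit), a script can fall back to the stock DDP strategy when the optional ``bagua`` package is missing:

.. code-block:: python

    # Minimal sketch: pick the Bagua strategy only when the optional
    # `bagua` package is importable, otherwise fall back to plain DDP.
    import importlib.util

    from pytorch_lightning import Trainer

    strategy = "bagua" if importlib.util.find_spec("bagua") else "ddp"
    trainer = Trainer(strategy=strategy, accelerator="gpu", devices=4)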

docs/source/advanced/profiler.rst

Lines changed: 2 additions & 2 deletions
@@ -18,12 +18,12 @@ PyTorch Lightning supports profiling standard actions in the training loop out of the box
 
 - on_epoch_start
 - on_epoch_end
-- on_batch_start
+- on_train_batch_start
 - model_forward
 - model_backward
 - on_after_backward
 - optimizer_step
-- on_batch_end
+- on_train_batch_end
 - training_step_end
 - on_training_end
 - etc...
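
As a brief usage sketch (not part of the commit), the renamed actions appear in the profiler summary once profiling is enabled on the Trainer:

.. code-block:: python

    # Minimal sketch: enable the built-in SimpleProfiler so the renamed
    # `on_train_batch_start` / `on_train_batch_end` actions show up in the
    # summary printed after `trainer.fit(...)`.
    from pytorch_lightning import Trainer

    trainer = Trainer(profiler="simple", max_epochs=1)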

docs/source/api_references.rst

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ Strategy API
     :nosignatures:
     :template: classtemplate.rst
 
+    BaguaStrategy
     DDP2Strategy
     DDPFullyShardedStrategy
     DDPShardedStrategy