From 88c0262ecc7b5c09a2f8f68c60f3cd982f322883 Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Thu, 8 Apr 2021 12:34:02 +0100 Subject: [PATCH 01/10] Added advanced gpu section --- docs/source/advanced/multi_gpu.rst | 498 ------------------ docs/source/advanced/optimized_multi_gpu.rst | 513 +++++++++++++++++++ docs/source/index.rst | 1 + 3 files changed, 514 insertions(+), 498 deletions(-) create mode 100644 docs/source/advanced/optimized_multi_gpu.rst diff --git a/docs/source/advanced/multi_gpu.rst b/docs/source/advanced/multi_gpu.rst index 22a0f3bc8b64f..ab468af6830a4 100644 --- a/docs/source/advanced/multi_gpu.rst +++ b/docs/source/advanced/multi_gpu.rst @@ -596,504 +596,6 @@ If you need your own way to init PyTorch DDP you can override :meth:`pytorch_lig If you also need to use your own DDP implementation, override :meth:`pytorch_lightning.plugins.training_type.ddp.DDPPlugin.configure_ddp`. ----------- - -.. _model-parallelism: - -Model Parallelism [BETA] ------------------------- - -Model Parallelism tackles training large models on distributed systems, by modifying distributed communications and memory management of the model. -Unlike data parallelism, the model is partitioned in various ways across the GPUs, in most cases to reduce the memory overhead when training large models. -This is useful when dealing with large Transformer based models, or in environments where GPU memory is limited. - -Lightning currently offers the following methods to leverage model parallelism: - -- Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**) -- Sequential Model Parallelism with Checkpointing (partition your :class:`nn.Sequential ` module across multiple GPUs, leverage checkpointing and microbatching for further memory improvements and device utilization) - -.. _sharded: - -Sharded Training -^^^^^^^^^^^^^^^^ -Lightning integration of optimizer sharded training provided by `FairScale `_. -The technique can be found within `DeepSpeed ZeRO `_ and -`ZeRO-2 `_, -however the implementation is built from the ground up to be pytorch compatible and standalone. -Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect normal linear scaling, and significantly reduced memory usage when training large models. - -Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs. -This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients. - -The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication, -these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups. - -Below we use the `NeMo Transformer Lightning Language Modeling example `_ to benchmark the maximum batch size and model size that can be fit on 8 A100 GPUs for DDP vs Sharded Training. -Note that the benefits can still be obtained using 2 or more GPUs, and for even larger batch sizes you can scale to multiple nodes. - -**Increase Your Batch Size** - -Use Sharded Training to scale your batch size further using the same compute. This will reduce your overall epoch time. 
- -+----------------------+-----------------------+----------------+---------------------+ -| Distributed Training | Model Size (Millions) | Max Batch Size | Percentage Gain (%) | -+======================+=======================+================+=====================+ -| Native DDP | 930 | 32 | - | -+----------------------+-----------------------+----------------+---------------------+ -| Sharded DDP | 930 | **52** | **48%** | -+----------------------+-----------------------+----------------+---------------------+ - -**Increase Your Model Size** - -Use Sharded Training to scale your model size further using the same compute. - -+----------------------+------------+---------------------------+---------------------+ -| Distributed Training | Batch Size | Max Model Size (Millions) | Percentage Gain (%) | -+======================+============+===========================+=====================+ -| Native DDP | 32 | 930 | - | -+----------------------+------------+---------------------------+---------------------+ -| Sharded DDP | 32 | **1404** | **41%** | -+----------------------+------------+---------------------------+---------------------+ -| Native DDP | 8 | 1572 | - | -+----------------------+------------+---------------------------+---------------------+ -| Sharded DDP | 8 | **2872** | **59%** | -+----------------------+------------+---------------------------+---------------------+ - -It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models). -A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful. -Work within the future will bring optional sharding to activations and model parameters to reduce memory further, but come with a speed cost. - -To use Sharded Training, you need to first install FairScale using the command below. - -.. code-block:: bash - - pip install fairscale - - -.. code-block:: python - - # train using Sharded DDP - trainer = Trainer(accelerator='ddp', plugins='ddp_sharded') - -Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag. - -Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required. - ----------- - -.. _deep_speed: - -DeepSpeed -^^^^^^^^^ - -.. note:: - The DeepSpeed plugin is in beta and the API is subject to change. Please create an `issue `_ if you run into any issues. - -`DeepSpeed `_ is a deep learning training optimization library, providing the means to train massive billion parameter models at scale. -Using the DeepSpeed plugin, we were able to **train model sizes of 10 Billion parameters and above**, with a lot of useful information in this `benchmark `_ and the DeepSpeed `docs `_. -DeepSpeed also offers lower level training optimizations, and efficient optimizers such as `1-bit Adam `_. We recommend using DeepSpeed in environments where speed and memory optimizations are important (such as training large billion parameter models). - -To use DeepSpeed, you first need to install DeepSpeed using the commands below. - -.. 
code-block:: bash - - pip install deepspeed - -If you run into an issue with the install or later in training, ensure that the CUDA version of the pytorch you've installed matches your locally installed CUDA (you can see which one has been recognized by running ``nvcc --version``). - -.. note:: - Currently ``resume_from_checkpoint`` and manual optimization are not supported. - - DeepSpeed currently only supports single optimizer, single scheduler within the training loop. - -DeepSpeed ZeRO Stage 2 -"""""""""""""""""""""" - -By default, we enable `DeepSpeed ZeRO Stage 2 `_, which partitions your optimizer states (Stage 1) and your gradients (Stage 2) across your GPUs to reduce memory. In most cases, this is more efficient or at parity with DDP, primarily due to the optimized custom communications written by the DeepSpeed team. -As a result, benefits can also be seen on a single GPU. Do note that the default bucket sizes allocate around ``3.6GB`` of VRAM to use during distributed communications, which can be tweaked when instantiating the plugin described in a few sections below. - -.. note:: - To use ZeRO, you must use ``precision=16``. - -.. code-block:: python - - from pytorch_lightning import Trainer - - model = MyModel() - trainer = Trainer(gpus=4, plugins='deepspeed', precision=16) - trainer.fit(model) - - -DeepSpeed ZeRO Stage 2 Offload -"""""""""""""""""""""""""""""" - -Below we show an example of running `ZeRO-Offload `_. ZeRO-Offload leverages the host CPU to offload optimizer memory/computation, reducing the overall memory consumption. - -.. note:: - To use ZeRO-Offload, you must use ``precision=16``. - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(cpu_offload=True), precision=16) - trainer.fit(model) - - -This can also be done via the command line using a Pytorch Lightning script: - -.. code-block:: bash - - python train.py --plugins deepspeed --precision 16 --gpus 4 - - -You can also modify the ZeRO-Offload parameters via the plugin as below. - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(cpu_offload=True, allgather_bucket_size=5e8, reduce_bucket_size=5e8), precision=16) - trainer.fit(model) - - -.. note:: - We suggest tuning the ``allgather_bucket_size`` parameter and ``reduce_bucket_size`` parameter to find optimum parameters based on your model size. - These control how large a buffer we limit the model to using when reducing gradients/gathering updated parameters. Smaller values will result in less memory, but tradeoff with speed. - - DeepSpeed allocates a reduce buffer size `multiplied by 4.5x `_ so take that into consideration when tweaking the parameters. - - The plugin sets a reasonable default of ``2e8``, which should work for most low VRAM GPUs (less than ``7GB``), allocating roughly ``3.6GB`` of VRAM as buffer. Higher VRAM GPUs should aim for values around ``5e8``. - -For even more speed benefit, DeepSpeed offers an optimized CPU version of ADAM called `DeepSpeedCPUAdam `_ to run the offloaded computation, which is faster than the standard PyTorch implementation. - -.. 
code-block:: python - - import pytorch_lightning - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - from deepspeed.ops.adam import DeepSpeedCPUAdam - - class MyModel(pl.LightningModule): - ... - def configure_optimizers(self): - # DeepSpeedCPUAdam provides 5x to 7x speedup over torch.optim.adam(w) - return DeepSpeedCPUAdam(self.parameters()) - - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(cpu_offload=True), precision=16) - trainer.fit(model) - -DeepSpeed ZeRO Stage 3 -"""""""""""""""""""""" - -DeepSpeed ZeRO Stage 3 shards the optimizer states, gradients and the model parameters (also optionally activations). Sharding model parameters and activations comes with an increase in distributed communication, however allows you to scale your models massively from one GPU to multiple GPUs. -**The DeepSpeed team report the ability to fine-tune models with over 40B parameters on a single GPU and over 2 Trillion parameters on 512 GPUs.** For more information we suggest checking the `DeepSpeed ZeRO-3 Offload documentation `__. - -We've ran benchmarks for all these features and given a simple example of how all these features work in Lightning, which you can see at `minGPT `_. - -Currently this functionality is only available on master and will be included in our next 1.3 Release Candidate and 1.3 release. - -.. code-block:: python - - pip install https://github.com/PyTorchLightning/pytorch-lightning/archive/refs/heads/master.zip - - -To reach the highest memory efficiency or model size, you must: - -1. Use the DeepSpeed Plugin with the stage 3 parameter -2. Use CPU Offloading to offload weights to CPU, plus have a reasonable amount of CPU RAM to offload onto -3. Use DeepSpeed Activation Checkpointing to shard activations - -Below we describe how to enable all of these to see benefit. **With all these improvements we reached 45 Billion parameters training a GPT model on 8 GPUs with ~1TB of CPU RAM available**. - -Also please have a look at our :ref:`deepspeed-zero-stage-3-tips` which contains a lot of helpful information when configuring your own models. - -.. note:: - Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same amount of GPUS to load. - This additionally means for inference you must use the ``Trainer.test`` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly. - - This limitation is actively being worked on and will be resolved in the near future. - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - from deepspeed.ops.adam import FusedAdam - - class MyModel(pl.LightningModule): - ... - def configure_optimizers(self): - return FusedAdam(self.parameters()) - - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3), precision=16) - trainer.fit(model) - - trainer.test() - trainer.predict() - - -Shard Model Instantly to Reduce Initialization Time/Memory -"""""""""""""""""""""""""""""""""""""""""""""""""""""""""" - -When instantiating really large models, it is sometimes necessary to shard the model layers instantly. - -This is the case if layers may not fit on one single machines CPU or GPU memory, but would fit once sharded across multiple machines. 
-We expose a hook that layers initialized within the hook will be sharded instantly on a per layer basis, allowing you to instantly shard models. - -This reduces the time taken to initialize very large models, as well as ensure we do not run out of memory when instantiating larger models. For more information you can refer to the DeepSpeed docs for `Constructing Massive Models `_. - -.. note:: - When using the ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` may not work for loading saved checkpoints. If you've trained on one GPU, you can manually instantiate the model and call the hook, - however when using multiple GPUs, this will not work as ``LightningModule.load_from_checkpoint`` doesn't support sharded checkpoints. - - We recommend using ``Trainer.test`` or ``Trainer.predict`` for inference. - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - from deepspeed.ops.adam import FusedAdam - - class MyModel(pl.LightningModule): - ... - def configure_sharded_model(self): - # Created within sharded model context, modules are instantly sharded across processes - # as soon as they are made. - self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU()) - - def configure_optimizers(self): - return FusedAdam(self.parameters()) - - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3), precision=16) - trainer.fit(model) - - trainer.test() - trainer.predict() - - -DeepSpeed ZeRO Stage 3 Offload -"""""""""""""""""""""""""""""" - -DeepSpeed ZeRO Stage 3 Offloads optimizer state, gradients to the host CPU to reduce memory usage as ZeRO Stage 2 does, however additionally allows you to offload the parameters as well for even more memory saving. - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - - # Enable CPU Offloading - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3, cpu_offload=True), precision=16) - trainer.fit(model) - - # Enable CPU Offloading, and offload parameters as well to CPU when possible - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3, cpu_offload=True, cpu_offload_params=True), precision=16) - trainer.fit(model) - - -DeepSpeed Activation Checkpointing -"""""""""""""""""""""""""""""""""" - -Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. -They are then re-computed for the backwards pass as needed. - -This saves memory when training larger models however requires using a checkpoint function to run the module as shown below. - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - import deepspeed - - - class MyModel(pl.LightningModule): - ... - - def configure_sharded_model(self): - self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU()) - - def forward(self, x): - # Use the DeepSpeed checkpointing function instead of calling the module directly - output = deepspeed.checkpointing.checkpoint(self.block, x) - return output - - - model = MyModel() - trainer = Trainer( - gpus=4, - plugins=DeepSpeedPlugin( - stage=3, - cpu_offload=True, # Enable CPU Offloading - partition_activations=True, # Optionally move activations to CPU if you have enough memory - cpu_checkpointing=True # Optionally Partition activations across machines - ), - precision=16 - ) - trainer.fit(model) - - -.. 
_deepspeed-zero-stage-3-tips: - -DeepSpeed ZeRO Stage 3 Tips -""""""""""""""""""""""""""" - -Here is some helpful information when setting up DeepSpeed ZeRO Stage 3 with Lightning. - -* If you're using Adam or AdamW, ensure to use FusedAdam or DeepSpeedCPUAdam (for CPU Offloading) rather than the default torch optimizers as they come with large speed benefits -* Treat your GPU/CPU memory as one large pool. In some cases, you may not want to offload certain things (like activations) to provide even more space to offload model parameters -* When offloading to the CPU, make sure to bump up the batch size as GPU memory will be freed - - -Custom DeepSpeed Config -""""""""""""""""""""""" - -In some cases you may want to define your own DeepSpeed Config, to access all parameters defined. We've exposed most of the important parameters, however, there may be debugging parameters to enable. Also, DeepSpeed allows the use of custom DeepSpeed optimizers and schedulers defined within a config file that is supported. - -.. note:: - All plugin default parameters will be ignored when a config object is passed. - All compatible arguments can be seen in the `DeepSpeed docs `_. - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - - deepspeed_config = { - "zero_allow_untested_optimizer": True, - "optimizer": { - "type": "OneBitAdam", - "params": { - "lr": 3e-5, - "betas": [0.998, 0.999], - "eps": 1e-5, - "weight_decay": 1e-9, - "cuda_aware": True, - }, - }, - 'scheduler': { - "type": "WarmupLR", - "params": { - "last_batch_iteration": -1, - "warmup_min_lr": 0, - "warmup_max_lr": 3e-5, - "warmup_num_steps": 100, - } - }, - "zero_optimization": { - "stage": 2, # Enable Stage 2 ZeRO (Optimizer/Gradient state partitioning) - "cpu_offload": True, # Enable Offloading optimizer state/calculation to the host CPU - "contiguous_gradients": True, # Reduce gradient fragmentation. - "overlap_comm": True, # Overlap reduce/backward operation of gradients for speed. - "allgather_bucket_size": 2e8, # Number of elements to all gather at once. - "reduce_bucket_size": 2e8, # Number of elements we reduce/allreduce at once. - } - } - - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(deepspeed_config), precision=16) - trainer.fit(model) - - -We support taking the config as a json formatted file: - -.. code-block:: python - - from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin - - model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin("/path/to/deepspeed_config.json"), precision=16) - trainer.fit(model) - - -You can use also use an environment variable via your PyTorch Lightning script: - -.. code-block:: bash - - PL_DEEPSPEED_CONFIG_PATH=/path/to/deepspeed_config.json python train.py --plugins deepspeed - - ----------- - -.. _sequential-parallelism: - -Sequential Model Parallelism with Checkpointing -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_. -Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially. -We also provide auto-balancing techniques through FairScale, to find optimal balances for the model across GPUs. -In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to minimizing device under-utilization automatically. - -Reference: https://arxiv.org/abs/1811.06965 - -.. 
note:: RPCSequentialPlugin is currently supported only for Pytorch 1.6. - -To get started, install FairScale using the command below. We install a specific branch which contains PyTorch related fixes for Sequential Parallelism. - -.. code-block:: bash - - pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.2.0.zip - -To use Sequential Model Parallelism, you must define a :class:`nn.Sequential ` module that defines the layers you wish to parallelize across GPUs. -This should be kept within the ``sequential_module`` variable within your ``LightningModule`` like below. - -.. code-block:: python - - from pytorch_lightning.plugins.training_type.rpc_sequential import RPCSequentialPlugin - from pytorch_lightning import LightningModule - - class MyModel(LightningModule): - def __init__(self): - ... - self.sequential_module = nn.Sequential(my_layers) - - # Split my module across 4 gpus, one layer each - model = MyModel() - plugin = RPCSequentialPlugin(balance=[1, 1, 1, 1]) - trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin]) - trainer.fit(model) - - -We provide a minimal example of Sequential Model Parallelism using a convolutional model training on cifar10, split onto GPUs `here `_. -To run the example, you need to install `Bolts `_. Install with ``pip install pytorch-lightning-bolts``. - -When running the Sequential Model Parallelism example on 2 GPUS we achieve these memory savings. - -.. list-table:: GPU Memory Utilization - :widths: 25 25 50 - :header-rows: 1 - - * - GPUS - - Without Balancing - - With Balancing - * - Gpu 0 - - 4436 MB - - 1554 MB - * - Gpu 1 - - ~0 - - 994 MB - -To run the example with Sequential Model Parallelism: - -.. code-block:: bash - - python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential - -To run the same example without Sequential Model Parallelism: - -.. code-block:: bash - - python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 1 - - Batch size ---------- When using distributed training make sure to modify your learning rate according to your effective diff --git a/docs/source/advanced/optimized_multi_gpu.rst b/docs/source/advanced/optimized_multi_gpu.rst new file mode 100644 index 0000000000000..442e523016d74 --- /dev/null +++ b/docs/source/advanced/optimized_multi_gpu.rst @@ -0,0 +1,513 @@ +Memory Optimized Multi-GPU Training +=================================== + +When you want to train larger parameter models or fit larger batch sizes on your multi-gpu compute, Lightning provides advanced optimized distributed training to support these cases. + +For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements +in memory usage. Note that some of the extreme memory saving configurations will affect the speed of training. This Speed/Memory trade-off in most cases can be adjusted. + +Choosing a Distributed Plugin +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. note:: + These plugins shard the model states across your GPUs; just in different ways. This means as you scale up the number of GPUs, + you may be able to reach the number of model parameters you'd like to train using plugins that have less of a speed degradation. + + For example when using 128 GPUs, you can scale to large 10/20B parameter models using just DeepSpeed ZeRO Stage 2. 
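+
+To make the recommendations that follow concrete, here is a rough sketch of how they translate into ``Trainer`` arguments. This is purely illustrative, reusing the plugin keys described later in this document; the exact arguments may change while these plugins are in beta.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from pytorch_lightning.plugins import DeepSpeedPlugin
+
+    # minimal speed degradation: shard optimizer state and gradients
+    trainer = Trainer(gpus=8, plugins='ddp_sharded')
+
+    # larger models, small speed hit: DeepSpeed ZeRO Stage 3
+    trainer = Trainer(gpus=8, plugins=DeepSpeedPlugin(stage=3), precision=16)
+
+    # largest memory savings, larger speed hit: ZeRO Stage 3 with CPU offloading
+    trainer = Trainer(gpus=8, plugins=DeepSpeedPlugin(stage=3, cpu_offload=True), precision=16)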
+ +- I want to reach the largest **batch size**, with *minimal* speed degradation - Use :ref:`sharded` or :ref:`deepspeed-zero-stage-2` +- I want to reach the largest **model size**, with *minimal* speed degradation - Use :ref:`deepspeed-zero-stage-2` +- I want to reach the largest **batch size**, and I don't mind a *small* speed hit - Use :ref:`deepspeed-zero-stage-3` +- I want to reach the largest **model size**, and I don't mind a *small* speed hit - Use :ref:`deepspeed-zero-stage-3` +- I want to reach the largest **batch size**, and I don't mind a speed hit - Use :ref:`deepspeed-zero-stage-3-offload` and :ref:`deepspeed-activation-checkpointing` +- I want to reach the largest **model size**, and I don't mind a speed hit - Use :ref:`deepspeed-zero-stage-3-offload` and :ref:`deepspeed-activation-checkpointing` + +.. _sharded: + +Sharded Training +^^^^^^^^^^^^^^^^ +Lightning integration of optimizer sharded training provided by `FairScale `_. +The technique can be found within `DeepSpeed ZeRO `_ and +`ZeRO-2 `_, +however the implementation is built from the ground up to be pytorch compatible and standalone. +Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect normal linear scaling, and significantly reduced memory usage when training large models. + +Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs. +This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients. + +The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication, +these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups. + +Below we use the `NeMo Transformer Lightning Language Modeling example `_ to benchmark the maximum batch size and model size that can be fit on 8 A100 GPUs for DDP vs Sharded Training. +Note that the benefits can still be obtained using 2 or more GPUs, and for even larger batch sizes you can scale to multiple nodes. + +**Increase Your Batch Size** + +Use Sharded Training to scale your batch size further using the same compute. This will reduce your overall epoch time. + ++----------------------+-----------------------+----------------+---------------------+ +| Distributed Training | Model Size (Millions) | Max Batch Size | Percentage Gain (%) | ++======================+=======================+================+=====================+ +| Native DDP | 930 | 32 | - | ++----------------------+-----------------------+----------------+---------------------+ +| Sharded DDP | 930 | **52** | **48%** | ++----------------------+-----------------------+----------------+---------------------+ + +**Increase Your Model Size** + +Use Sharded Training to scale your model size further using the same compute. 
+ ++----------------------+------------+---------------------------+---------------------+ +| Distributed Training | Batch Size | Max Model Size (Millions) | Percentage Gain (%) | ++======================+============+===========================+=====================+ +| Native DDP | 32 | 930 | - | ++----------------------+------------+---------------------------+---------------------+ +| Sharded DDP | 32 | **1404** | **41%** | ++----------------------+------------+---------------------------+---------------------+ +| Native DDP | 8 | 1572 | - | ++----------------------+------------+---------------------------+---------------------+ +| Sharded DDP | 8 | **2872** | **59%** | ++----------------------+------------+---------------------------+---------------------+ + +It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models). +A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful. +Work within the future will bring optional sharding to activations and model parameters to reduce memory further, but come with a speed cost. + +To use Sharded Training, you need to first install FairScale using the command below. + +.. code-block:: bash + + pip install fairscale + + +.. code-block:: python + + # train using Sharded DDP + trainer = Trainer(accelerator='ddp', plugins='ddp_sharded') + +Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag. + +Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required. + +---------- + +.. _deep_speed: + +DeepSpeed +^^^^^^^^^ + +.. note:: + The DeepSpeed plugin is in beta and the API is subject to change. Please create an `issue `_ if you run into any issues. + +`DeepSpeed `_ is a deep learning training optimization library, providing the means to train massive billion parameter models at scale. +Using the DeepSpeed plugin, we were able to **train model sizes of 10 Billion parameters and above**, with a lot of useful information in this `benchmark `_ and the DeepSpeed `docs `_. +DeepSpeed also offers lower level training optimizations, and efficient optimizers such as `1-bit Adam `_. We recommend using DeepSpeed in environments where speed and memory optimizations are important (such as training large billion parameter models). + +To use DeepSpeed, you first need to install DeepSpeed using the commands below. + +.. code-block:: bash + + pip install deepspeed + +If you run into an issue with the install or later in training, ensure that the CUDA version of the pytorch you've installed matches your locally installed CUDA (you can see which one has been recognized by running ``nvcc --version``). + +.. note:: + Currently ``resume_from_checkpoint`` and manual optimization are not supported. + + DeepSpeed currently only supports single optimizer, single scheduler within the training loop. + +.. _deepspeed-zero-stage-2: + +DeepSpeed ZeRO Stage 2 +"""""""""""""""""""""" + +By default, we enable `DeepSpeed ZeRO Stage 2 `_, which partitions your optimizer states (Stage 1) and your gradients (Stage 2) across your GPUs to reduce memory. 
In most cases, this is more efficient or at parity with DDP, primarily due to the optimized custom communications written by the DeepSpeed team. +As a result, benefits can also be seen on a single GPU. Do note that the default bucket sizes allocate around ``3.6GB`` of VRAM to use during distributed communications, which can be tweaked when instantiating the plugin described in a few sections below. + +.. note:: + To use ZeRO, you must use ``precision=16``. + +.. code-block:: python + + from pytorch_lightning import Trainer + + model = MyModel() + trainer = Trainer(gpus=4, plugins='deepspeed', precision=16) + trainer.fit(model) + +.. _deepspeed-zero-stage-2-offload: + +DeepSpeed ZeRO Stage 2 Offload +"""""""""""""""""""""""""""""" + +Below we show an example of running `ZeRO-Offload `_. ZeRO-Offload leverages the host CPU to offload optimizer memory/computation, reducing the overall memory consumption. + +.. note:: + To use ZeRO-Offload, you must use ``precision=16``. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(cpu_offload=True), precision=16) + trainer.fit(model) + + +This can also be done via the command line using a Pytorch Lightning script: + +.. code-block:: bash + + python train.py --plugins deepspeed --precision 16 --gpus 4 + + +You can also modify the ZeRO-Offload parameters via the plugin as below. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(cpu_offload=True, allgather_bucket_size=5e8, reduce_bucket_size=5e8), precision=16) + trainer.fit(model) + + +.. note:: + We suggest tuning the ``allgather_bucket_size`` parameter and ``reduce_bucket_size`` parameter to find optimum parameters based on your model size. + These control how large a buffer we limit the model to using when reducing gradients/gathering updated parameters. Smaller values will result in less memory, but tradeoff with speed. + + DeepSpeed allocates a reduce buffer size `multiplied by 4.5x `_ so take that into consideration when tweaking the parameters. + + The plugin sets a reasonable default of ``2e8``, which should work for most low VRAM GPUs (less than ``7GB``), allocating roughly ``3.6GB`` of VRAM as buffer. Higher VRAM GPUs should aim for values around ``5e8``. + +For even more speed benefit, DeepSpeed offers an optimized CPU version of ADAM called `DeepSpeedCPUAdam `_ to run the offloaded computation, which is faster than the standard PyTorch implementation. + +.. code-block:: python + + import pytorch_lightning + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + from deepspeed.ops.adam import DeepSpeedCPUAdam + + class MyModel(pl.LightningModule): + ... + def configure_optimizers(self): + # DeepSpeedCPUAdam provides 5x to 7x speedup over torch.optim.adam(w) + return DeepSpeedCPUAdam(self.parameters()) + + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(cpu_offload=True), precision=16) + trainer.fit(model) + +.. _deepspeed-zero-stage-3: + +DeepSpeed ZeRO Stage 3 +"""""""""""""""""""""" + +DeepSpeed ZeRO Stage 3 shards the optimizer states, gradients and the model parameters (also optionally activations). 
Sharding model parameters and activations comes with an increase in distributed communication, however allows you to scale your models massively from one GPU to multiple GPUs. +**The DeepSpeed team report the ability to fine-tune models with over 40B parameters on a single GPU and over 2 Trillion parameters on 512 GPUs.** For more information we suggest checking the `DeepSpeed ZeRO-3 Offload documentation `__. + +We've ran benchmarks for all these features and given a simple example of how all these features work in Lightning, which you can see at `minGPT `_. + +Currently this functionality is only available on master and will be included in our next 1.3 Release Candidate and 1.3 release. + +.. code-block:: python + + pip install https://github.com/PyTorchLightning/pytorch-lightning/archive/refs/heads/master.zip + + +To reach the highest memory efficiency or model size, you must: + +1. Use the DeepSpeed Plugin with the stage 3 parameter +2. Use CPU Offloading to offload weights to CPU, plus have a reasonable amount of CPU RAM to offload onto +3. Use DeepSpeed Activation Checkpointing to shard activations + +Below we describe how to enable all of these to see benefit. **With all these improvements we reached 45 Billion parameters training a GPT model on 8 GPUs with ~1TB of CPU RAM available**. + +Also please have a look at our :ref:`deepspeed-zero-stage-3-tips` which contains a lot of helpful information when configuring your own models. + +.. note:: + Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same amount of GPUS to load. + This additionally means for inference you must use the ``Trainer.test`` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly. + + This limitation is actively being worked on and will be resolved in the near future. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + from deepspeed.ops.adam import FusedAdam + + class MyModel(pl.LightningModule): + ... + def configure_optimizers(self): + return FusedAdam(self.parameters()) + + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3), precision=16) + trainer.fit(model) + + trainer.test() + trainer.predict() + + +Shard Model Instantly to Reduce Initialization Time/Memory +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +When instantiating really large models, it is sometimes necessary to shard the model layers instantly. + +This is the case if layers may not fit on one single machines CPU or GPU memory, but would fit once sharded across multiple machines. +We expose a hook that layers initialized within the hook will be sharded instantly on a per layer basis, allowing you to instantly shard models. + +This reduces the time taken to initialize very large models, as well as ensure we do not run out of memory when instantiating larger models. For more information you can refer to the DeepSpeed docs for `Constructing Massive Models `_. + +.. note:: + When using the ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` may not work for loading saved checkpoints. If you've trained on one GPU, you can manually instantiate the model and call the hook, + however when using multiple GPUs, this will not work as ``LightningModule.load_from_checkpoint`` doesn't support sharded checkpoints. 
+ + We recommend using ``Trainer.test`` or ``Trainer.predict`` for inference. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + from deepspeed.ops.adam import FusedAdam + + class MyModel(pl.LightningModule): + ... + def configure_sharded_model(self): + # Created within sharded model context, modules are instantly sharded across processes + # as soon as they are made. + self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU()) + + def configure_optimizers(self): + return FusedAdam(self.parameters()) + + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3), precision=16) + trainer.fit(model) + + trainer.test() + trainer.predict() + + +.. _deepspeed-zero-stage-3-offload: + +DeepSpeed ZeRO Stage 3 Offload +"""""""""""""""""""""""""""""" + +DeepSpeed ZeRO Stage 3 Offloads optimizer state, gradients to the host CPU to reduce memory usage as ZeRO Stage 2 does, however additionally allows you to offload the parameters as well for even more memory saving. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + + # Enable CPU Offloading + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3, cpu_offload=True), precision=16) + trainer.fit(model) + + # Enable CPU Offloading, and offload parameters as well to CPU when possible + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3, cpu_offload=True, cpu_offload_params=True), precision=16) + trainer.fit(model) + + +.. _deepspeed-activation-checkpointing: + +DeepSpeed Activation Checkpointing +"""""""""""""""""""""""""""""""""" + +Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. +They are then re-computed for the backwards pass as needed. + +This saves memory when training larger models however requires using a checkpoint function to run the module as shown below. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + import deepspeed + + + class MyModel(pl.LightningModule): + ... + + def configure_sharded_model(self): + self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU()) + + def forward(self, x): + # Use the DeepSpeed checkpointing function instead of calling the module directly + output = deepspeed.checkpointing.checkpoint(self.block, x) + return output + + + model = MyModel() + trainer = Trainer( + gpus=4, + plugins=DeepSpeedPlugin( + stage=3, + cpu_offload=True, # Enable CPU Offloading + partition_activations=True, # Optionally move activations to CPU if you have enough memory + cpu_checkpointing=True # Optionally Partition activations across machines + ), + precision=16 + ) + trainer.fit(model) + + +.. _deepspeed-zero-stage-3-tips: + +DeepSpeed ZeRO Stage 3 Tips +""""""""""""""""""""""""""" + +Here is some helpful information when setting up DeepSpeed ZeRO Stage 3 with Lightning. + +* If you're using Adam or AdamW, ensure to use FusedAdam or DeepSpeedCPUAdam (for CPU Offloading) rather than the default torch optimizers as they come with large speed benefits +* Treat your GPU/CPU memory as one large pool. 
In some cases, you may not want to offload certain things (like activations) to provide even more space to offload model parameters +* When offloading to the CPU, make sure to bump up the batch size as GPU memory will be freed + + +Custom DeepSpeed Config +""""""""""""""""""""""" + +In some cases you may want to define your own DeepSpeed Config, to access all parameters defined. We've exposed most of the important parameters, however, there may be debugging parameters to enable. Also, DeepSpeed allows the use of custom DeepSpeed optimizers and schedulers defined within a config file that is supported. + +.. note:: + All plugin default parameters will be ignored when a config object is passed. + All compatible arguments can be seen in the `DeepSpeed docs `_. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + + deepspeed_config = { + "zero_allow_untested_optimizer": True, + "optimizer": { + "type": "OneBitAdam", + "params": { + "lr": 3e-5, + "betas": [0.998, 0.999], + "eps": 1e-5, + "weight_decay": 1e-9, + "cuda_aware": True, + }, + }, + 'scheduler': { + "type": "WarmupLR", + "params": { + "last_batch_iteration": -1, + "warmup_min_lr": 0, + "warmup_max_lr": 3e-5, + "warmup_num_steps": 100, + } + }, + "zero_optimization": { + "stage": 2, # Enable Stage 2 ZeRO (Optimizer/Gradient state partitioning) + "cpu_offload": True, # Enable Offloading optimizer state/calculation to the host CPU + "contiguous_gradients": True, # Reduce gradient fragmentation. + "overlap_comm": True, # Overlap reduce/backward operation of gradients for speed. + "allgather_bucket_size": 2e8, # Number of elements to all gather at once. + "reduce_bucket_size": 2e8, # Number of elements we reduce/allreduce at once. + } + } + + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(deepspeed_config), precision=16) + trainer.fit(model) + + +We support taking the config as a json formatted file: + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + + model = MyModel() + trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin("/path/to/deepspeed_config.json"), precision=16) + trainer.fit(model) + + +You can use also use an environment variable via your PyTorch Lightning script: + +.. code-block:: bash + + PL_DEEPSPEED_CONFIG_PATH=/path/to/deepspeed_config.json python train.py --plugins deepspeed + + +---------- + +.. _sequential-parallelism: + +Sequential Model Parallelism with Checkpointing +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_. +Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially. +We also provide auto-balancing techniques through FairScale, to find optimal balances for the model across GPUs. +In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to minimizing device under-utilization automatically. + +Reference: https://arxiv.org/abs/1811.06965 + +.. note:: RPCSequentialPlugin is currently supported only for Pytorch 1.6. + +To get started, install FairScale using the command below. We install a specific branch which contains PyTorch related fixes for Sequential Parallelism. + +.. 
code-block:: bash + + pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.2.0.zip + +To use Sequential Model Parallelism, you must define a :class:`nn.Sequential ` module that defines the layers you wish to parallelize across GPUs. +This should be kept within the ``sequential_module`` variable within your ``LightningModule`` like below. + +.. code-block:: python + + from pytorch_lightning.plugins.training_type.rpc_sequential import RPCSequentialPlugin + from pytorch_lightning import LightningModule + + class MyModel(LightningModule): + def __init__(self): + ... + self.sequential_module = nn.Sequential(my_layers) + + # Split my module across 4 gpus, one layer each + model = MyModel() + plugin = RPCSequentialPlugin(balance=[1, 1, 1, 1]) + trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin]) + trainer.fit(model) + + +We provide a minimal example of Sequential Model Parallelism using a convolutional model training on cifar10, split onto GPUs `here `_. +To run the example, you need to install `Bolts `_. Install with ``pip install pytorch-lightning-bolts``. + +When running the Sequential Model Parallelism example on 2 GPUS we achieve these memory savings. + +.. list-table:: GPU Memory Utilization + :widths: 25 25 50 + :header-rows: 1 + + * - GPUS + - Without Balancing + - With Balancing + * - Gpu 0 + - 4436 MB + - 1554 MB + * - Gpu 1 + - ~0 + - 994 MB + +To run the example with Sequential Model Parallelism: + +.. code-block:: bash + + python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential + +To run the same example without Sequential Model Parallelism: + +.. code-block:: bash + + python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 1 diff --git a/docs/source/index.rst b/docs/source/index.rst index 030ab6d70aa3e..00294404b50da 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -105,6 +105,7 @@ PyTorch Lightning Documentation common/lightning_cli advanced/lr_finder advanced/multi_gpu + advanced/optimized_multi_gpu advanced/multiple_loaders common/weights_loading common/optimizers From 6b128e886cec6a531ec274b80112fc5b49e6880c Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Thu, 8 Apr 2021 16:44:41 +0100 Subject: [PATCH 02/10] Small changes --- docs/source/advanced/optimized_multi_gpu.rst | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/docs/source/advanced/optimized_multi_gpu.rst b/docs/source/advanced/optimized_multi_gpu.rst index 442e523016d74..8a1eddb0c5dd7 100644 --- a/docs/source/advanced/optimized_multi_gpu.rst +++ b/docs/source/advanced/optimized_multi_gpu.rst @@ -6,14 +6,13 @@ When you want to train larger parameter models or fit larger batch sizes on your For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements in memory usage. Note that some of the extreme memory saving configurations will affect the speed of training. This Speed/Memory trade-off in most cases can be adjusted. -Choosing a Distributed Plugin -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Choosing an Optimized Distributed Plugin +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -.. note:: - These plugins shard the model states across your GPUs; just in different ways. 
This means as you scale up the number of GPUs, - you may be able to reach the number of model parameters you'd like to train using plugins that have less of a speed degradation. +These optimized Multi-GPU plugins shard the model states across your GPUs; just in different ways. This means as you scale up the number of GPUs, +you may be able to reach the number of model parameters you'd like to train using plugins that have less of a speed degradation. - For example when using 128 GPUs, you can scale to large 10/20B parameter models using just DeepSpeed ZeRO Stage 2. +For example when using 128 GPUs, you can scale to large 10 to 20 Billion parameter models using just DeepSpeed ZeRO Stage 2. - I want to reach the largest **batch size**, with *minimal* speed degradation - Use :ref:`sharded` or :ref:`deepspeed-zero-stage-2` - I want to reach the largest **model size**, with *minimal* speed degradation - Use :ref:`deepspeed-zero-stage-2` From b8815014c682f095168f255259cd9a86eb58f27f Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Wed, 28 Apr 2021 14:35:50 +0100 Subject: [PATCH 03/10] Better documentation --- docs/source/advanced/optimized_multi_gpu.rst | 297 ++++++++++++++----- 1 file changed, 223 insertions(+), 74 deletions(-) diff --git a/docs/source/advanced/optimized_multi_gpu.rst b/docs/source/advanced/optimized_multi_gpu.rst index 8a1eddb0c5dd7..8f8b752258e7b 100644 --- a/docs/source/advanced/optimized_multi_gpu.rst +++ b/docs/source/advanced/optimized_multi_gpu.rst @@ -1,25 +1,44 @@ Memory Optimized Multi-GPU Training =================================== -When you want to train larger parameter models or fit larger batch sizes on your multi-gpu compute, Lightning provides advanced optimized distributed training to support these cases. +When training large models or fitting larger batch sizes on multi-gpu compute, Lightning provides advanced optimized multi-gpu plugins to support these cases. For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements -in memory usage. Note that some of the extreme memory saving configurations will affect the speed of training. This Speed/Memory trade-off in most cases can be adjusted. +in memory usage. -Choosing an Optimized Distributed Plugin -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Note that some of the extreme memory saving configurations will affect the speed of training. This Speed/Memory trade-off in most cases can be adjusted. -These optimized Multi-GPU plugins shard the model states across your GPUs; just in different ways. This means as you scale up the number of GPUs, -you may be able to reach the number of model parameters you'd like to train using plugins that have less of a speed degradation. +Some of these memory efficient plugins rely on offloading onto other forms of memory, such as CPU RAM or NVMe. This means you can even see memory benefits on a **single GPU**, using a plugin such as :ref:`deepspeed-zero-stage-3-offload`. -For example when using 128 GPUs, you can scale to large 10 to 20 Billion parameter models using just DeepSpeed ZeRO Stage 2. 
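+
+As a rough sketch, a configuration at that scale could look like the following, assuming ``MyModel`` is defined as in the examples below. The node and GPU counts are purely illustrative, and ZeRO Stage 2 is simply the DeepSpeed plugin's default mode:
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+
+    model = MyModel()
+
+    # 16 nodes x 8 GPUs = 128 GPUs, using the default DeepSpeed ZeRO Stage 2 configuration
+    trainer = Trainer(gpus=8, num_nodes=16, plugins='deepspeed', precision=16)
+    trainer.fit(model)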
+Choosing an Optimized Multi-GPU Plugin +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -- I want to reach the largest **batch size**, with *minimal* speed degradation - Use :ref:`sharded` or :ref:`deepspeed-zero-stage-2` -- I want to reach the largest **model size**, with *minimal* speed degradation - Use :ref:`deepspeed-zero-stage-2` -- I want to reach the largest **batch size**, and I don't mind a *small* speed hit - Use :ref:`deepspeed-zero-stage-3` -- I want to reach the largest **model size**, and I don't mind a *small* speed hit - Use :ref:`deepspeed-zero-stage-3` -- I want to reach the largest **batch size**, and I don't mind a speed hit - Use :ref:`deepspeed-zero-stage-3-offload` and :ref:`deepspeed-activation-checkpointing` -- I want to reach the largest **model size**, and I don't mind a speed hit - Use :ref:`deepspeed-zero-stage-3-offload` and :ref:`deepspeed-activation-checkpointing` +Currently all Memory Optimized Multi-GPU plugins shard the model states across your GPUs; just in different ways. + +This means as you scale up the number of GPUs, you can reach the number of model parameters you'd like to train. + +Pre-training vs Fine-tuning +""""""""""""""""""""""""""" + +When fine-tuning, we often use a magnitude less data compared to pre-training a model. This is important when choosing a distributed plugin as usually for pre-training, **we are compute bound**. +This means we cannot sacrifice throughput as much as if we were fine-tuning, because in fine-tuning the data requirement is smaller. + +Overall: + +* When **fine-tuning** a model, use advanced memory efficient plugins such as :ref:`deepspeed-zero-stage-3` or :ref:`deepspeed-zero-stage-3-offload`, allowing you to fine-tune larger models if you are limited on compute +* When **pre-training** a model, use simpler optimizations such :ref:`sharded`, :ref:`deepspeed-zero-stage-2` or :ref:`fully-sharded`, scaling the number of GPUs to reach larger parameter sizes +* For both fine-tuning and pre-training, use :ref:`deepspeed-activation-checkpointing` or :ref:`fairscale-activation-checkpointing` as the throughput degradation is not significant + +For example when using 128 GPUs, you can **pre-train** large 10 to 20 Billion parameter models using :ref:`deepspeed-zero-stage-2` without having to take a performance hit with more advanced optimized multi-gpu plugins. + +But for **fine-tuning** a model, you can reach 10 to 20 Billion parameter models using :ref:`deepspeed-zero-stage-3-offload` on a **single GPU**. This does come with a significant throughput hit, which needs to be weighed accordingly. + +When Shouldn't I use an Optimized Multi-GPU Plugin? +""""""""""""""""""""""""""""""""""""""""""""""""""" + +Sharding techniques help when model sizes are large (500M+ parameters). We've seen benefits from 500M+, however in cases where your model is small (say ResNet50 of around 80M Parameters) it may be best to stick to normal distributed training. + +---------- .. _sharded: @@ -37,40 +56,9 @@ This means the memory overhead per GPU is lower, as each GPU only has to maintai The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication, these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups. 
-Below we use the `NeMo Transformer Lightning Language Modeling example `_ to benchmark the maximum batch size and model size that can be fit on 8 A100 GPUs for DDP vs Sharded Training. -Note that the benefits can still be obtained using 2 or more GPUs, and for even larger batch sizes you can scale to multiple nodes. - -**Increase Your Batch Size** - -Use Sharded Training to scale your batch size further using the same compute. This will reduce your overall epoch time. - -+----------------------+-----------------------+----------------+---------------------+ -| Distributed Training | Model Size (Millions) | Max Batch Size | Percentage Gain (%) | -+======================+=======================+================+=====================+ -| Native DDP | 930 | 32 | - | -+----------------------+-----------------------+----------------+---------------------+ -| Sharded DDP | 930 | **52** | **48%** | -+----------------------+-----------------------+----------------+---------------------+ - -**Increase Your Model Size** - -Use Sharded Training to scale your model size further using the same compute. - -+----------------------+------------+---------------------------+---------------------+ -| Distributed Training | Batch Size | Max Model Size (Millions) | Percentage Gain (%) | -+======================+============+===========================+=====================+ -| Native DDP | 32 | 930 | - | -+----------------------+------------+---------------------------+---------------------+ -| Sharded DDP | 32 | **1404** | **41%** | -+----------------------+------------+---------------------------+---------------------+ -| Native DDP | 8 | 1572 | - | -+----------------------+------------+---------------------------+---------------------+ -| Sharded DDP | 8 | **2872** | **59%** | -+----------------------+------------+---------------------------+---------------------+ - It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models). A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful. -Work within the future will bring optional sharding to activations and model parameters to reduce memory further, but come with a speed cost. +Use :ref:`fairscale-activation-checkpointing` or :ref:`fully-sharded` to see even more benefit at the cost of some throughput. To use Sharded Training, you need to first install FairScale using the command below. @@ -82,7 +70,7 @@ To use Sharded Training, you need to first install FairScale using the command b .. code-block:: python # train using Sharded DDP - trainer = Trainer(accelerator='ddp', plugins='ddp_sharded') + trainer = Trainer(plugins='ddp_sharded') Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag. @@ -90,7 +78,150 @@ Internally we re-initialize your optimizers and shard them across your machines ---------- -.. _deep_speed: +.. _fully-sharded: + +Fully Sharded Training +^^^^^^^^^^^^^^^^^^^^^^ + +.. note:: + Fully Sharded Training is in beta and the API is subject to change. Please create an `issue `_ if you run into any issues. + +`Fully Sharded `__ shards optimizer state, gradients and parameters across data parallel workers. This allows you to fit much larger models onto multiple GPUs into memory. 
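+
+A minimal sketch of enabling the plugin in its default mode is shown below. The ``'ddp_fully_sharded'`` key matches the manual wrap example later in this section, but treat it as illustrative while the API is in beta; FairScale must be installed as described in :ref:`sharded`.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+
+    model = MyModel()
+    # Default configuration; see below for wrapping layers to also shard parameters
+    trainer = Trainer(gpus=4, plugins='ddp_fully_sharded', precision=16)
+    trainer.fit(model)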
+ +By default, Fully Sharded acts similar to :ref:`sharded` which shards optimizer states and gradients. If you can train with default Fully Sharded, it is recommended to just use :ref:`sharded`. + +Shard Parameters to Reach 10+ Billion Parameters +"""""""""""""""""""""""""""""""""""""""""""""""" + +To reach larger parameter sizes and be memory efficient, we have to shard parameters. There are various ways to enable this. + +Auto Wrap +""""""""" + +``auto_wrap`` will recursively wrap modules within the ``LightningModule`` with nested Fully Sharded Wrappers, +signalling that we'd like to partition these modules across data parallel devices, discarding the full weights when not required (information `here `__). + +Enabling `auto_wrap` doesn't require code changes, however can have varying level of success based on the complexity of your model. **Auto Wrap does not support models with shared parameters**, use :ref:`manual-wrap` instead. + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + + class MyModel(pl.LightningModule): + ... + + model = MyModel() + trainer = Trainer(gpus=4, plugins='ddp_fully_sharded_auto_wrap', precision=16) + trainer.fit(model) + + trainer.test() + trainer.predict() + + +.. _manual-wrap: + +Manual Wrap +""""""""""" + +To activate parameter sharding, you can also wrap layers using provided ``wrap`` or ``auto_wrap`` functions as described below. + +When not using Fully Sharded these wrap functions are a no-op. This means once the changes have been made, there is no need to remove the changes for other plugins. + +This is a requirement for really large models and also saves on instantiation time as modules are sharded instantly, rather than after the entire model is created in memory. + +.. code-block:: python + + import torch + import torch.nn as nn + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + from fairscale.nn import checkpoint_wrapper, auto_wrap, wrap + + class MyModel(pl.LightningModule): + ... + def configure_sharded_model(self): + # Created within sharded model context, modules are instantly sharded across processes + # as soon as they are wrapped with ``wrap`` or ``auto_wrap`` + + # Wraps the layer in a Fully Sharded Wrapper automatically + linear_layer = wrap(nn.Linear(32, 32)) + + # Wraps the module recursively + # based on a minimum number of parameters (default 100M parameters) + block = auto_wrap( + nn.Sequential( + nn.Linear(32, 32), + nn.ReLU() + ) + ) + + # For best memory efficiency, + # add fairscale activation checkpointing + final_block = auto_wrap( + checkpoint_wrapper( + nn.Sequential( + nn.Linear(32, 32), + nn.ReLU() + ) + ) + ) + self.block = nn.Sequential( + linear_layer, + nn.ReLU(), + block, + final_block + ) + + def configure_optimizers(self): + return torch.optim.AdamW(self.parameters()) + + model = MyModel() + trainer = Trainer(gpus=4, plugins='ddp_fully_sharded', precision=16) + trainer.fit(model) + + trainer.test() + trainer.predict() + + +---------- + +.. _fairscale-activation-checkpointing: + +FairScale Activation Checkpointing +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. +They are then re-computed for the backwards pass as needed. + +This saves memory when training larger models however requires wrapping modules you'd like to use activation checkpointing on. See `here `__ for more information. + +.. 
code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DeepSpeedPlugin + from fairscale.nn import checkpoint_wrapper + + + class MyModel(pl.LightningModule): + def __init__(self): + # Wrap layer using checkpoint_wrapper + linear_layer = checkpoint_wrapper(nn.Linear(32, 32)) + self.block = nn.Sequential(linear_layer, nn.ReLU()) + + def configure_sharded_model(self): + # Can be defined within this function as well + # for when using Fully Sharded. + linear_layer = checkpoint_wrapper(nn.Linear(32, 32)) + self.block = nn.Sequential(linear_layer, nn.ReLU()) + + def forward(self, x): + # Use the DeepSpeed checkpointing function instead of calling the module directly + output = deepspeed.checkpointing.checkpoint(self.block, x) + return output + + +.. _deepspeed: DeepSpeed ^^^^^^^^^ @@ -98,10 +229,22 @@ DeepSpeed .. note:: The DeepSpeed plugin is in beta and the API is subject to change. Please create an `issue `_ if you run into any issues. -`DeepSpeed `_ is a deep learning training optimization library, providing the means to train massive billion parameter models at scale. -Using the DeepSpeed plugin, we were able to **train model sizes of 10 Billion parameters and above**, with a lot of useful information in this `benchmark `_ and the DeepSpeed `docs `_. +`DeepSpeed `__ is a deep learning training optimization library, providing the means to train massive billion parameter models at scale. +Using the DeepSpeed plugin, we were able to **train model sizes of 10 Billion parameters and above**, with a lot of useful information in this `benchmark `_ and the `DeepSpeed docs `__. DeepSpeed also offers lower level training optimizations, and efficient optimizers such as `1-bit Adam `_. We recommend using DeepSpeed in environments where speed and memory optimizations are important (such as training large billion parameter models). +Below is a summary of all the configurations of DeepSpeed. + +* :ref:`deepspeed-zero-stage-2` - **Shard optimizer states and gradients**, remains at parity with DDP with memory improvement + +* :ref:`deepspeed-zero-stage-2-offload` - **Offload optimizer states and gradients to CPU**. Increases communication, but significant memory improvement + +* :ref:`deepspeed-zero-stage-3` - **Shard optimizer states, gradients, (Optional) activations and parameters**. Increases communication volume, but even more memory improvement + +* :ref:`deepspeed-zero-stage-3-offload` - **Offload optimizer states, gradients, (Optional) activations and parameters to CPU**. Increases communication, but even more signficant memory improvement. + +* :ref:`deepspeed-activation-checkpointing` - **Free activations after forward pass**. Increases computation, but provides memory improvement for all stages. + To use DeepSpeed, you first need to install DeepSpeed using the commands below. .. code-block:: bash @@ -111,7 +254,6 @@ To use DeepSpeed, you first need to install DeepSpeed using the commands below. If you run into an issue with the install or later in training, ensure that the CUDA version of the pytorch you've installed matches your locally installed CUDA (you can see which one has been recognized by running ``nvcc --version``). .. note:: - Currently ``resume_from_checkpoint`` and manual optimization are not supported. DeepSpeed currently only supports single optimizer, single scheduler within the training loop. @@ -131,9 +273,14 @@ As a result, benefits can also be seen on a single GPU. 
Do note that the default from pytorch_lightning import Trainer model = MyModel() - trainer = Trainer(gpus=4, plugins='deepspeed', precision=16) + trainer = Trainer(gpus=4, plugins='deepspeed_stage_2', precision=16) trainer.fit(model) +.. code-block:: bash + + python train.py --plugins deepspeed_stage_2 --precision 16 --gpus 4 + + .. _deepspeed-zero-stage-2-offload: DeepSpeed ZeRO Stage 2 Offload @@ -150,7 +297,7 @@ Below we show an example of running `ZeRO-Offload `_. -.. note:: - When using the ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` may not work for loading saved checkpoints. If you've trained on one GPU, you can manually instantiate the model and call the hook, - however when using multiple GPUs, this will not work as ``LightningModule.load_from_checkpoint`` doesn't support sharded checkpoints. - - We recommend using ``Trainer.test`` or ``Trainer.predict`` for inference. - .. code-block:: python + import torch.nn as nn from pytorch_lightning import Trainer from pytorch_lightning.plugins import DeepSpeedPlugin from deepspeed.ops.adam import FusedAdam @@ -285,7 +422,7 @@ This reduces the time taken to initialize very large models, as well as ensure w return FusedAdam(self.parameters()) model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3), precision=16) + trainer = Trainer(gpus=4, plugins='deepspeed_stage_3', precision=16) trainer.fit(model) trainer.test() @@ -306,12 +443,16 @@ DeepSpeed ZeRO Stage 3 Offloads optimizer state, gradients to the host CPU to re # Enable CPU Offloading model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3, cpu_offload=True), precision=16) + trainer = Trainer(gpus=4, plugins='deepspeed_stage_3_offload', precision=16) trainer.fit(model) - # Enable CPU Offloading, and offload parameters as well to CPU when possible + # Enable CPU Offloading, and offload parameters to CPU model = MyModel() - trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3, cpu_offload=True, cpu_offload_params=True), precision=16) + trainer = Trainer( + gpus=4, + plugins=DeepSpeedPlugin(stage=3, cpu_offload=True, cpu_offload_params=True), + precision=16 + ) trainer.fit(model) @@ -345,13 +486,21 @@ This saves memory when training larger models however requires using a checkpoin model = MyModel() + + + trainer = Trainer( + gpus=4, + plugins='deepspeed_stage_3_offload', + precision=16 + ) + + # Enable CPU Activation Checkpointing trainer = Trainer( gpus=4, plugins=DeepSpeedPlugin( stage=3, cpu_offload=True, # Enable CPU Offloading - partition_activations=True, # Optionally move activations to CPU if you have enough memory - cpu_checkpointing=True # Optionally Partition activations across machines + cpu_checkpointing=True # (Optional) offload activations to CPU ), precision=16 ) @@ -368,7 +517,7 @@ Here is some helpful information when setting up DeepSpeed ZeRO Stage 3 with Lig * If you're using Adam or AdamW, ensure to use FusedAdam or DeepSpeedCPUAdam (for CPU Offloading) rather than the default torch optimizers as they come with large speed benefits * Treat your GPU/CPU memory as one large pool. In some cases, you may not want to offload certain things (like activations) to provide even more space to offload model parameters * When offloading to the CPU, make sure to bump up the batch size as GPU memory will be freed - +* We also support sharded checkpointing. 
By passing ``save_full_weights=False`` to the ``DeepSpeedPlugin``, we'll save shards of the model which allows you to save extremely large models. However to load the model and run test/validation/predict you must use the Trainer object. Custom DeepSpeed Config """"""""""""""""""""""" From 809f0854907ffe09a6c19e2ae933057137ca51ff Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Wed, 28 Apr 2021 20:45:48 +0100 Subject: [PATCH 04/10] Address code review --- docs/source/advanced/optimized_multi_gpu.rst | 26 +++++++++----------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/docs/source/advanced/optimized_multi_gpu.rst b/docs/source/advanced/optimized_multi_gpu.rst index 8f8b752258e7b..8b0a1ab5dbdd6 100644 --- a/docs/source/advanced/optimized_multi_gpu.rst +++ b/docs/source/advanced/optimized_multi_gpu.rst @@ -1,10 +1,7 @@ -Memory Optimized Multi-GPU Training -=================================== +Advanced GPU Optimized Training +=============================== -When training large models or fitting larger batch sizes on multi-gpu compute, Lightning provides advanced optimized multi-gpu plugins to support these cases. - -For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements -in memory usage. +When training large models or fitting larger batch sizes on multi-gpu compute, Lightning provides advanced optimized multi-gpu plugins to support these cases. For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements in GPU memory usage. Note that some of the extreme memory saving configurations will affect the speed of training. This Speed/Memory trade-off in most cases can be adjusted. @@ -13,14 +10,12 @@ Some of these memory efficient plugins rely on offloading onto other forms of me Choosing an Optimized Multi-GPU Plugin ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Currently all Memory Optimized Multi-GPU plugins shard the model states across your GPUs; just in different ways. - -This means as you scale up the number of GPUs, you can reach the number of model parameters you'd like to train. +Unlike PyTorch Distributed Data Parallel (DDP) where the maximum trainable model size and batch size do not change with respect to the number of GPUs, memory optimized plugins can accommodate bigger model and larger batch as more GPUs are used. This means as you scale up the number of GPUs, you can reach the number of model parameters you'd like to train. Pre-training vs Fine-tuning """"""""""""""""""""""""""" -When fine-tuning, we often use a magnitude less data compared to pre-training a model. This is important when choosing a distributed plugin as usually for pre-training, **we are compute bound**. +When fine-tuning, we often use a magnitude less data compared to pre-training a model. This is important when choosing a distributed plugin as usually for pre-training, **where we are compute bound**. This means we cannot sacrifice throughput as much as if we were fine-tuning, because in fine-tuning the data requirement is smaller. Overall: @@ -36,7 +31,7 @@ But for **fine-tuning** a model, you can reach 10 to 20 Billion parameter models When Shouldn't I use an Optimized Multi-GPU Plugin? 
""""""""""""""""""""""""""""""""""""""""""""""""""" -Sharding techniques help when model sizes are large (500M+ parameters). We've seen benefits from 500M+, however in cases where your model is small (say ResNet50 of around 80M Parameters) it may be best to stick to normal distributed training. +Sharding techniques help when model sizes are fairly large; roughly 500M+ parameters is where we've seen benefits. However, in cases where your model is small (ResNet50 of around 80M Parameters) it may be best to stick to normal distributed training, unless you are using unusually large batch sizes. ---------- @@ -48,12 +43,12 @@ Lightning integration of optimizer sharded training provided by `FairScale `_ and `ZeRO-2 `_, however the implementation is built from the ground up to be pytorch compatible and standalone. -Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect normal linear scaling, and significantly reduced memory usage when training large models. +Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect near-normal linear scaling (if your network allows), and significantly reduced memory usage when training large models. Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs. This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients. -The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication, +The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of efficient communication, these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups. It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models). @@ -191,8 +186,9 @@ This is a requirement for really large models and also saves on instantiation ti FairScale Activation Checkpointing ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. -They are then re-computed for the backwards pass as needed. +Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. They are then re-computed for the backwards pass as needed. + +FairScales' checkpointing wrapper also handles batch norm layers correctly unlike the PyTorch implementation, ensuring stats are tracked correctly due to the multiple forward passes. This saves memory when training larger models however requires wrapping modules you'd like to use activation checkpointing on. See `here `__ for more information. 
From 09c54d703e8a487b375663612914d70902fc02b7 Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Thu, 29 Apr 2021 10:12:57 +0100 Subject: [PATCH 05/10] Add warning about using trainer.model, clean up some of the examples --- docs/source/advanced/optimized_multi_gpu.rst | 24 +++++++++++++------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/docs/source/advanced/optimized_multi_gpu.rst b/docs/source/advanced/optimized_multi_gpu.rst index 8b0a1ab5dbdd6..9f462948a2ab3 100644 --- a/docs/source/advanced/optimized_multi_gpu.rst +++ b/docs/source/advanced/optimized_multi_gpu.rst @@ -85,6 +85,20 @@ Fully Sharded Training By default, Fully Sharded acts similar to :ref:`sharded` which shards optimizer states and gradients. If you can train with default Fully Sharded, it is recommended to just use :ref:`sharded`. +.. warning:: + Due to the behaviour of Fully Sharded, when defining optimizers in ``configure_optimizers`` you must use ``self.trainer.model`` as described below, which is the sharded model. + +.. code-block:: python + + from pytorch_lightning import Trainer + + class MyModel(pl.LightningModule): + ... + def configure_optimizers(self): + # Replace torch.optim.AdamW(self.parameters()) + return torch.optim.AdamW(self.trainer.model.parameters()) + + Shard Parameters to Reach 10+ Billion Parameters """""""""""""""""""""""""""""""""""""""""""""""" @@ -101,10 +115,11 @@ Enabling `auto_wrap` doesn't require code changes, however can have varying leve .. code-block:: python from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin class MyModel(pl.LightningModule): ... + def configure_optimizers(self): + return torch.optim.AdamW(self.trainer.model.parameters()) model = MyModel() trainer = Trainer(gpus=4, plugins='ddp_fully_sharded_auto_wrap', precision=16) @@ -130,7 +145,6 @@ This is a requirement for really large models and also saves on instantiation ti import torch import torch.nn as nn from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin from fairscale.nn import checkpoint_wrapper, auto_wrap, wrap class MyModel(pl.LightningModule): @@ -195,7 +209,6 @@ This saves memory when training larger models however requires wrapping modules .. code-block:: python from pytorch_lightning import Trainer - from pytorch_lightning.plugins import DeepSpeedPlugin from fairscale.nn import checkpoint_wrapper @@ -211,11 +224,6 @@ This saves memory when training larger models however requires wrapping modules linear_layer = checkpoint_wrapper(nn.Linear(32, 32)) self.block = nn.Sequential(linear_layer, nn.ReLU()) - def forward(self, x): - # Use the DeepSpeed checkpointing function instead of calling the module directly - output = deepspeed.checkpointing.checkpoint(self.block, x) - return output - .. 
_deepspeed: From a99f3d94920996c4362e0fc7606787bd04a91397 Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Tue, 4 May 2021 13:12:07 +0100 Subject: [PATCH 06/10] Add section for ddp, remove references and old sequential documentation --- ...timized_multi_gpu.rst => advanced_gpu.rst} | 136 +++++++++++------- docs/source/advanced/training_tricks.rst | 9 +- docs/source/benchmarking/performance.rst | 27 +--- docs/source/index.rst | 2 +- 4 files changed, 94 insertions(+), 80 deletions(-) rename docs/source/advanced/{optimized_multi_gpu.rst => advanced_gpu.rst} (85%) diff --git a/docs/source/advanced/optimized_multi_gpu.rst b/docs/source/advanced/advanced_gpu.rst similarity index 85% rename from docs/source/advanced/optimized_multi_gpu.rst rename to docs/source/advanced/advanced_gpu.rst index 9f462948a2ab3..df7cf7fd2b187 100644 --- a/docs/source/advanced/optimized_multi_gpu.rst +++ b/docs/source/advanced/advanced_gpu.rst @@ -1,14 +1,16 @@ Advanced GPU Optimized Training =============================== -When training large models or fitting larger batch sizes on multi-gpu compute, Lightning provides advanced optimized multi-gpu plugins to support these cases. For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements in GPU memory usage. +When training large models, fitting larger batch sizes or trying to increase throughput using multi-gpu compute, Lightning provides advanced optimized multi-gpu plugins to support these cases. For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements in GPU memory usage. Note that some of the extreme memory saving configurations will affect the speed of training. This Speed/Memory trade-off in most cases can be adjusted. Some of these memory efficient plugins rely on offloading onto other forms of memory, such as CPU RAM or NVMe. This means you can even see memory benefits on a **single GPU**, using a plugin such as :ref:`deepspeed-zero-stage-3-offload`. -Choosing an Optimized Multi-GPU Plugin -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Choosing an Advanced Distributed GPU Plugin +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If you would like to stick with PyTorch DDP, see :ref:`ddp-optimizations`. Unlike PyTorch Distributed Data Parallel (DDP) where the maximum trainable model size and batch size do not change with respect to the number of GPUs, memory optimized plugins can accommodate bigger model and larger batch as more GPUs are used. This means as you scale up the number of GPUs, you can reach the number of model parameters you'd like to train. @@ -28,8 +30,8 @@ For example when using 128 GPUs, you can **pre-train** large 10 to 20 Billion pa But for **fine-tuning** a model, you can reach 10 to 20 Billion parameter models using :ref:`deepspeed-zero-stage-3-offload` on a **single GPU**. This does come with a significant throughput hit, which needs to be weighed accordingly. -When Shouldn't I use an Optimized Multi-GPU Plugin? -""""""""""""""""""""""""""""""""""""""""""""""""""" +When Shouldn't I use an Optimized Distributed Plugin? 
+""""""""""""""""""""""""""""""""""""""""""""""""""""" Sharding techniques help when model sizes are fairly large; roughly 500M+ parameters is where we've seen benefits. However, in cases where your model is small (ResNet50 of around 80M Parameters) it may be best to stick to normal distributed training, unless you are using unusually large batch sizes. @@ -591,75 +593,107 @@ You can use also use an environment variable via your PyTorch Lightning script: PL_DEEPSPEED_CONFIG_PATH=/path/to/deepspeed_config.json python train.py --plugins deepspeed +.. _ddp-optimizations: ----------- +DDP Optimizations +^^^^^^^^^^^^^^^^^ -.. _sequential-parallelism: -Sequential Model Parallelism with Checkpointing -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_. -Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially. -We also provide auto-balancing techniques through FairScale, to find optimal balances for the model across GPUs. -In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to minimizing device under-utilization automatically. +Gradients as Bucket View +"""""""""""""""""""""""" -Reference: https://arxiv.org/abs/1811.06965 +Enabling ``gradient_as_bucket_view=True`` in the ``DDPPlugin`` will make gradients views point to different offsets of the ``allreduce`` communication buckets. See `DistributedDataParallel `__ for more information. -.. note:: RPCSequentialPlugin is currently supported only for Pytorch 1.6. +This can reduce peak memory usage and throughput as saved memory will be equal to the total gradient memory + removes the need to copy gradients to the ``allreduce`` communication buckets. -To get started, install FairScale using the command below. We install a specific branch which contains PyTorch related fixes for Sequential Parallelism. +.. note:: -.. code-block:: bash + When ``gradient_as_bucket_view=True`` you cannot call ``detach_()`` on gradients. If hitting such errors, please fix it by referring to the :meth:`~torch.optim.Optimizer.zero_grad` function in ``torch/optim/optimizer.py`` as a solution (`source `__). + +.. code-block:: python - pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.2.0.zip + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DDPPlugin -To use Sequential Model Parallelism, you must define a :class:`nn.Sequential ` module that defines the layers you wish to parallelize across GPUs. -This should be kept within the ``sequential_module`` variable within your ``LightningModule`` like below. + model = MyModel() + trainer = Trainer(gpus=4, plugins=DDPPlugin(gradient_as_bucket_view=True)) + trainer.fit(model) -.. code-block:: python +DDP Communication Hooks +""""""""""""""""""""""" - from pytorch_lightning.plugins.training_type.rpc_sequential import RPCSequentialPlugin - from pytorch_lightning import LightningModule +DDP Communication hooks is an interface to control how gradients are communicated across workers, overriding the standard allreduce in DistributedDataParallel. This allows you to enable performance improving communication hooks when using multiple nodes. - class MyModel(LightningModule): - def __init__(self): - ... - self.sequential_module = nn.Sequential(my_layers) +.. 
note:: + DDP communication hooks needs pytorch version at least 1.8.0 + +Enable `FP16 Compress Hook for multi-node throughput improvement `__: + +.. code-block:: python + + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DDPPlugin + from torch.distributed.algorithms.ddp_comm_hooks import ( + default_hooks as default, + powerSGD_hook as powerSGD, + ) - # Split my module across 4 gpus, one layer each model = MyModel() - plugin = RPCSequentialPlugin(balance=[1, 1, 1, 1]) - trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin]) + trainer = Trainer(gpus=4, plugins=DDPPlugin(ddp_comm_hook=default.fp16_compress_hook)) trainer.fit(model) +Enable `PowerSGD for multi-node throughput improvement `__: -We provide a minimal example of Sequential Model Parallelism using a convolutional model training on cifar10, split onto GPUs `here `_. -To run the example, you need to install `Bolts `_. Install with ``pip install pytorch-lightning-bolts``. +.. note:: -When running the Sequential Model Parallelism example on 2 GPUS we achieve these memory savings. + PowerSGD typically requires extra memory of the same size as the model’s gradients to enable error feedback, which can compensate for biased compressed communication and improve accuracy (`source `__). -.. list-table:: GPU Memory Utilization - :widths: 25 25 50 - :header-rows: 1 +.. code-block:: python - * - GPUS - - Without Balancing - - With Balancing - * - Gpu 0 - - 4436 MB - - 1554 MB - * - Gpu 1 - - ~0 - - 994 MB + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DDPPlugin + from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD -To run the example with Sequential Model Parallelism: + model = MyModel() + trainer = Trainer( + gpus=4, + plugins=DDPPlugin( + ddp_comm_state=powerSGD.PowerSGDState( + process_group=None, + matrix_approximation_rank=1, + start_powerSGD_iter=5000, + ), + ddp_comm_hook=powerSGD.powerSGD_hook, + ) + ) + trainer.fit(model) -.. code-block:: bash - python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential +Combine hooks for accumulated benefit: -To run the same example without Sequential Model Parallelism: +.. note:: + DDP communication wrappers needs pytorch version at least 1.9.0 -.. code-block:: bash +.. code-block:: python - python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 1 + from pytorch_lightning import Trainer + from pytorch_lightning.plugins import DDPPlugin + from torch.distributed.algorithms.ddp_comm_hooks import ( + default_hooks as default, + powerSGD_hook as powerSGD, + ) + + model = MyModel() + trainer = Trainer( + gpus=4, + plugins=DDPPlugin( + ddp_comm_state=powerSGD.PowerSGDState( + process_group=None, + matrix_approximation_rank=1, + start_powerSGD_iter=5000, + ), + ddp_comm_hook=powerSGD.powerSGD_hook, + ddp_comm_wrapper=default.fp16_compress_wrapper, + ) + ) + trainer.fit(model) diff --git a/docs/source/advanced/training_tricks.rst b/docs/source/advanced/training_tricks.rst index c3b232b41c13c..f7a349bf61739 100644 --- a/docs/source/advanced/training_tricks.rst +++ b/docs/source/advanced/training_tricks.rst @@ -149,9 +149,8 @@ The algorithm in short works by: .. warning:: Batch size finder is not supported for DDP yet, it is coming soon. 
-Sequential Model Parallelism with Checkpointing ---------------------------------------------------------------------- -PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_. -Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially. +Advanced GPU Optimizations +-------------------------- -For more information, refer to :ref:`sequential-parallelism`. +When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory effeciency and model scaling. +Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`. diff --git a/docs/source/benchmarking/performance.rst b/docs/source/benchmarking/performance.rst index db66ad419fc48..217cf7f5779a5 100644 --- a/docs/source/benchmarking/performance.rst +++ b/docs/source/benchmarking/performance.rst @@ -132,30 +132,11 @@ However, know that 16-bit and multi-processing (any DDP) can have issues. Here a ---------- -Use Sharded DDP for GPU memory and scaling optimization -------------------------------------------------------- +Advanced GPU Optimizations +-------------------------- -Sharded DDP is a lightning integration of `DeepSpeed ZeRO `_ and -`ZeRO-2 `_ -provided by `Fairscale `_. - -When training on multiple GPUs sharded DDP can assist to increase memory efficiency substantially, and in some cases performance on multi-node is better than traditional DDP. -This is due to efficient communication and parallelization under the hood. - -To use Optimizer Sharded Training, refer to :ref:`model-parallelism`. - -Sharded DDP can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag. - -Refer to the :doc:`distributed computing guide for more details <../advanced/multi_gpu>`. - ----------- - -Sequential Model Parallelism with Checkpointing ------------------------------------------------ -PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_. -Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially. - -For more information, refer to :ref:`sequential-parallelism`. +When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory effeciency and model scaling. +Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`. 
---------- diff --git a/docs/source/index.rst b/docs/source/index.rst index 8d5dc4fdacc8b..71ad835e02d31 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -104,7 +104,7 @@ PyTorch Lightning Documentation common/lightning_cli advanced/lr_finder advanced/multi_gpu - advanced/optimized_multi_gpu + advanced/advanced_gpu advanced/multiple_loaders common/weights_loading common/optimizers From b222b2d7dfdc302a16a74c8d2e0d446c4f7f8ea3 Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Thu, 6 May 2021 10:37:03 +0100 Subject: [PATCH 07/10] Remove Fully Sharded documentation for now --- docs/source/advanced/advanced_gpu.rst | 132 +------------------------- 1 file changed, 2 insertions(+), 130 deletions(-) diff --git a/docs/source/advanced/advanced_gpu.rst b/docs/source/advanced/advanced_gpu.rst index df7cf7fd2b187..7f959db9752a7 100644 --- a/docs/source/advanced/advanced_gpu.rst +++ b/docs/source/advanced/advanced_gpu.rst @@ -23,7 +23,7 @@ This means we cannot sacrifice throughput as much as if we were fine-tuning, bec Overall: * When **fine-tuning** a model, use advanced memory efficient plugins such as :ref:`deepspeed-zero-stage-3` or :ref:`deepspeed-zero-stage-3-offload`, allowing you to fine-tune larger models if you are limited on compute -* When **pre-training** a model, use simpler optimizations such :ref:`sharded`, :ref:`deepspeed-zero-stage-2` or :ref:`fully-sharded`, scaling the number of GPUs to reach larger parameter sizes +* When **pre-training** a model, use simpler optimizations such :ref:`sharded`, :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes * For both fine-tuning and pre-training, use :ref:`deepspeed-activation-checkpointing` or :ref:`fairscale-activation-checkpointing` as the throughput degradation is not significant For example when using 128 GPUs, you can **pre-train** large 10 to 20 Billion parameter models using :ref:`deepspeed-zero-stage-2` without having to take a performance hit with more advanced optimized multi-gpu plugins. @@ -55,7 +55,7 @@ these benefits in multi-GPU setups are almost free and throughput scales well wi It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models). A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful. -Use :ref:`fairscale-activation-checkpointing` or :ref:`fully-sharded` to see even more benefit at the cost of some throughput. +Use :ref:`fairscale-activation-checkpointing` to see even more benefit at the cost of some throughput. To use Sharded Training, you need to first install FairScale using the command below. @@ -73,128 +73,6 @@ Sharded Training can work across all DDP variants by adding the additional ``--p Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required. ----------- - -.. _fully-sharded: - -Fully Sharded Training -^^^^^^^^^^^^^^^^^^^^^^ - -.. note:: - Fully Sharded Training is in beta and the API is subject to change. Please create an `issue `_ if you run into any issues. - -`Fully Sharded `__ shards optimizer state, gradients and parameters across data parallel workers. This allows you to fit much larger models onto multiple GPUs into memory. 
- -By default, Fully Sharded acts similar to :ref:`sharded` which shards optimizer states and gradients. If you can train with default Fully Sharded, it is recommended to just use :ref:`sharded`. - -.. warning:: - Due to the behaviour of Fully Sharded, when defining optimizers in ``configure_optimizers`` you must use ``self.trainer.model`` as described below, which is the sharded model. - -.. code-block:: python - - from pytorch_lightning import Trainer - - class MyModel(pl.LightningModule): - ... - def configure_optimizers(self): - # Replace torch.optim.AdamW(self.parameters()) - return torch.optim.AdamW(self.trainer.model.parameters()) - - -Shard Parameters to Reach 10+ Billion Parameters -"""""""""""""""""""""""""""""""""""""""""""""""" - -To reach larger parameter sizes and be memory efficient, we have to shard parameters. There are various ways to enable this. - -Auto Wrap -""""""""" - -``auto_wrap`` will recursively wrap modules within the ``LightningModule`` with nested Fully Sharded Wrappers, -signalling that we'd like to partition these modules across data parallel devices, discarding the full weights when not required (information `here `__). - -Enabling `auto_wrap` doesn't require code changes, however can have varying level of success based on the complexity of your model. **Auto Wrap does not support models with shared parameters**, use :ref:`manual-wrap` instead. - -.. code-block:: python - - from pytorch_lightning import Trainer - - class MyModel(pl.LightningModule): - ... - def configure_optimizers(self): - return torch.optim.AdamW(self.trainer.model.parameters()) - - model = MyModel() - trainer = Trainer(gpus=4, plugins='ddp_fully_sharded_auto_wrap', precision=16) - trainer.fit(model) - - trainer.test() - trainer.predict() - - -.. _manual-wrap: - -Manual Wrap -""""""""""" - -To activate parameter sharding, you can also wrap layers using provided ``wrap`` or ``auto_wrap`` functions as described below. - -When not using Fully Sharded these wrap functions are a no-op. This means once the changes have been made, there is no need to remove the changes for other plugins. - -This is a requirement for really large models and also saves on instantiation time as modules are sharded instantly, rather than after the entire model is created in memory. - -.. code-block:: python - - import torch - import torch.nn as nn - from pytorch_lightning import Trainer - from fairscale.nn import checkpoint_wrapper, auto_wrap, wrap - - class MyModel(pl.LightningModule): - ... - def configure_sharded_model(self): - # Created within sharded model context, modules are instantly sharded across processes - # as soon as they are wrapped with ``wrap`` or ``auto_wrap`` - - # Wraps the layer in a Fully Sharded Wrapper automatically - linear_layer = wrap(nn.Linear(32, 32)) - - # Wraps the module recursively - # based on a minimum number of parameters (default 100M parameters) - block = auto_wrap( - nn.Sequential( - nn.Linear(32, 32), - nn.ReLU() - ) - ) - - # For best memory efficiency, - # add fairscale activation checkpointing - final_block = auto_wrap( - checkpoint_wrapper( - nn.Sequential( - nn.Linear(32, 32), - nn.ReLU() - ) - ) - ) - self.block = nn.Sequential( - linear_layer, - nn.ReLU(), - block, - final_block - ) - - def configure_optimizers(self): - return torch.optim.AdamW(self.parameters()) - - model = MyModel() - trainer = Trainer(gpus=4, plugins='ddp_fully_sharded', precision=16) - trainer.fit(model) - - trainer.test() - trainer.predict() - - ---------- .. 
_fairscale-activation-checkpointing: @@ -220,12 +98,6 @@ This saves memory when training larger models however requires wrapping modules linear_layer = checkpoint_wrapper(nn.Linear(32, 32)) self.block = nn.Sequential(linear_layer, nn.ReLU()) - def configure_sharded_model(self): - # Can be defined within this function as well - # for when using Fully Sharded. - linear_layer = checkpoint_wrapper(nn.Linear(32, 32)) - self.block = nn.Sequential(linear_layer, nn.ReLU()) - .. _deepspeed: From 0f743533a6ba320918fbee86d46b803360994eb7 Mon Sep 17 00:00:00 2001 From: Sean Naren Date: Thu, 6 May 2021 12:16:07 +0100 Subject: [PATCH 08/10] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Carlos Mocholí Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> --- docs/source/advanced/advanced_gpu.rst | 10 +++++----- docs/source/advanced/training_tricks.rst | 2 +- docs/source/benchmarking/performance.rst | 2 +- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/advanced/advanced_gpu.rst b/docs/source/advanced/advanced_gpu.rst index 7f959db9752a7..39da871b1ff1e 100644 --- a/docs/source/advanced/advanced_gpu.rst +++ b/docs/source/advanced/advanced_gpu.rst @@ -1,23 +1,23 @@ Advanced GPU Optimized Training =============================== -When training large models, fitting larger batch sizes or trying to increase throughput using multi-gpu compute, Lightning provides advanced optimized multi-gpu plugins to support these cases. For example if you'd like to train a large billion parameter transformer model, or to scale your batch size when training a semi-supervised learning model, using a Lightning optimized distributed training plugin will offer substantial improvements in GPU memory usage. +When training large models, fitting larger batch sizes, or trying to increase throughput using multi-GPU compute, Lightning provides advanced optimized distributed training plugins to support these cases and offer substantial improvements in memory usage. Note that some of the extreme memory saving configurations will affect the speed of training. This Speed/Memory trade-off in most cases can be adjusted. -Some of these memory efficient plugins rely on offloading onto other forms of memory, such as CPU RAM or NVMe. This means you can even see memory benefits on a **single GPU**, using a plugin such as :ref:`deepspeed-zero-stage-3-offload`. +Some of these memory-efficient plugins rely on offloading onto other forms of memory, such as CPU RAM or NVMe. This means you can even see memory benefits on a **single GPU**, using a plugin such as :ref:`deepspeed-zero-stage-3-offload`. Choosing an Advanced Distributed GPU Plugin ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you would like to stick with PyTorch DDP, see :ref:`ddp-optimizations`. -Unlike PyTorch Distributed Data Parallel (DDP) where the maximum trainable model size and batch size do not change with respect to the number of GPUs, memory optimized plugins can accommodate bigger model and larger batch as more GPUs are used. This means as you scale up the number of GPUs, you can reach the number of model parameters you'd like to train. +Unlike PyTorch's DistributedDataParallel (DDP) where the maximum trainable model size and batch size do not change with respect to the number of GPUs, memory-optimized plugins can accommodate bigger models and larger batches as more GPUs are used. 
This means as you scale up the number of GPUs, you can reach the number of model parameters you'd like to train. Pre-training vs Fine-tuning """"""""""""""""""""""""""" -When fine-tuning, we often use a magnitude less data compared to pre-training a model. This is important when choosing a distributed plugin as usually for pre-training, **where we are compute bound**. +When fine-tuning, we often use a magnitude less data compared to pre-training a model. This is important when choosing a distributed plugin as usually for pre-training, **where we are compute-bound**. This means we cannot sacrifice throughput as much as if we were fine-tuning, because in fine-tuning the data requirement is smaller. Overall: @@ -33,7 +33,7 @@ But for **fine-tuning** a model, you can reach 10 to 20 Billion parameter models When Shouldn't I use an Optimized Distributed Plugin? """"""""""""""""""""""""""""""""""""""""""""""""""""" -Sharding techniques help when model sizes are fairly large; roughly 500M+ parameters is where we've seen benefits. However, in cases where your model is small (ResNet50 of around 80M Parameters) it may be best to stick to normal distributed training, unless you are using unusually large batch sizes. +Sharding techniques help when model sizes are fairly large; roughly 500M+ parameters is where we've seen benefits. However, in cases where your model is small (ResNet50 of around 80M Parameters) it may be best to stick to normal distributed training, unless you are using unusually large batch sizes or inputs. ---------- diff --git a/docs/source/advanced/training_tricks.rst b/docs/source/advanced/training_tricks.rst index f7a349bf61739..83d6f78e5c04d 100644 --- a/docs/source/advanced/training_tricks.rst +++ b/docs/source/advanced/training_tricks.rst @@ -152,5 +152,5 @@ The algorithm in short works by: Advanced GPU Optimizations -------------------------- -When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory effeciency and model scaling. +When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling. Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`. diff --git a/docs/source/benchmarking/performance.rst b/docs/source/benchmarking/performance.rst index 217cf7f5779a5..6e2b546fb275f 100644 --- a/docs/source/benchmarking/performance.rst +++ b/docs/source/benchmarking/performance.rst @@ -135,7 +135,7 @@ However, know that 16-bit and multi-processing (any DDP) can have issues. Here a Advanced GPU Optimizations -------------------------- -When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory effeciency and model scaling. +When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling. Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`. 
---------- From e9230d07d3d2869cbe62f140a66c9acfcd94ade0 Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Thu, 6 May 2021 12:16:19 +0100 Subject: [PATCH 09/10] Address code review --- docs/source/advanced/advanced_gpu.rst | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/source/advanced/advanced_gpu.rst b/docs/source/advanced/advanced_gpu.rst index 39da871b1ff1e..d4a45493eb0f8 100644 --- a/docs/source/advanced/advanced_gpu.rst +++ b/docs/source/advanced/advanced_gpu.rst @@ -73,8 +73,6 @@ Sharded Training can work across all DDP variants by adding the additional ``--p Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required. ----------- - .. _fairscale-activation-checkpointing: FairScale Activation Checkpointing @@ -94,9 +92,8 @@ This saves memory when training larger models however requires wrapping modules class MyModel(pl.LightningModule): def __init__(self): - # Wrap layer using checkpoint_wrapper - linear_layer = checkpoint_wrapper(nn.Linear(32, 32)) - self.block = nn.Sequential(linear_layer, nn.ReLU()) + # Wrap layers using checkpoint_wrapper + self.block = checkpoint_wrapper(nn.Sequential(nn.Linear(32, 32), nn.ReLU())) .. _deepspeed: @@ -465,6 +462,8 @@ You can use also use an environment variable via your PyTorch Lightning script: PL_DEEPSPEED_CONFIG_PATH=/path/to/deepspeed_config.json python train.py --plugins deepspeed +---------- + .. _ddp-optimizations: DDP Optimizations From 74e7fc335ab2163e2e689262537377d3faa76bca Mon Sep 17 00:00:00 2001 From: SeanNaren Date: Thu, 6 May 2021 12:17:36 +0100 Subject: [PATCH 10/10] Address code review --- docs/source/advanced/advanced_gpu.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/advanced/advanced_gpu.rst b/docs/source/advanced/advanced_gpu.rst index d4a45493eb0f8..8146744b521db 100644 --- a/docs/source/advanced/advanced_gpu.rst +++ b/docs/source/advanced/advanced_gpu.rst @@ -33,7 +33,7 @@ But for **fine-tuning** a model, you can reach 10 to 20 Billion parameter models When Shouldn't I use an Optimized Distributed Plugin? """"""""""""""""""""""""""""""""""""""""""""""""""""" -Sharding techniques help when model sizes are fairly large; roughly 500M+ parameters is where we've seen benefits. However, in cases where your model is small (ResNet50 of around 80M Parameters) it may be best to stick to normal distributed training, unless you are using unusually large batch sizes or inputs. +Sharding techniques help when model sizes are fairly large; roughly 500M+ parameters is where we've seen benefits. However, in cases where your model is small (ResNet50 of around 80M Parameters) it may be best to stick to ordinary distributed training, unless you are using unusually large batch sizes or inputs. ----------
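For reference, the "ordinary distributed training" referred to above is plain DDP. A minimal sketch follows, with ``MyModel`` standing in for a small (roughly 80M parameter, ResNet50-sized) ``LightningModule``:

.. code-block:: python

    from pytorch_lightning import Trainer

    model = MyModel()  # placeholder for a small LightningModule

    # standard DistributedDataParallel training, no sharding or offloading
    trainer = Trainer(gpus=4, accelerator='ddp', precision=16)
    trainer.fit(model)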