diff --git a/doc/api/training/distributed.rst b/doc/api/training/distributed.rst
index 21837bc1e4..be050d1011 100644
--- a/doc/api/training/distributed.rst
+++ b/doc/api/training/distributed.rst
@@ -22,10 +22,19 @@ The SageMaker Distributed Data Parallel Library
 The SageMaker Distributed Model Parallel Library
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. toctree::
-   :maxdepth: 2
-
-   smd_model_parallel
-   smp_versions/latest
-   smd_model_parallel_general
-   smd_model_parallel_release_notes/smd_model_parallel_change_log
+.. note::
+
+   Since the release of the SageMaker model parallelism (SMP) library version 2 in December 2023,
+   this documentation is no longer maintained.
+   The live documentation is available at
+   `SageMaker model parallelism library v2
+   `_
+   in the *Amazon SageMaker User Guide*.
+
+   The documentation for the SMP library v1.x is archived and available at
+   `Run distributed training with the SageMaker model parallelism library
+   `_
+   in the *Amazon SageMaker User Guide*,
+   and the SMP v1.x API reference is available in the
+   `SageMaker Python SDK v2.199.0 documentation
+   `_.
diff --git a/doc/api/training/smd_model_parallel.rst b/doc/api/training/smd_model_parallel.rst
deleted file mode 100644
index 635dcd582d..0000000000
--- a/doc/api/training/smd_model_parallel.rst
+++ /dev/null
@@ -1,43 +0,0 @@
-The SageMaker Distributed Model Parallel Library Overview
----------------------------------------------------------
-
-The Amazon SageMaker distributed model parallel library is a model parallelism library for training
-large deep learning models that were previously difficult to train due to GPU memory limitations.
-The library automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training,
-allowing you to increase prediction accuracy by creating larger models with more parameters.
-
-You can use the library to automatically partition your existing TensorFlow and PyTorch workloads
-across multiple GPUs with minimal code changes. The library's API can be accessed through the Amazon SageMaker SDK.
-
-.. tip::
-
-   We recommend that you use this API documentation along with the conceptual guide at
-   `SageMaker's Distributed Model Parallel
-   `_
-   in the *Amazon SageMaker developer guide*.
-   The conceptual guide includes the following topics:
-
-   - An overview of model parallelism, and the library's
-     `core features `_,
-     and `extended features for PyTorch `_.
-   - Instructions on how to modify `TensorFlow
-     `_
-     and `PyTorch
-     `_
-     training scripts.
-   - Instructions on how to `run a distributed training job using the SageMaker Python SDK
-     and the SageMaker model parallel library
-     `_.
-   - `Configuration tips and pitfalls
-     `_.
-
-
-.. important::
-   The model parallel library only supports SageMaker training jobs using CUDA 11.
-   Make sure you use the pre-built Deep Learning Containers.
-   If you want to extend or customize your own training image,
-   you must use a CUDA 11 base image. For more information, see `Extend a Prebuilt Docker
-   Container that Contains SageMaker's Distributed Model Parallel Library
-   `_
-   and `Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library
-   `_.
diff --git a/doc/api/training/smd_model_parallel_general.rst b/doc/api/training/smd_model_parallel_general.rst
deleted file mode 100644
index e626ad9083..0000000000
--- a/doc/api/training/smd_model_parallel_general.rst
+++ /dev/null
@@ -1,465 +0,0 @@
-.. 
_sm-sdk-modelparallel-general: - -############################################################# -Run a Distributed Training Job Using the SageMaker Python SDK -############################################################# - -Walk through the following pages to learn about the SageMaker model parallel library's APIs -to configure and enable distributed model parallelism -through an Amazon SageMaker estimator. - -.. _sm-sdk-modelparallel-params: - -Configuration Parameters for ``distribution`` -============================================= - -Amazon SageMaker's TensorFlow and PyTorch estimator objects contain a ``distribution`` parameter, -which you can use to enable and specify parameters for SageMaker distributed training. -The SageMaker model parallel library internally uses MPI. -To use model parallelism, both ``smdistributed`` and MPI must be enabled -through the ``distribution`` parameter. - -The following code example is a template of setting up model parallelism for a PyTorch estimator. - -.. code:: python - - import sagemaker - from sagemaker.pytorch import PyTorch - - smp_options = { - "enabled":True, - "parameters": { - ... - } - } - - mpi_options = { - "enabled" : True, - ... - } - - smdmp_estimator = PyTorch( - ... - distribution={ - "smdistributed": {"modelparallel": smp_options}, - "mpi": mpi_options - } - ) - - smdmp_estimator.fit() - -.. tip:: - - This page provides you a complete list of parameters you can use - when you construct a SageMaker estimator and configure for distributed training. - - To find examples of how to construct a SageMaker estimator with the distributed training parameters, see - `Launch a SageMaker Distributed Model Parallel Training Job `_ - in the `SageMaker's Distributed Model Parallel developer guide `_. - -.. contents:: Table of Contents - :depth: 3 - :local: - -Parameters for ``smdistributed`` ----------------------------------- - -You can use the following parameters to initialize the library -configuring a dictionary for ``modelparallel``, which goes -into the ``smdistributed`` option for the ``distribution`` parameter. - -.. note:: - - ``partitions`` for TensorFlow and ``pipeline_parallel_degree`` for PyTorch are required parameters. - All other parameters in the following - table are optional. - -Common Parameters -~~~~~~~~~~~~~~~~~ - -.. list-table:: - :widths: 10 20 10 60 - :header-rows: 1 - - * - Parameter - - Type / Valid values - - Default - - Description - * - ``partitions`` for TensorFlow and PyTorch with smdistributed-modelparallel=v1.6) - - int - - - - **Required.** The number of partitions to split the model into. - In case of ``pipeline_parallel_degree`` for PyTorch, this is the number of devices - over which pipeline parallelism will be performed. - * - ``microbatches`` - - int - - 1 - - The number of microbatches to perform pipelining over. 1 means no pipelining. - Batch size must be divisible by the number of microbatches. - * - ``pipeline`` - - ``"interleaved"`` or ``"simple"`` - - ``"interleaved"`` - - The pipeline schedule. - * - ``optimize`` - - ``"memory"`` or ``"speed"`` - - ``"memory"`` - - Determines the distribution mechanism of transformer layers. - If optimizing ``speed``, there will be less communication across tensor-parallel ranks - and layer normalization will not be distributed. However, there will be duplicate activations - stored across tensor-parallel ranks. 
- If optimizing ``memory``, there will be no redundant activations stored, - but this will result in more communication overhead across tensor parallel ranks. - * - ``placement_strategy`` - - ``"cluster"``, ``"spread"``, or a permutation of the string ``D``, ``P``, and ``T``. - - ``"cluster"`` - - Determines the mapping of model partitions onto physical devices. - When hybrid model/data parallelism is used, ``cluster`` places a single model replica in - neighboring device IDs. Contrarily, ``spread`` places a model replica as far as possible. - For more information, see :ref:`ranking-basics`. - - In case of the permutation letters, ``D`` stands for reduced-data parallelism, - ``P`` stands for pipeline parallelism, - and ``T`` stands for tensor parallelism. - ``spread`` is equivalent to ``"TPD"``, and ``cluster`` is equivalent to ``"DPT"``. - For more information, see :ref:`ranking-basics-tensor-parallelism`. - - Note: For TensorFlow, tensor parallelism is not implemented and - available parameter values are only ``"spread"`` and ``"cluster"``. - * - ``auto_partition`` - - bool - - ``True`` - - Enable auto-partitioning. If disabled, ``default_partition`` parameter must be provided. - * - ``default_partition`` - - int - - ``0`` - - **Required** if ``auto_partition`` is false. The partition ID to place operations/modules - that are not placed in any ``smp.partition`` contexts. - -TensorFlow-specific Parameters -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. list-table:: - :widths: 10 20 10 60 - :header-rows: 1 - - * - Parameter - - Type / Valid values - - Default - - Description - * - ``contiguous`` - - bool - - ``True`` - - Whether the model partitions should be contiguous. If true, each partition forms a connected component in the computational graph, unless the graph itself is not connected. - * - ``horovod`` - - bool - - ``False`` - - Must be set to ``True`` if hybrid model/data parallelism is used and the data parallelism (DP) framework is Horovod. - - -PyTorch-specific Parameters -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. list-table:: - :widths: 10 20 10 60 - :header-rows: 1 - - * - Parameter - - Type / Valid values - - Default - - Description - * - ``memory_weight`` - - float [0.0, 1.0] - - ``0.2`` if ``optimize`` is ``"speed"``, else ``0.8`` - - The weight of memory balancing in the auto-partitioni ng objective, as opposed to balancing computational load. If 0.0, the library only tries to balance computation; if 1.0 the library only tries to balance the memory use. Any value in between interpolates between these extremes. - * - ``ddp`` - - bool - - ``False`` - - Must be set to True if hybrid model/data parallelism is used with DistributedDataParallel. DistributedDataParallel is used with NCCL backend, and uses the MASTER_PORT provided by SageMaker. - * - ``active_microbatches`` (**smdistributed-modelparallel**>=v1.3) - - int - - ``partitions`` + 2 - - This is the maximum number of microbatches that are simultaneously in execution during pipelining. Jointly scaling batch size and number of microbatches can often mitigate the pipeline bubble overhead, but that can lead to increased memory usage if too many microbatches are simultaneously in execution. In such cases setting the number of active microbatches to a lower number can help control memory usage. By default this is set to two plus the number of partitions of the model. 
- * - ``deterministic_server`` (**smdistributed-modelparallel**>=v1.3) - - bool - - ``False`` - - Setting this to true ensures that the execution server for pipelining executes requests in the same order across all data parallel ranks. - * - ``offload_activations`` (**smdistributed-modelparallel**>=v1.6) - - bool - - False - - Enables activation - offloading. To improve GPU memory usage, use activation offloading - only when (1) the ``microbatches`` and ``active_microbatches`` are - greater than 1, and (2) activation checkpointing is enabled for at - least one module in the model. - * - ``activation_loading_horizon`` (**smdistributed-modelparallel**>=v1.6) - - int - - 4 - - Specify the number - of pipeline tasks. This determines how early the activations should - be loaded back to the GPU, expressed in number of pipeline tasks. - Smaller value indicates that activations are loaded closer in time to - when they are needed for backward pass. Setting this value too small - might improve memory usage, but might potentially cause throughput - loss and GPU bottlenecks during the CPU-to-GPU data transfer. - * - ``tensor_parallel_degree`` (**smdistributed-modelparallel**>=v1.6) - - int - - 1 - - The number of devices over which the tensor parallel modules will be distributed. - If ``tensor_parallel_degree`` is greater than 1, then ``ddp`` must be set to ``True``. - * - ``fp16`` (**smdistributed-modelparallel**>=v1.10) - - bool - - ``False`` - - To run FP16 training, add ``"fp16"'": True`` to the smp configuration. - Other APIs remain the same between FP16 and FP32. - If ``fp16`` is enabled and when user calls ``smp.DistributedModel``, - the model will be wrapped with ``FP16_Module``, which converts the model - to FP16 dtype and deals with forward pass in FP16. - If ``fp16`` is enabled and when user calls ``smp.DistributedOptimizer``, - the optimizer will be wrapped with ``FP16_Optimizer``. - * - ``fp16_params`` (**smdistributed-modelparallel**>=v1.6) - - bool - - ``False`` - - If ``True``, the parameters of the distributed modules will be initialized in FP16. - * - ``shard_optimizer_state`` (**smdistributed-modelparallel**>=v1.6) - - bool - - ``False`` - - If ``True``, the library shards the optimizer state of all parameters across - the data parallel processes which hold the same parameter. - This optimizer state sharding happens in a balanced manner. - Note that when sharding optimizer state, full optimizer saving is not currently supported. - Please save partial optimizer state. For more information about saving and loading checkpoints with - optimizer state sharding, see `Instructions for Checkpointing with Tensor Parallelism `_. - * - ``prescaled_batch`` (**smdistributed-modelparallel**>=v1.6) - - bool - - ``False`` - - If ``True`` and when ``smp.nn.DistributedTransformerLMHead`` is used - (this is typically used for GPT-2 or GPT-3 models), - the library assumes that the devices in the same tensor parallelism group - receive the same input data. Otherwise, it is assumed that they receive - different examples. To learn more, see :ref:`prescaled-batch`. - * - ``skip_tracing`` (**smdistributed-modelparallel**>=v1.6) - - bool - - False - - Skips the initial tracing step. This can be useful in very large models - where even model tracing at the CPU is not possible due to memory constraints. - * - ``sharded_data_parallel_degree`` (**smdistributed-modelparallel**>=v1.11) - - int - - 1 - - To run a training job using sharded data parallelism, add this parameter and specify a number greater than 1. 
- Sharded data parallelism is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group. - For more information, see `Sharded Data Parallelism - `_. - * - ``sdp_reduce_bucket_size`` (**smdistributed-modelparallel**>=v1.11) - - int - - 5e8 - - Configuration parameter for sharded data parallelism (for ``sharded_data_parallel_degree > 2``). - Specifies the size of PyTorch DDP gradient buckets in number of elements of the default dtype. - * - ``sdp_param_persistence_threshold`` (**smdistributed-modelparallel**>=v1.11) - - int - - 1e6 - - Specifies the size of a parameter tensor in number of elements that can persist at each GPU. Sharded data parallelism splits each parameter tensor across GPUs of a data parallel group. If the number of elements in the parameter tensor is smaller than this threshold, the parameter tensor is not split; this helps reduce communication overhead because the parameter tensor is replicated across data-parallel GPUs. - * - ``sdp_max_live_parameters`` (**smdistributed-modelparallel**>=v1.11) - - int - - 1e9 - - Specifies the maximum number of parameters that can simultaneously be in a recombined training state during the forward and backward pass. Parameter fetching with the AllGather operation pauses when the number of active parameters reaches the given threshold. Note that increasing this parameter increases the memory footprint. - * - ``sdp_hierarchical_allgather`` (**smdistributed-modelparallel**>=v1.11) - - bool - - True - - If set to True, the AllGather operation runs hierarchically: it runs within each node first, and then runs across nodes. For multi-node distributed training jobs, the hierarchical AllGather operation is automatically activated. - * - ``sdp_gradient_clipping`` (**smdistributed-modelparallel**>=v1.11) - - float - - 1.0 - - Specifies a threshold for gradient clipping the L2 norm of the gradients before propagating them backward through the model parameters. When sharded data parallelism is activated, gradient clipping is also activated. The default threshold is 1.0. Adjust this parameter if you have the exploding gradients problem. - - -Parameters for ``mpi`` ----------------------- - -For the ``"mpi"`` key, a dict must be passed which contains: - -* ``"enabled"``: Set to ``True`` to launch the training job with MPI. - -* ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host. - In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker distributed model parallel library maintains - a one-to-one mapping between processes and GPUs across model and data parallelism. - This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process. - If you are using PyTorch, you must restrict each process to its own device using - ``torch.cuda.set_device(smp.local_rank())``. To learn more, see - `Modify a PyTorch Training Script - `_. - - .. important:: - ``process_per_host`` must be less than or equal to the number of GPUs per instance, and typically will be equal to - the number of GPUs per instance. - - For example, if you use one instance with 4-way model parallelism and 2-way data parallelism, - then processes_per_host should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs, - such as an ml.p3.16xlarge. 
- - The following image illustrates how 2-way data parallelism and 4-way model parallelism is distributed across 8 GPUs: - the model is partitioned across 4 GPUs, and each partition is added to 2 GPUs. - - .. image:: smp_versions/model-data-parallel.png - :width: 650 - :alt: 2-way data parallelism and 4-way model parallelism distributed across 8 GPUs - - -* ``"custom_mpi_options"``: Use this key to pass any custom MPI options you might need. - To avoid Docker warnings from contaminating your training logs, we recommend the following flag. - ```--mca btl_vader_single_copy_mechanism none``` - - -.. _ranking-basics: - -Ranking Basics without Tensor Parallelism -========================================= - -The library maintains a one-to-one mapping between processes and available GPUs: -for each GPU, there is a corresponding CPU process. Each CPU process -maintains a “rank” assigned by MPI, which is a 0-based unique index for -the process. For instance, if a training job is launched with 4 -``p3dn.24xlarge`` instances using all its GPUs, there are 32 processes -across all instances, and the ranks of these processes range from 0 to -31. - -The ``local_rank`` of a process is the rank of the process among the -processes in the same instance. This can range from 0 up to the number -of GPUs in the instance, but can be lower if fewer processes than GPUs are -launched in the instance. For instance, in the preceding -example, ``local_rank``\ s of the processes will range from 0 to 7, -since there are 8 GPUs in a ``p3dn.24xlarge`` instance. - -When model parallelism is used together with data parallelism (Horovod for TensorFlow -and DDP for PyTorch), the library partitions the set of processes into -disjoint \ ``mp_group``\ s. An ``mp_group`` is a subset of all processes -that together hold a single, partitioned model replica. - -For instance, if -a single node job is launched with 8 local processes with -``partitions=2`` (meaning the model will be split into 2), there are -four \ ``mp_group``\ s. The specific sets of processes that form the -``mp_group``\ s can be adjusted by the ``placement_strategy`` option. - -- If ``placement_strategy`` is ``spread``, then the four - ``mp_group``\ s are ``[0, 4], [1, 5], [2, 6], [3, 7]``. The - ``mp_rank`` is the rank of a process within each ``mp_group``. For example, - the ``mp_rank`` is 0 for the processes 0, 1, 2, and 3, and the ``mp_rank`` is 1 for - the processes 4, 5, 6, and 7. - - Analogously, the library defines ``dp_group``\ s as sets of processes that - all hold the same model partition, and perform data parallelism among - each other. If ``placement_strategy`` is ``spread``, there are two ``dp_group``\ s: - ``[0, 1, 2, 3]`` and ``[4, 5, 6, 7]``. - - Since each process within the ``dp_group`` holds the same partition of - the model, and makes allreduce calls among themselves. Allreduce for - data parallelism does not take place *across* ``dp_group``\ s. - ``dp_rank`` is defined as the rank of a process within its ``dp_group``. - In the preceding example, the \ ``dp_rank`` of process 6 is 2. - -- If ``placement_strategy`` is ``cluster``, the four ``mp_group``\ s - become ``[0, 1], [2, 3], [4, 5], [6, 7]``, and the the two ``dp_group``\ s become - ``[0, 2, 4, 6]`` and ``[1, 3, 5, 7]``. - -.. 
_ranking-basics-tensor-parallelism: - -Placement Strategy with Tensor Parallelism -========================================== - -In addition to the two placement strategies introduced in the previous section, -the library provides additional placement strategies for extended tensor parallelism features -for PyTorch. The additional placement strategies (parallelism types) are denoted as follows: - -- ``D`` stands for (reduced) data parallelism. -- ``P`` stands for pipeline parallelism. -- ``T`` stands for tensor parallelism. - -With given permutation of the tree letters, the library takes the right-most letter -as the first strategy performs over the global ranks in ascending order. -Contrarily, the parallelism type represented by the left-most letter is performed -over the ranks that are as distant as possible. - -- **Example:** Given 8 devices with ``tp_size() == 2``, - ``pp_size() == 2``, ``rdp_size() == 2`` - - - ``placement_strategy: "DPT"`` gives - - ==== ======== ======= ======= - rank rdp_rank pp_rank tp_rank - ==== ======== ======= ======= - 0 0 0 0 - 1 0 0 1 - 2 0 1 0 - 3 0 1 1 - 4 1 0 0 - 5 1 0 1 - 6 1 1 0 - 7 1 1 1 - ==== ======== ======= ======= - - - ``placement_strategy: "PTD"`` gives - - ==== ======== ======= ======= - rank rdp_rank pp_rank tp_rank - ==== ======== ======= ======= - 0 0 0 0 - 1 1 0 0 - 2 0 0 1 - 3 1 0 1 - 4 0 1 0 - 5 1 1 0 - 6 0 1 1 - 7 1 1 1 - ==== ======== ======= ======= - -Because the neighboring ranks are placed on the same instance with -high-bandwidth NVLinks, it is recommended to place the -parallelism type that has higher bandwidth requirements for your model -on the right-most position in the ``placement_strategy`` string. Because -tensor parallelism often requires frequent communication, placing -``T`` in the right-most position is recommended (as in the default -``"cluster"`` strategy). In many large models, keeping the default of -``"cluster"`` would result in the best performance. - - -.. _prescaled-batch: - -Prescaled Batch -=============== - -``prescaled_batch`` is a configuration parameter that can be useful for -``DistributedTransformerLMHead``, which is used for GPT-2 and GPT-3. - -The way tensor parallelism works is that when a module is distributed, -the inputs to the distributed module in different ``tp_rank``\ s gets -shuffled around in a way that is sliced by the hidden dimension and -scaled by the batch dimension. For example, if tensor parallel degree is -8, the inputs to ``DistributedTransformer`` (a tensor with shape -``[B, S, H]`` where ``B``\ =batch size, ``S``\ =sequence length, -``H``\ =hidden width) in different ``tp_rank``\ s will be communicated -around, and the shapes will become ``[8B, S, H/8]``. Each ``tp_rank`` -has the batch from all the peer ``tp_rank``\ s, but only the slice that -interacts with their local partition of the module. - -By default, the library assumes that each ``tp_rank`` gets assigned a -different batch, and performs the communication described above. If -``prescaled_batch`` is true, then the library assumes that the input -batch is already scaled (and is the same across the ``tp_rank``\ s), and -only does the slicing. In the example above, the library assumes that -input tensor has shape ``[8B, S, H]``, and only converts it into -``[8B, S, H/8]``. So if ``prescaled_batch`` is true, it is the user’s -responsibility to feed the same batch to the ``tp_rank``\ s in the same -``TP_GROUP``. 
This can be done by doing the data sharding based on -``smp.rdp_size()`` and ``smp.rdp_rank()``, instead of ``smp.dp_size()`` -and ``smp.dp_rank()``. When ``prescaled_batch`` is true, the global -batch size is ``smp.rdp_size()`` multiplied by the per-``MP_GROUP`` -batch size. When ``prescaled_batch`` is false, global batch size is -``smp.dp_size()`` multiplied by the per-``PP_GROUP`` batch size. - -If you use pipeline parallelism degree 1, then you can keep -``prescaled_batch`` false (the default option). If you use a pipeline -parallellism degree more than 1, it is recommended to use -``prescaled_batch`` true, so that you can increase per-``MP_GROUP`` -batch size for efficient pipelining, without running into out-of-memory -issues. diff --git a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst deleted file mode 100644 index 9409d69aad..0000000000 --- a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst +++ /dev/null @@ -1,902 +0,0 @@ -############# -Release Notes -############# - -New features, bug fixes, and improvements are regularly made to the SageMaker -model parallelism library. - - -SageMaker Distributed Model Parallel 1.15.0 Release Notes -========================================================= - -*Date: Apr. 27. 2023* - -**Currency Updates** - -* Added support for PyTorch v2.0.0. - Note that the library does not support ``torch.compile`` in this release. - -**New Features** - -* Using sharded data parallelism with tensor parallelism together is now - available for PyTorch 1.13.1. It allows you to train with smaller global batch - sizes while scaling up to large clusters. For more information, see `Sharded - data parallelism with tensor parallelism `_ - in the *Amazon SageMaker Developer Guide*. -* Added support for saving and loading full model checkpoints when using sharded - data parallelism. This is enabled by using the standard checkpointing API, - ``smp.save_checkpoint`` with ``partial=False``. - Before, full checkpoints needed to be created by merging partial checkpoint - files after training finishes. -* `DistributedTransformer `_ - now supports the ALiBi position embeddings. - When using DistributedTransformer, you can set the ``use_alibi`` parameter - to ``True`` to use the Triton-based flash attention kernels. This helps - evaluate sequences longer than those used for training. - -**Bug Fixes** - -* When using tensor parallelism, parameters were initialized multiple times - unncessarily. This release fixed the multiple initialization of parameters - so that each parameter is initialized exactly once. - It not only saves time, but also ensures that the random generator behavior - is similar to the non-tensor parallelism case. - -**Known issues** - -* Model initialization might take longer with PyTorch 2.0 than that with PyTorch 1.13. - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC): - -- SageMaker training container for PyTorch v2.0.0 - - .. code:: - - 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker - -- SageMaker training container for PyTorch v1.13.1 - - .. 
code:: - - 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker - -Binary file of this version of the library for `custom container -`_ users: - -- For PyTorch v2.0.0 - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl - -- For PyTorch v1.13.1 - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl - ----- - -Release History -=============== - -SageMaker Distributed Model Parallel 1.14.0 Release Notes ---------------------------------------------------------- - -*Date: Jan. 30. 2023* - -**Currency Updates** - -* Added support for PyTorch v1.13.1 - -**Improvements** - -* Upgraded the flash-attention (https://github.com/HazyResearch/flash-attention) library to v0.2.6.post1 - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC): - -- SageMaker training container for PyTorch v1.13.1 - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker - - -Binary file of this version of the library for `custom container -`_ users: - -- For PyTorch 1.13.1 - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl - - -SageMaker Distributed Model Parallel 1.13.0 Release Notes ---------------------------------------------------------- - -*Date: Dec. 15. 2022* - -**New Features** - -* Sharded data parallelism now supports a new backend for collectives called *SMDDP Collectives*. - For supported scenarios, SMDDP Collectives are on by default for the AllGather operation. - For more information, see - `Sharded data parallelism with SMDDP Collectives - `_ - in the *Amazon SageMaker Developer Guide*. -* Introduced FlashAttention for DistributedTransformer to improve memory usage and computational - performance of models such as GPT2, GPTNeo, GPTJ, GPTNeoX, BERT, and RoBERTa. - -**Bug Fixes** - -* Fixed initialization of ``lm_head`` in DistributedTransformer to use a provided range - for initialization, when weights are not tied with the embeddings. - -**Improvements** - -* When a module has no parameters, we have introduced an optimization to execute - such a module on the same rank as its parent during pipeline parallelism. - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC): - -- SageMaker training container for PyTorch v1.12.1 - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker - - -Binary file of this version of the library for `custom container -`_ users: - -- For PyTorch 1.12.1 - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl - - -SageMaker Distributed Model Parallel 1.11.0 Release Notes ---------------------------------------------------------- - -*Date: August. 17. 2022* - -**New Features** - -The following new features are added for PyTorch. 
- -* The library implements sharded data parallelism, which is a memory-saving - distributed training technique that splits the training state of a model - (model parameters, gradients, and optimizer states) across data parallel groups. - With sharded data parallelism, you can reduce the per-GPU memory footprint of - a model by sharding the training state over multiple GPUs. To learn more, - see `Sharded Data Parallelism - `_ - in the *Amazon SageMaker Developer Guide*. - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC): - -- DLC for PyTorch 1.12.0 - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker - -Binary file of this version of the library for `custom container -`_ users: - -- For PyTorch 1.12.0 - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl - -SageMaker Distributed Model Parallel 1.10.1 Release Notes ---------------------------------------------------------- - -*Date: August. 8. 2022* - -**Currency Updates** - -* Added support for Transformers v4.21. - - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC): - -- DLC for PyTorch 1.11.0 - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker - - -Binary file of this version of the library for `custom container -`_ users: - -- For PyTorch 1.11.0 - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-28-23-07/smdistributed_modelparallel-1.10.1-cp38-cp38-linux_x86_64.whl - - - -SageMaker Distributed Model Parallel 1.10.0 Release Notes ---------------------------------------------------------- - -*Date: July. 19. 2022* - -**New Features** - -The following new features are added for PyTorch. - -* Added support for FP16 training by implementing smdistributed.modelparallel - modification of Apex FP16_Module and FP16_Optimizer. To learn more, see - `FP16 Training with Model Parallelism - `_. -* New checkpoint APIs for CPU memory usage optimization. To learn more, see - `Checkpointing Distributed Models and Optimizer States - `_. - -**Improvements** - -* The SageMaker distributed model parallel library manages and optimizes CPU - memory by garbage-collecting non-local parameters in general and during checkpointing. -* Changes in the `GPT-2 translate functions - `_ - (``smdistributed.modelparallel.torch.nn.huggingface.gpt2``) - to save memory by not maintaining two copies of weights at the same time. - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC): - -- DLC for PyTorch 1.11.0 - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker - -- DLC for PyTorch 1.12.0 - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker - -Binary file of this version of the library for `custom container -`_ users: - -- For PyTorch 1.11.0 - - .. 
code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl - -- For PyTorch 1.12.0 - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl - - -SageMaker Distributed Model Parallel 1.9.0 Release Notes --------------------------------------------------------- - -*Date: May. 3. 2022* - -**Currency Updates** - -* Added support for PyTorch 1.11.0 - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC): - -- PyTorch 1.11.0 DLC - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker - -Binary file of this version of the library for custom container users: - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-04-20-17-05/smdistributed_modelparallel-1.9.0-cp38-cp38-linux_x86_64.whl - - - -SageMaker Distributed Model Parallel 1.8.1 Release Notes --------------------------------------------------------- - -*Date: April. 23. 2022* - -**New Features** - -* Added support for more configurations of the Hugging Face Transformers GPT-2 and GPT-J models - with tensor parallelism: ``scale_attn_weights``, ``scale_attn_by_inverse_layer_idx``, - ``reorder_and_upcast_attn``. To learn more about these features, please refer to - the following model configuration classes - in the *Hugging Face Transformers documentation*: - - * `transformers.GPT2Config `_ - * `transformers.GPTJConfig `_ - -* Added support for activation checkpointing of modules which pass keyword value arguments - and arbitrary structures in their forward methods. This helps support - activation checkpointing with Hugging Face Transformers models even - when tensor parallelism is not enabled. - -**Bug Fixes** - -* Fixed a correctness issue with tensor parallelism for GPT-J model - which was due to improper scaling during gradient reduction - for some layer normalization modules. -* Fixed the creation of unnecessary additional processes which take up some - GPU memory on GPU 0 when the :class:`smp.allgather` collective is called. - -**Improvements** - -* Improved activation offloading so that activations are preloaded on a - per-layer basis as opposed to all activations for a micro batch earlier. - This not only improves memory efficiency and performance, but also makes - activation offloading a useful feature for non-pipeline parallelism cases. - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers: - -* HuggingFace 4.17.0 DLC with PyTorch 1.10.2 - - .. code:: - - 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04 - - -* The binary file of this version of the library for custom container users - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-04-14-03-58/smdistributed_modelparallel-1.8.1-cp38-cp38-linux_x86_64.whl - - -SageMaker Distributed Model Parallel 1.8.0 Release Notes --------------------------------------------------------- - -*Date: March. 23. 
2022* - -**New Features** - -* Added tensor parallelism support for the `GPT-J model - `_. - When using the GPT-J model of Hugging Face Transformers v4.17.0 with - tensor parallelism, the SageMaker model parallel library automatically - replaces the model with a tensor parallel distributed GPT-J model. - For more information, see `Support for Hugging Face Transformer Models - `_ - in the *Amazon SageMaker Model Parallel Training developer guide*. - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers: - -* HuggingFace 4.17.0 DLC with PyTorch 1.10.2 - - .. code:: - - 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04 - - -The binary file of this version of the library for custom container users: - - .. code:: - - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-03-12-00-33/smdistributed_modelparallel-1.8.0-cp38-cp38-linux_x86_64.whl - - -SageMaker Distributed Model Parallel 1.7.0 Release Notes --------------------------------------------------------- - -*Date: March. 07. 2022* - -**Currency Updates** - -* Support for PyTorch 1.10.2 -* Support for Hugging Face Transformers 4.16.2 - -**Improvements** - -* Additional support for the :ref:`smdmp-pytorch-tensor-parallel`. - - * Added support for FP32 residual addition to avoid overflow (NaN loss values) - for large models with more than 100 billion parameters when using FP16. - This is integrated to the following module: - - * :class:`smp.nn.DistributedTransformerOutputLayer` - - - * Added support for the following two `NVIDIA Megatron fused kernels - `_: - - * Fusion of attention masking and softmax (``fused_softmax``) - * Fusion of bias addition and Gelu activation (``fused_bias_gelu``) - - To learn more about these options and how to use them, - see the :class:`smp.tensor_parallelism` context manager. - - - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers: - - -* PyTorch 1.10.2 - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker - - -SageMaker Distributed Model Parallel 1.6.0 Release Notes --------------------------------------------------------- - -*Date: December. 20. 2021* - -**New Features** - -- **PyTorch** - - - Added extended memory-saving features for PyTorch 1.8.1: - - - `Tensor parallelism `_ - - `Optimizer state sharding `_ - - `Activation checkpointing `_ - - `Activation offloading `_ - - For more information, see the following documentation: - - - `SageMaker distributed model parallel developer guide `_ - - `SageMaker distributed model parallel API documentation for v1.6.0 `_ - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following -AWS Deep Learning Container(s): - -- Deep Learning Container for PyTorch 1.8.1: - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04 - - - -SageMaker Distributed Model Parallel 1.5.0 Release Notes --------------------------------------------------------- - -*Date: November. 03. 
2021* - -**New Features** - -- **PyTorch** - - - Currency update for PyTorch 1.10.0 - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following -AWS Deep Learning Containers: - -- Deep Learning Container for PyTorch 1.10.0: - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker - ----- - -SageMaker Distributed Model Parallel 1.4.0 Release Notes --------------------------------------------------------- - -*Date: June. 29. 2021* - -**New Features** - -- **TensorFlow** - - - Added support for TensorFlow v2.5.0. - - Added support for ``keras.model.fit()``. - -**Migration to AWS Deep Learning Containers** - -This version passed benchmark testing and is migrated to the following -AWS Deep Learning Containers: - -- Deep Learning Container for TensorFlow 2.5.0: - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0 - -- Deep Learning Container for PyTorch 1.9.1: - - .. code:: - - 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04 - ----- - -SageMaker Distributed Model Parallel 1.3.1 Release Notes --------------------------------------------------------- - -- New Features -- Bug Fixes -- Known Issues - -**New Features** - -- **TensorFlow** - - - Exposes a new decorator ``register_post_partition_hook``. This allows - invoking the decorated methods just after model partition but before - executing the first step. For example loading a checkpoint. Refer to - the `SageMaker distributed model parallel API - documentation `__ - for more information. - -**Bug Fixes** - -- **PyTorch** - - - Improved memory efficiency when using active microbatches by clearing - activations at end of each microbatch. - -- **TensorFlow** - - - Fixed issue that caused hangs when training some models with XLA - enabled. - -**Known Issues** - -- **PyTorch** - - - A crash was observed when ``optimizer.step()`` was called for certain - optimizers such as AdaDelta, when the partition on which this method - was called has no local parameters assigned to it after partitioning. - This is due to a bug in PyTorch which `has since been - fixed `__. Till that - makes its way to the next release of PyTorch, only call - ``optimizer.step()`` on processes which have at least one local - parameter. This can be checked like this - ``len(list(model.local_parameters())) > 0``. - - - A performance regression still exists when training on SMP with - PyTorch 1.7.1 compared to 1.6. The rootcause was found to be the - slowdown in performance of ``.grad`` method calls in PyTorch 1.7.1 - compared to 1.6. See the related discussion: - https://github.com/pytorch/pytorch/issues/50636. This issue does not - exist with PyTorch 1.8. - ----- - -SageMaker Distributed Model Parallel 1.3.0 Release Notes --------------------------------------------------------- - -- New Features -- Bug Fixes -- Known Issues - -.. _new-features-1: - -**New Features** - -.. _pytorch-2: - -- **PyTorch** - - Add support for PyTorch 1.8 - - - Adds a new method to DistributedModel ``register_comm_hook`` (for - PyTorch 1.8 and newer only). This method behaves the same as the - corresponding method with the same name in - ``torch.DistributedDataParallel`` API. Refer to the `SageMaker - distributed model parallel API - documentation `__ - for more information. 
- -**Improvements** - -- Adds a configuration ``active_microbatches`` to the SageMaker SDK API - for launching jobs, to control the number of active microbatches - during training. This helps limit memory usage in cases where the - number of microbatches is high. Refer to the `SageMaker Python SDK - parameters API - documentation `__ - for more information. - -- Adds a configuration ``deterministic_server`` to the SageMaker SDK - API for launching jobs, which ensures that the execution server for - pipeline parallelism processes requests in a deterministic order - across data parallel ranks. Refer to the `SageMaker Python SDK - parameters API - documentation `__ - for more information. - -- Parameter passing is now supported in ``module.forward`` methods for - DistributedModel and its submodules. This removes the restriction of - having to pass ``nn.Parameter`` to the ``__init__`` call and making - it a member of the module to use it. ## Bug Fixes - -.. _pytorch-3: - -- **PyTorch** - - - Fixed a case where training hangs due to a module having computation - which requires grads that is not used by the final output of the - module. Now such a situtation raises an error with suggestions on - making such computation compatible. - - - Fixed an issue with buffers which caused the buffers to not be on the - correct device after a model is partitioned, and not be synchronized - across steps (when ``broadcast_buffers`` is True). This could have - caused correctness issues in models with buffers. - -.. _known-issues-1: - -**Known Issues** - -.. _pytorch-4: - -- **PyTorch** - - - ``mp_barrier`` and ``get_mp_process_group`` are wrongly marked as - deprecated methods. Ignore the deprecation warning. - - - A crash was observed when ``optimizer.step()`` was called for certain - optimizers such as AdaDelta, when the partition on which this method - was called has no local parameters assigned to it after partitioning. - This is due to a bug in PyTorch which `has since been - fixed `__. Till that - makes its way to the next release of PyTorch, only call - ``optimizer.step()`` on processes which have at least one local - parameter. This can be checked like this - ``len(list(model.local_parameters())) > 0``. - - - A performance regression still exists when training on SMP with - PyTorch 1.7.1 compared to 1.6. The rootcause was found to be the - slowdown in performance of ``.grad`` method calls in PyTorch 1.7.1 - compared to 1.6. See the related discussion: - https://github.com/pytorch/pytorch/issues/50636. This issue does not - exist with PyTorch 1.8. - ----- - -SageMaker Distributed Model Parallel 1.2.0 Release Notes --------------------------------------------------------- - -- New Features -- Bug Fixes -- Known Issues - -.. _new-features-2: - -**New Features** - -.. _pytorch-5: - -- **PyTorch** - - Add support for PyTorch 1.7.1 - - - Adds support for ``gradient_as_bucket_view`` (PyTorch 1.7.1 only), - ``find_unused_parameters`` (PyTorch 1.7.1 only) and - ``broadcast_buffers`` options to ``smp.DistributedModel``. These - options behave the same as the corresponding options (with the same - names) in ``torch.DistributedDataParallel`` API. Refer to the - `SageMaker distributed model parallel API - documentation `__ - for more information. - - - Adds support for ``join`` (PyTorch 1.7.1 only) context manager, which - is to be used in conjunction with an instance of - ``smp.DistributedModel`` to be able to train with uneven inputs - across participating processes. 
- - - Adds support for ``_register_comm_hook`` (PyTorch 1.7.1 only) which - will register the callable as a communication hook for DDP. NOTE: - Like in DDP, this is an experimental API and subject to change. - -.. _tensorflow-2: - -- **Tensorflow** - - - Adds support for Tensorflow 2.4.1 - -.. _bug-fixes-1: - -**Bug Fixes** - -.. _pytorch-6: - -- **PyTorch** - - - ``Serialization``: Fix a bug with serialization/flattening where - instances of subclasses of dict/OrderedDicts were - serialized/deserialized or internally flattened/unflattened as - regular dicts. - -.. _tensorflow-3: - -- **Tensorflow** - - - Fix a bug that may cause a hang during evaluation when there is no - model input for one partition. - -.. _known-issues-2: - -**Known Issues** - -.. _pytorch-7: - -- **PyTorch** - - - A performance regression was observed when training on SMP with - PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the - slowdown in performance of ``.grad`` method calls in PyTorch 1.7.1 - compared to 1.6.0. See the related discussion: - https://github.com/pytorch/pytorch/issues/50636. - ----- - -SageMaker Distributed Model Parallel 1.1.0 Release Notes --------------------------------------------------------- - -- New Features -- Bug Fixes -- Improvements -- Performance -- Known Issues - -.. _new-features-3: - -**New Features** - -The following sections describe new feature releases that are common -across frameworks and that are framework specific. - -**Common across frameworks*** - -- Custom slicing support (``smp_slice`` method) for objects passed to ``smp.step`` decorated functions - - To pass an object to ``smp.step`` that contains tensors that needs to be - split across microbatches and is not an instance of list, dict, tuple or - set, you should implement ``smp_slice`` method for the object. - - Below is an example of how to use this with PyTorch: - - .. code-block:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # SMP will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - -.. _pytorch-8: - -- **PyTorch** - - - Add support for smp.DistributedModel.cpu() - - ``smp.DistributedModel.cpu()`` - `allgather `__\ s - parameters and buffers across all ``mp_ranks`` and moves them to the - CPU. - - - Add ``trace_memory_usage`` option to ``smp.DistributedModel`` to measure memory usage per module - - Adds ``trace_memory_usage`` option to ``smp.DistributedModel``. This - attempts to measure memory usage per module during tracing. If this is - disabled, memory usage is estimated through the sizes of tensors - returned from the module. This option is disabled by default. - -.. _bug-fixes-2: - -**Bug Fixes** - -.. _pytorch-9: - -- **PyTorch** - - - ``torch.nn.Sequential``: Fix a bug with ``torch.nn.Sequential`` which - causes a failure with the error message : - ``shouldnt go less than 0, there is a bug`` when the inputs to the - first module don’t require grads. - - - ``smp.DistributedModel``: Fix a bug with ``DistributedModel`` - execution when a module has multiple parents. 
The bug surfaces with - the error message: - ``actual_parent should be different than module_execution_stack parent only for torch.nn.ModuleList`` - - - ``apex.optimizers.FusedNovoGrad``: Fix a bug with - ``apex.optimizers.FusedNovoGrad`` which surfaces with the error - message: ``KeyError: 'exp_avg_sq'`` - -**Improvements** - -*Usability* - -.. _pytorch-10: - -- **PyTorch** - - - ``smp.DistributedModel``: Improve the error message when the forward - pass on ``smp.DistributedModel`` is called outside the ``smp.step`` - decorated function. - - - ``smp.load``: Add user friendly error messages when loading - checkpoints with ``smp.load``. - -*Partitioning Algorithm* - -.. _pytorch-11: - -- **PyTorch** - - - Better memory balancing by taking into account the existing modules - already assigned to the parent, while partitioning the children of a - given module. - -**Performance** - -.. _tensorflow-4: - -- **Tensorflow** - - - Addresses long pre-processing times introduced by SMP XLA optimizer - when dealing with large graphs and large number of microbatches. BERT - (large) preprocessing time goes down from 40 minutes to 6 minutes on - p3.16xlarge. - -.. _known-issues-3: - -**Known Issues** - -.. _pytorch-12: - -- **PyTorch** - - - Serialization for Torch in SMP overwrites instances of dict subclass - to be dict itself, instead of the instances of subclass. One of the - use cases which fails because of this issue is when a user implements - a subclass of OrderedDict with the ``__getitem__`` method. After - serialization/deserialization in SMP, indexing on the object will - lead to errors. A workaround is to use the dict keys to access the - underlying item. diff --git a/doc/api/training/smp_versions/archives.rst b/doc/api/training/smp_versions/archives.rst deleted file mode 100644 index a7426e8aec..0000000000 --- a/doc/api/training/smp_versions/archives.rst +++ /dev/null @@ -1,13 +0,0 @@ -.. _smdmp-pt-version-archive: - -.. toctree:: - :maxdepth: 1 - - v1_10_0.rst - v1_9_0.rst - v1_6_0.rst - v1_5_0.rst - v1_4_0.rst - v1_3_0.rst - v1_2_0.rst - v1_1_0.rst diff --git a/doc/api/training/smp_versions/latest.rst b/doc/api/training/smp_versions/latest.rst deleted file mode 100644 index cec4468c54..0000000000 --- a/doc/api/training/smp_versions/latest.rst +++ /dev/null @@ -1,35 +0,0 @@ -############################################### -Use the Library's API to Adapt Training Scripts -############################################### - -The library provides Common APIs that you can use across frameworks, -as well as framework-specific APIs for TensorFlow and PyTorch. - -Select the latest or one of the previous versions of the API documentation -depending on which version of the library you need to use. -To use the library, reference the -**Common API** documentation alongside the framework specific API documentation. - -Version 1.11.0, 1.13.0, 1.14.0, 1.15.0 (Latest) -=============================================== - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - latest/smd_model_parallel_common_api - latest/smd_model_parallel_pytorch - latest/smd_model_parallel_pytorch_tensor_parallel - latest/smd_model_parallel_tensorflow - -To find archived API documentation for the previous versions of the library, -see the following link: - -Documentation Archive -===================== - -.. 
toctree:: - :maxdepth: 1 - - archives diff --git a/doc/api/training/smp_versions/latest/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/latest/smd_model_parallel_common_api.rst deleted file mode 100644 index d1f6b4d45b..0000000000 --- a/doc/api/training/smp_versions/latest/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,517 +0,0 @@ -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -.. contents:: Table of Contents - :depth: 3 - :local: - -The Library's Core APIs ------------------------ - -This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. - - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). - - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). 
- def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. - - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. 
class:: StepOutput - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. - - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. - Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. 
_mpi_basics: - -MPI Basics ----------- - -The library exposes the following basic MPI primitives to its Python API: - -**Global** - -- ``smp.rank()`` : The global rank of the current process. -- ``smp.size()`` : The total number of processes. -- ``smp.get_world_process_group()`` : - ``torch.distributed.ProcessGroup`` that contains all processes. -- ``smp.CommGroup.WORLD``: The communication group corresponding to all processes. -- ``smp.local_rank()``: The rank among the processes on the current instance. -- ``smp.local_size()``: The total number of processes on the current instance. -- ``smp.get_mp_group()``: The list of ranks over which the current model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different replicas of the same model partition. - -**Tensor Parallelism** - -- ``smp.tp_rank()`` : The rank of the process within its - tensor-parallelism group. -- ``smp.tp_size()`` : The size of the tensor-parallelism group. -- ``smp.get_tp_process_group()`` : Equivalent to - ``torch.distributed.ProcessGroup`` that contains the processes in the - current tensor-parallelism group. -- ``smp.CommGroup.TP_GROUP`` : The communication group corresponding to - the current tensor parallelism group. - -**Pipeline Parallelism** - -- ``smp.pp_rank()`` : The rank of the process within its - pipeline-parallelism group. -- ``smp.pp_size()`` : The size of the pipeline-parallelism group. -- ``smp.get_pp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current pipeline-parallelism group. -- ``smp.CommGroup.PP_GROUP`` : The communication group corresponding to - the current pipeline parallelism group. - -**Reduced-Data Parallelism** - -- ``smp.rdp_rank()`` : The rank of the process within its - reduced-data-parallelism group. -- ``smp.rdp_size()`` : The size of the reduced-data-parallelism group. -- ``smp.get_rdp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current reduced data parallelism - group. -- ``smp.CommGroup.RDP_GROUP`` : The communication group corresponding - to the current reduced data parallelism group. - -**Model Parallelism** - -- ``smp.mp_rank()`` : The rank of the process within its model-parallelism - group. -- ``smp.mp_size()`` : The size of the model-parallelism group. -- ``smp.get_mp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current model-parallelism group. -- ``smp.CommGroup.MP_GROUP`` : The communication group corresponding to - the current model parallelism group. - -**Data Parallelism** - -- ``smp.dp_rank()`` : The rank of the process within its data-parallelism - group. -- ``smp.dp_size()`` : The size of the data-parallelism group. -- ``smp.get_dp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current data-parallelism group. -- ``smp.CommGroup.DP_GROUP`` : The communication group corresponding to - the current data-parallelism group. - -.. _communication_api: - -Communication API ------------------ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. 
- These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. data:: smp.RankType - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. - - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. 
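For illustration, a minimal matched ``smp.send`` / ``smp.recv_from`` pair between the first two ranks of an ``MP_GROUP`` might look like the following sketch (shown with the PyTorch import; the payload name and values are placeholders, and the sketch assumes ``smp.mp_size()`` is at least 2). Any picklable Python object can be sent.

.. code:: python

   import smdistributed.modelparallel.torch as smp

   payload = {"step": 100, "note": "hello from mp_rank 0"}

   if smp.mp_rank() == 0:
       # Returns only after mp_rank 1 in the same MP_GROUP posts the matching recv_from call.
       smp.send(payload, 1, smp.RankType.MP_RANK)
   elif smp.mp_rank() == 1:
       received = smp.recv_from(0, rank_type=smp.RankType.MP_RANK)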
- - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. - - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. - - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst deleted file mode 100644 index 05357e673b..0000000000 --- a/doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,944 +0,0 @@ -PyTorch API -=========== - -To use the PyTorch-specific APIs for SageMaker distributed model parallism, -import the ``smdistributed.modelparallel.torch`` package at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. contents:: Topics - :depth: 1 - :local: - -smdistributed.modelparallel.torch.DistributedModel -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. class:: smdistributed.modelparallel.torch.DistributedModel - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. 
Only one - model in the training script can be wrapped with - ``smdistributed.modelparallel.torch.DistributedModel``. - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smdistributed.modelparallel.torch.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smdistributed.modelparallel.torch.step``-decorated - function. - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smdistributed.modelparallel.torch.step``-decorated function. - - **Using DDP** - - If DDP is enabled with the SageMaker model parallel library, do not not place a PyTorch - ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because - the ``DistributedModel`` wrapper will also handle data parallelism. - - Unlike the original DDP wrapper, when you use ``DistributedModel``, - model parameters and buffers are not immediately broadcast across - processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the - ``smdistributed.modelparallel.torch.step``-decorated function when the partition is done. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. 
It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. - - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. - - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. - - - ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``. - This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper. - Please see: `broadcast_buffer `__. - - - ``gradient_as_bucket_view`` (default: False): To be - used with ``ddp=True``. This parameter is forwarded to the underlying - ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view `__. - - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smdistributed.modelparallel.torch.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. function:: backward(tensors, grad_tensors) - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. 
function:: local_named_buffers( ) - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. function:: local_named_modules( ) - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - - Returns the ``state_dict`` that contains local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains a key \ ``_smp_is_partial`` to indicate this is a - partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - .. function:: state_dict( ) - - Returns the ``state_dict`` that contains parameters - for the entire model. It first collects the \ ``local_state_dict``  and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is only called on all such ranks, it can hang. - - .. function:: load_state_dict( ) - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smdistributed.modelparallel.torch.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. function:: join( ) - - A context manager to be used in conjunction with an instance of - ``smdistributed.modelparallel.torch.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True``. This will use the join with the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - .. function:: register_comm_hook( state, callable ) - - **Available for PyTorch 1.8.1 only** - Registers a communication hook which is an enhancement that provides - a flexible hook ``callable`` to users where they can specify how - gradients are aggregated across multiple workers. This method will be called on the wrapped ``DistributedDataParallel`` instance. 
- - Please note that when you register a comm hook you have full control of how the gradients are processed. - When using only data parallelism with Torch DDP you are expected to average grads across data parallel replicas within the hook. - Similarly, when using DistributedModel you have to averaging grads across data parallel replicas within the hook. - In addition to that, you also have to average grads across microbatches within the hook unless you explicitly desire to not average based on your loss function. - See ``average_grads_across_microbatches`` for more information about averaging grads across microbatches. - - This is only supported when ``ddp=True`` and ``overlapping_allreduce=True`` (default). - For more information, see: - `register_comm_hook `__ - in the PyTorch documentation. - - **Behavior of** ``smdistributed.modelparallel.torch.DistributedModel`` **with Tensor Parallelism** - - When a model is wrapped by ``smdistributed.modelparallel.torch.DistributedModel``, the library - immediately traverses the modules of the model object, and replaces the - modules that are supported for tensor parallelism with their distributed - counterparts. This replacement happens in place. If there are no other - references to the original modules in the script, they are - garbage-collected. The module attributes that previously referred to the - original submodules now refer to the distributed versions of those - submodules. - - **Example:** - - .. code:: python - - # register DistributedSubmodule as the distributed version of Submodule - # (note this is a hypothetical example, smp.nn.DistributedSubmodule does not exist) - import smdistributed.modelparallel.torch as smp - - smp.tp_register_with_module(Submodule, smp.nn.DistributedSubmodule) - - class MyModule(nn.Module): - def __init__(self): - ... - - self.submodule = Submodule() - ... - - # enabling tensor parallelism for the entire model - with smp.tensor_parallelism(): - model = MyModule() - - # here model.submodule is still a Submodule object - assert isinstance(model.submodule, Submodule) - - model = smp.DistributedModel(model) - - # now model.submodule is replaced with an equivalent instance - # of smp.nn.DistributedSubmodule - assert isinstance(model.module.submodule, smp.nn.DistributedSubmodule) - - If ``pipeline_parallel_degree`` (equivalently, ``partitions``) is 1, the - placement of model partitions into GPUs and the initial broadcast of - model parameters and buffers across data-parallel ranks take place - immediately. This is because it does not need to wait for the model - partition when ``smdistributed.modelparallel.torch.DistributedModel`` wrapper is called. For other - cases with ``pipeline_parallel_degree`` greater than 1, the broadcast - and device placement will be deferred until the first call of an - ``smdistributed.modelparallel.torch.step``-decorated function happens. This is because the first - ``smdistributed.modelparallel.torch.step``-decorated function call is when the model partitioning - happens if pipeline parallelism is enabled. - - Because of the module replacement during the ``smdistributed.modelparallel.torch.DistributedModel`` - call, any ``load_state_dict`` calls on the model, as well as any direct - access to model parameters, such as during the optimizer creation, - should be done **after** the ``smdistributed.modelparallel.torch.DistributedModel`` call. 
- - Since the broadcast of the model parameters and buffers happens - immediately during ``smdistributed.modelparallel.torch.DistributedModel`` call when the degree of - pipeline parallelism is 1, using ``@smp.step`` decorators is not - required when tensor parallelism is used by itself (without pipeline - parallelism). - - For more information about the library's tensor parallelism APIs for PyTorch, - see :ref:`smdmp-pytorch-tensor-parallel`. - - **Additional Methods of** ``smdistributed.modelparallel.torch.DistributedModel`` **for Tensor Parallelism** - - The following are the new methods of ``smdistributed.modelparallel.torch.DistributedModel``, in - addition to the ones listed in the - `documentation `__. - - .. function:: distributed_modules() - - - An iterator that runs over the set of distributed - (tensor-parallelized) modules in the model - - .. function:: is_distributed_parameter(param) - - - Returns ``True`` if the given ``nn.Parameter`` is distributed over - tensor-parallel ranks. - - .. function:: is_distributed_buffer(buf) - - - Returns ``True`` if the given buffer is distributed over - tensor-parallel ranks. - - .. function:: is_scaled_batch_parameter(param) - - - Returns ``True`` if the given ``nn.Parameter`` is operates on the - scaled batch (batch over the entire ``TP_GROUP``, and not only the - local batch). - - .. function:: is_scaled_batch_buffer(buf) - - - Returns ``True`` if the parameter corresponding to the given - buffer operates on the scaled batch (batch over the entire - ``TP_GROUP``, and not only the local batch). - - .. function:: default_reducer_named_parameters() - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``DP_GROUP``. - - .. function:: scaled_batch_reducer_named_parameters() - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``RDP_GROUP``. - -smdistributed.modelparallel.torch.DistributedOptimizer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. class:: smdistributed.modelparallel.torch.DistributedOptimizer(optimizer, static_loss_scale=1.0, dynamic_loss_scale=False, **dynamic_loss_args) - - An optimizer wrapper for saving and loading optimizer states. - - :param optimizer: An optimizer object. - :type optimizer: object - :param static_loss_scale: Effective only for FP16 training. The default value is ``1.0``. - :type static_loss_scale: float - :param dynamic_loss_scale: Effective only for FP16 training. Set to ``True`` to - use dynamic loss scale. The default value is ``False``. - :type dynamic_loss_scale: boolean - :param dynamic_loss_args: Effective only for FP16 training. - If ``dynamic_loss_scale=True``, you can configure additional scale - parameters for dynamic loss scale. - The following list shows available parameters. - - * ``"init_scale"``: Default is ``2**32`` - * ``"scale_factor"``: Default is ``2.`` - * ``"scale_window"``: Default is ``1000`` - * ``"min_scale"``: Default is ``1`` - * ``"delayed_shift"``: Default is ``1`` - * ``"consecutive_hysteresis"``: Default is ``False`` - :type dynamic_loss_args: dict - - **Example usage of an FP32 Optimizer:** - - .. code:: python - - optimizer = torch.optim.AdaDelta(...) - optimizer = smdistributed.modelparallel.torch.DistributedOptimizer(optimizer) - - **Example usage of an FP16 Optimizer with static loss scale:** - - .. code:: python - - optimizer = torch.optim.AdaDelta(...) 
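   # Wrap the base optimizer. static_loss_scale sets a constant loss-scaling
   # factor and is effective only when FP16 training is enabled.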
- optimizer = smdistributed.modelparallel.torch.DistributedOptimizer( - optimizer, - static_loss_scale=1.0 - ) - - **Example usage of an FP16 Optimizer with dynamic loss scale:** - - .. code:: python - - optimizer = torch.optim.AdaDelta(...) - optimizer = smdistributed.modelparallel.torch.DistributedOptimizer( - optimizer, - static_loss_scale=None, - dynamic_loss_scale=True, - dynamic_loss_args={ - "scale_window": 1000, - "min_scale": 1, - "delayed_shift": 2 - } - ) - - .. tip:: - - After you modify training scripts with - :class:`smdistributed.modelparallel.torch.DistributedModel` and - :class:`smdistributed.modelparallel.torch.DistributedOptimizer`, - use the SageMaker PyTorch estimator's distribution configuration to enable FP16 training. - You simply need to add ``"fp16": True`` to the ``smp_options`` config dictionary's - ``"parameters"`` key as shown in - `Using the SageMaker TensorFlow and PyTorch Estimators - `_. - For more information about available parameters for the ``smp_options`` config, - see :ref:`sm-sdk-modelparallel-general`. - - This wrapper returns an ``optimizer`` object with the following methods overridden: - - .. method:: state_dict( ) - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``\ s to create a full - ``state_dict``. - - .. method:: load_state_dict( ) - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. method:: local_state_dict( ) - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - -smdistributed.modelparallel.torch.nn.FlashAttentionLayer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smdistributed.modelparallel.torch.nn.FlashAttentionLayer(attention_dropout_prob=0.1, attention_head_size=None, scale_attention_scores=True, scale_attn_by_layer_idx=False, layer_idx=None, scale=None, triton_flash_attention=False, use_alibi=False) - - This class supports - `FlashAttention `_ - for PyTorch 2.0. - It takes the ``qkv`` matrix as an argument through its ``forward`` class method, - computes attention scores and probabilities, - and then operates the matrix multiplication with value layers. - - Through this class, the smp library supports - custom attention masks such as Attention with - Linear Biases (ALiBi), and you can activate them by setting - ``triton_flash_attention`` and ``use_alibi`` to ``True``. - - Note that the Triton flash attention does not support dropout - on the attention probabilities. It uses standard lower triangular - causal mask when causal mode is enabled. It also runs only - on P4d and P4de instances, with fp16 or bf16. - - This class computes the scale factor to apply when computing attention. - By default, ``scale`` is set to ``None``, and it's automatically calculated. - When ``scale_attention_scores`` is ``True`` (which is default), you must pass a value - to ``attention_head_size``. 
When ``scale_attn_by_layer_idx`` is ``True``, - you must pass a value to ``layer_idx``. If both factors are used, they are - multiplied as follows: ``(1/(sqrt(attention_head_size) * (layer_idx+1)))``. - This scale calculation can be bypassed if you specify a custom scaling - factor to ``scale``. In other words, if you specify a value to ``scale``, the set of parameters - (``scale_attention_scores``, ``attention_head_size``, ``scale_attn_by_layer_idx``, ``layer_idx``) - is overridden and ignored. - - **Parameters** - - * ``attention_dropout_prob`` (float): (default: 0.1) specifies dropout probability - to apply to attention. - * ``attention_head_size`` (int): Required when ``scale_attention_scores`` is True. - When ``scale_attention_scores`` is passed, this contributes - ``1/sqrt(attention_head_size)`` to the scale factor. - * ``scale_attention_scores`` (boolean): (default: True) determines whether - to multiply 1/sqrt(attention_head_size) to the scale factor. - * ``layer_idx`` (int): Required when ``scale_attn_by_layer_idx`` is ``True``. - The layer id to use for scaling attention by layer id. - It contributes 1/(layer_idx + 1) to the scaling factor. - * ``scale_attn_by_layer_idx`` (boolean): (default: False) determines whether - to multiply 1/(layer_idx + 1) to the scale factor. - * ``scale`` (float) (default: None): If passed, this scale factor will be - applied bypassing the all of the previous arguments. - * ``triton_flash_attention`` (bool): (default: False) If passed, Triton - implementation of flash attention will be used. This is necessary to supports - Attention with Linear Biases (ALiBi) (see next arg). Note that this version - of the kernel doesn’t support dropout. - * ``use_alibi`` (bool): (default: False) If passed, it enables Attention with - Linear Biases (ALiBi) using the mask provided. - - .. method:: forward(self, qkv, attn_mask=None, causal=False) - - Returns a single ``torch.Tensor`` ``(batch_size x num_heads x seq_len x head_size)``, - which represents the output of attention computation. - - **Parameters** - - * ``qkv``: ``torch.Tensor`` in the form of ``(batch_size x seqlen x 3 x num_heads x head_size)``. - * ``attn_mask``: ``torch.Tensor`` in the form of ``(batch_size x 1 x 1 x seqlen)``. - By default it is ``None``, and usage of this mask needs ``triton_flash_attention`` - and ``use_alibi`` to be set. See how to generate the mask in the following code snippet. - * ``causal``: When passed, it uses the standard lower triangular mask. The default is ``False``. - - When using ALiBi, it needs an attention mask prepared like the following. - - .. 
code:: python - - def generate_alibi_attn_mask(attention_mask, batch_size, seq_length, - num_attention_heads, alibi_bias_max=8): - - device, dtype = attention_mask.device, attention_mask.dtype - alibi_attention_mask = torch.zeros( - 1, num_attention_heads, 1, seq_length, dtype=dtype, device=device - ) - - alibi_bias = torch.arange(1 - seq_length, 1, dtype=dtype, device=device).view( - 1, 1, 1, seq_length - ) - m = torch.arange(1, num_attention_heads + 1, dtype=dtype, device=device) - m.mul_(alibi_bias_max / num_attention_heads) - alibi_bias = alibi_bias * (1.0 / (2 ** m.view(1, num_attention_heads, 1, 1))) - - alibi_attention_mask.add_(alibi_bias) - alibi_attention_mask = alibi_attention_mask[..., :seq_length, :seq_length] - if attention_mask is not None and attention_mask.bool().any(): - alibi_attention_mask.masked_fill( - attention_mask.bool().view(batch_size, 1, 1, seq_length), float("-inf") - ) - - return alibi_attention_mask - -smdistributed.modelparallel.torch Context Managers and Util Functions -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smdistributed.modelparallel.torch.model_creation(tensor_parallelism=False, dtype=None, **tensor_parallel_config) - - Context manager to create a ``torch`` model. This API combines both the - :class:`smdistributed.modelparallel.torch.tensor_parallelism` and - :class:`smdistributed.modelparallel.torch.delay_param_initialization` decorators, - so you can simply use this single context when creating the torch model. - - :param tensor_parallelism: Whether to enable tensor parallelism during model creation. - :type tensor_parallelism: boolean - :param dtype: The dtype to use when creating the model. It has the following rules. - - * If dtype is specified, it will be used during model creation. - * If dtype is not specified, the default dtype will be used during model creation, - which is usually FP32. This is for the best performance on CPU. - * Any model that causes out-of-memory problems with FP32 initialization - is recommended to be created with - :class:`smdistributed.modelparallel.torch.delayed_parameter_initialization`. - * ``FP16_Module`` casts the model back to FP16 if FP16 training is enabled - with the ``smp`` config. For more inforamtion about FP16 training - in SageMaker with the model parallel library, see `FP16 Training - `_ - in the *Amazon SageMaker Developer Guide*. - - :type dtype: ``torch.dtype`` - :param tensor_parallel_config: kwargs to specifiy other tensor parallel configs. - This is not used if ``tensor_parallelism`` is ``False``. - :type tensor_parallel_config: dict - - **Example Usage:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - with smp.model_creation( - tensor_parallelism=smp.tp_size() > 1, - dtype=torch.float16 if args.fp16 else torch.get_default_dtype() - ): - model = MyModel(...) - -.. function:: smdistributed.modelparallel.torch.partition(index) - - :param index: The index of the partition. - :type index: int - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smdistributed.modelparallel.torch.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smdistributed.modelparallel.torch.partition`` contexts are ignored. 
Any module that is not placed in - any ``smdistributed.modelparallel.torch.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smdistributed.modelparallel.torch.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smdistributed.modelparallel.torch.partition`` context. - - Example: - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. data:: smdistributed.modelparallel.torch.amp.GradScaler - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smdistributed.modelparallel.torch.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. function:: smdistributed.modelparallel.torch.delay_param_initialization(enabled=True) - - If enabled, it delays the initialization of parameters - to save CPU memory. That is, parameter initialization takes place - after the model is partitioned on GPUs. - -.. function:: smdistributed.modelparallel.torch.get_world_process_group( ) - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smdistributed.modelparallel.torch.get_mp_process_group( ) - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smdistributed.modelparallel.torch.get_dp_process_group( ) - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smdistributed.modelparallel.torch.is_initialized( ) - - Returns ``True`` if ``smdistributed.modelparallel.torch.init`` has already been called for the - process, and ``False`` otherwise. - -.. function::smp.is_tracing( ) - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smdistributed.modelparallel.torch.nn.FusedLayerNorm - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smdistributed.modelparallel.torch.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smdistributed.modelparallel.torch.optimizers.FusedNovoGrad - - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smdistributed.modelparallel.torch.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. 
This requires ``apex`` to - be installed on the system. - -.. data:: smdistributed.modelparallel.torch.optimizers.FusedLamb - - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smdistributed.modelparallel.torch.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. _pytorch_saving_loading: - -smdistributed.modelparallel.torch APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smdistributed.modelparallel.torch.save(obj, f, partial=True, pickel_module=picklemodule, pickle_protocol=2, ) - - Saves an object. This operation is similar to `torch.save() - `_, except that - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): A saved object. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``):  When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smdistributed.modelparallel.torch.save()``, you must - set ``partial=True``. - - ``pickle_module`` (picklemodule, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol``  (int, default=2): Can be specified to - override the defaultprotocol. - -.. function:: smdistributed.modelparallel.torch.load(f, map_location, pickle_module, pickle_load_args, partial=True) - - Loads an object saved with ``smdistributed.modelparallel.torch.save()`` from a file. - - Similar to, `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize file). - - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. function:: smdistributed.modelparallel.torch.save_checkpoint(path, tag, partial=True, model=None, optimizer=None, user_content=None, translate_if_full=True, num_kept_partial_checkpoints=None) - - Saves a checkpoint. While :class:`smdistributed.modelparallel.torch.save` saves - model and optimizer objects, - this function checkpoints model and optimizer and saves the checkpoints as separate files. - It creates checkpoint folders in the following structure. - - .. 
code:: text - - - path - - ${tag}_partial (folder for partial checkpoint) - - model_rankinfo.pt - - optimizer_rankinfo.pt - - fp16_states_rankinfo.pt - - user_content.pt - - $tag (checkpoint file for full checkpoint) - - user_content_$tag (user_content file for full checkpoint) - - newest (a file that indicates the newest checkpoint) - - **Parameters** - - * ``path`` (str) (required): Path to save the checkpoint. The library creates - the directory if it does not already exist. - For example, ``/opt/ml/checkpoint/model_parallel``. - * ``tag`` (str) (required): A tag for the current checkpoint, usually the train - steps. Note: tag needs to be the same across all ranks (GPU workers). - When ``partial=False`` this will be the checkpoint file name. - * ``partial`` (boolean) (default: True): Whether to save the partial checkpoint. - * ``model`` (:class:`smdistributed.modelparallel.torch.DistributedModel`) - (default: None): The model to save. It needs to an ``smp.DistributedModel`` object. - * ``optimizer`` (:class:`smdistributed.modelparallel.torch.DistributedOptimizer`) - (default: None): The optimizer to save. It needs to be an ``smp.DistributedOptimizer`` object. - * ``user_content`` (any) (default: None): User-defined content to save. - * ``translate_if_full`` (boolean) (default: True): Whether to translate the - full ``state_dict`` to HF ``state_dict`` if possible. - * ``num_kept_partial_checkpoints`` (int) (default: None): The maximum number - of partial checkpoints to keep on disk. - -.. function:: smdistributed.modelparallel.torch.resume_from_checkpoint(path, tag=None, partial=True, strict=True, load_optimizer=True, load_sharded_optimizer_state=True, translate_function=None) - - While :class:`smdistributed.modelparallel.torch.load` loads saved - model and optimizer objects, this function resumes from a saved checkpoint file. - - **Parameters** - - * ``path`` (str) (required): Path to load the checkpoint. - * ``tag`` (str) (default: None): Tag of the checkpoint to resume. If not provided, - the library tries to locate the newest checkpoint from the saved newest file. - * ``partial`` (boolean) (default: True): Whether to load the partial checkpoint. - * ``strict`` (boolean) (default: True): Load with strict load, no extra key or - missing key is allowed. - * ``load_optimizer`` (boolean) (default: True): Whether to load ``optimizer``. - * ``load_sharded_optimizer_state`` (boolean) (default: True): Whether to load - the sharded optimizer state of a model. - It can be used only when you activate - the `sharded data parallelism - `_ - feature of the SageMaker model parallel library. - When this is ``False``, the library only loads the FP16 - states, such as FP32 master parameters and the loss scaling factor, - not the sharded optimizer states. - * ``translate_function`` (function) (default: None): function to translate the full - checkpoint into smdistributed.modelparallel format. - For supported models, this is not required. - - **Example usage** - - .. code:: python - - # Save - smp.save_checkpoint( - checkpoint_dir, - tag=f"total_steps{total_steps}", - partial=True, - model=model, - optimizer=optimizer, - user_content=user_content - num_kept_partial_checkpoints=args.num_kept_checkpoints) - - # Load: this will automatically load the newest checkpoint - user_content = smp.resume_from_checkpoint(path, partial=partial) - -.. 
_pytorch_saving_loading_instructions: - -General instruction on saving and loading ------------------------------------------ - -The library can save partial or full checkpoints. - -- For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. -- For full checkpoints, the library saves a single checkpoint that contains - entire model parameters. - -When **saving** using ``smdistributed.modelparallel.torch.save()``, each rank only holds its own -parameters. If you want to save the full model, there will be some -communication between the ranks to create the full model. If you save -checkpoints often, you should save partial checkpoints for best -performance. - -When **loading** using ``smdistributed.modelparallel.torch.load()``, the library can load either partial or | -full checkpoints or full checkpoints saved by a non-model-parallel model. If you -want to resume training with a non-model-parallel model or do inference, you need -a full checkpoint. - -The following is an example of how you can save and load a checkpoint: - -.. code:: python - - import smdistributed.modelparallel.torch as smp - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...) - - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data racing - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. - # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst b/doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst deleted file mode 100644 index 2c2a7b1f2f..0000000000 --- a/doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst +++ /dev/null @@ -1,896 +0,0 @@ -.. 
_smdmp-pytorch-tensor-parallel: - -PyTorch API for Tensor Parallelism -================================== - -SageMaker distributed tensor parallelism works by replacing specific submodules -in the model with their distributed implementations. The distributed modules -have their parameters and optimizer states partitioned across tensor-parallel -ranks. This is to compute the same output as it would have been computed by -the original modules. Since tensor parallelism occurs across data-parallel -ranks, a rank might collect slices of the activations corresponding to the -data shards on other devices that are part of the same tensor parallelism group. - -You can enable or disable tensor parallelism for specific parts of the model. -Within the enabled parts, the replacements with distributed modules will take -place on a best-effort basis for those module supported for tensor parallelism. -Alternatively, you can directly import and use the library’s distributed -modules in the model definition. - -Some of the supported modules (such as ``smdistributed.modelparallel.torch.nn.Transformer``) are high-level -blocks that contain many operations. Because custom implementations -(as opposed to the built-in PyTorch modules) are typically used for these -high-level blocks, the library offers an API that you can use to register -specific distributed versions with such custom modules (provided that they -are functionally equivalent). This allows the library to automatically replace -the occurrences of such PyTorch modules with their distributed counterparts -provided by the library. -For more information, see the following topics. - -.. contents:: Topics - :depth: 3 - :local: - -.. _registering-tp-modules: - -Registering Tensor Parallelism Distributed Modules --------------------------------------------------- - -Although PyTorch natively provides some of the commonly used (and -tensor-parallelizable) building blocks such as Transformer, users often -use custom implementations for such higher-level modules. To distribute -such modules with tensor parallelism, you need to register the -distributed modules to the custom module implementation in your class, -so that the library knows how to distribute the custom module. When you -register the distributed modules, make sure the custom module that you -use is functionally equivalent to the distributed module. You can verify -this by taking a look at the equivalent reference implementations in the -:ref:`smdmp-tp-appendix`. -These implementations are functionally equivalent to their distributed -versions in ``smdistributed.modelparallel.torch.nn`` module. - -.. class:: smdistributed.modelparallel.torch.tp_register(dist_module, init_hook=None, forward_hook=None, return_hook=None) - - - A decorator class that registers the ``dist_module`` class with - the module class that it is attached to. The hooks can be used to - adapt to different interfaces used with ``__init__`` and - ``forward`` methods. - - **Arguments:** - - - ``dist_module``: A subclass of ``smdistributed.modelparallel.torch.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smdistributed.modelparallel.torch.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. 
Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. - - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. - - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer - # as the distributed version of MyTransformer - @smp.tp_register(smp.nn.DistributedTransformer, init_hook=init_hook) - class MyTransformer(nn.Module): - def __init__(self, config): - ... - - def forward(self, hidden_states, attention_mask): - ... - -.. function:: smdistributed.modelparallel.torch.tp_register_with_module(module_cls, dist_module, init_hook=None, forward_hook=None, return_hook=None) - - - When you do not have direct access to model definition code, you - can use this API to similarly register a distributed module with - an existing module class. - - - **Arguments:** - - - ``module_cls``: The existing module class that will be - distributed. - - ``dist_module``: A subclass of ``smdistributed.modelparallel.torch.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smdistributed.modelparallel.torch.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. - - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. 
- - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - from somelibrary import MyTransformer - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer as the distributed version of MyTransformer - smp.tp_register_with_module(MyTransformer, - smp.nn.DistributedTransformer, - init_hook=init_hook) - -.. _smdmp-supported-modules-for-tp: - -Supported Modules for Tensor Parallelism ----------------------------------------- - -The following modules are supported for tensor parallelism. - -.. contents:: Topics - :depth: 3 - :local: - -.. _tp-module-api: - -Tensor Parallelism Module APIs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- :class:`smdistributed.modelparallel.torch.nn.DistributedLinear` (implements ``nn.Linear``) -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLMHead` -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedAttentionLayer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerOutputLayer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedEmbedding` - -.. class:: smdistributed.modelparallel.torch.nn.DistributedLinear(in_features, out_features) - - Tensor-parallel implementation of the ``nn.Linear`` class. - Functionally equivalent to an ``nn.Linear`` module with the same - ``in_features`` and ``out_features``. In other words, - ``in_features`` and ``out_features`` are the number of *global* - channels across tensor-parallel ranks. - - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - - **Arguments:** - - - ``in_features``: The total number of input channels for the - linear layer across all tensor-parallel ranks. - - ``out_features``: The total number of output channels for the - linear layer across all tensor-parallel ranks. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True) - - Constructs a distributed transformer model, including embeddings - and a single LM head. A word embedding of size - ``(vocab_size, hidden_size)`` is created, as well as a positional - embedding of size ``(num_positions, hidden_size)``, and the - embeddings are added together. If ``num_token_types`` is larger - than 0, a separate embedding of size - ``(num_token_types, hidden_size)`` is created, and further added - on top. - - - The embeddings are fed through a ``DistributedTransformer``, and - if ``add_lm_head`` is ``True``, the output passes through a single - LM head, which is a linear module without bias whose weight is - tied to the word embeddings. - - See :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` for descriptions of the rest - of the arguments. 
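For orientation, here is a minimal, hypothetical sketch of how this module might
be instantiated inside a model that is later wrapped with ``smp.DistributedModel``.
The configuration values and the wrapper class are illustrative only, and the
input tuple follows the ``forward`` convention described below.

.. code:: python

   import torch
   import torch.nn as nn
   import smdistributed.modelparallel.torch as smp

   class CausalLMModel(nn.Module):  # hypothetical wrapper module
       def __init__(self):
           super().__init__()
           # GPT-style decoder: causal_mask_size caps the sequence length,
           # and add_lm_head ties the output projection to the word embeddings.
           self.lm = smp.nn.DistributedTransformerLMHead(
               num_layers=24,
               num_attention_heads=16,
               attention_head_size=64,
               hidden_size=1024,
               intermediate_size=4096,
               vocab_size=50257,
               num_positions=1024,
               causal_mask_size=1024,
               add_lm_head=True,
           )

       def forward(self, input_ids, attention_mask, labels):
           # position ids for a batch of shape [N, S]
           position_ids = torch.arange(
               input_ids.size(1), device=input_ids.device
           ).unsqueeze(0).expand_as(input_ids)
           # token_type_ids is None, so no token type embedding is used
           return self.lm(
               (input_ids, attention_mask, None, position_ids, labels)
           )

   model = smp.DistributedModel(CausalLMModel())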
- - **Methods:** - - - ``forward(self, inputs)`` - - - If ``add_cross_attention`` is ``True``, ``inputs`` must be a - tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, cross_states, cross_states, cross_mask, labels)``. - - Otherwise, ``inputs`` must be a tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, labels)``. - - If ``token_type_ids`` is ``None``, token type embedding will - not be used. - - ``input_ids`` is assumed to be of shape ``[N, S]``, where - ``N`` is the batch size and ``S`` is sequence length. - - ``attention_mask`` is assumed to be a 0-1 tensor of shape - ``[N, S]``, where 1 represents a masked position. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - - A sequence of :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer`\ s, whose - number is given by ``num_layers`` argument. For the other - arguments and methods, refer to - :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer`. - - If both ``pre_layernorm`` and ``post_layernorm`` are ``True``, - layer normalization is applied to both the input and the output of - the ``DistributedTransformer``, in addition to the intermediate - attention and transformer-output layers. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - - Tensor-parallel implementation of a single transformer layer. - Number of attention heads, hidden size, and intermediate size - refer to the global quantities across all tensor-parallel ranks. - - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - **Arguments:** - - - ``num_attention_heads``: The total number of attention heads - across tensor-parallel ranks - - ``attention_head_size``: The number of channels of a single - attention head. - - ``hidden_size``: The hidden dimension of the transformer. The - input tensor ``hidden_states`` is assumed to have its last - dimension size equal to ``hidden_size``. - - ``intermediate_size``: The number of output channels in the - first linear transformation of the transformer output layer. - ``DistributedTransformerOutputLayer`` first maps - ``hidden_size`` dimensions of its input tensor into - ``intermediate_size`` dimensions, and then maps it back into - ``hidden_size`` dimensions. - - ``attention_dropout_prob``: The dropout probability applied to - the attention probabilities. - - ``hidden_dropout_prob``: The dropout probability used in - dropout layers other than the one applied to the attention - probabilities. - - ``activation``: Choice of activation function to use at the - output layer. Must be ``"gelu"`` or ``"relu"``. - - ``layernorm_epsilon``: The epsilon added to the denominator of - layer normalization for numerical stability. 
- - ``initializer_range``: If ``use_normal_initialization`` is - ``True``, the standard deviation of the normal random variable - to initialize the weights with. - - ``use_normal_initialization``: If ``True``, the weights are - initialized with normal distribution with standard deviation - given by ``initializer_range``. Otherwise, default PyTorch - initialization is used. - - ``causal_mask_size``: If ``None``, no causal mask is used on - attentions. Otherwise, should be set to maximum sequence length - to apply a causal mask to the attention scores. This is used, - for instance, in GPT-2. - - ``add_cross_attention``: If ``True``, a cross-attention layer - will be added after the self-attention block. The - cross-attention layer computes the attention keys and values - based on the ``cross_states`` input (instead of - ``hidden_states`` input, as in self-attention. This is used in - the decoder block of encoder-decoder architectures. For - encoder-only architectures that only use self-attention, this - should be kept ``False``. - - ``pre_layernorm``: If ``True``, inserts layer normalization at - the input. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - ``post_layernorm``: If ``True``, inserts layer normalization at - the output. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - ``use_alibi`` (bool, default False): Activates Attention with - Linear Biases (ALiBi) for attention computation. - ALiBi facilitates efficient extrapolation on input sequences - and thus improves training efficiency. - The library enables ALiBi by using the `Triton - flash attention kernel - `_. - Refer to https://arxiv.org/abs/2108.12409 for more - details on the technique. - (Available from - the SageMaker model parallelism library v1.15.0.) - - ``alibi_bias_max`` (int, default 8): Defines the ALiBi base - value for mask generation. (Available from - the SageMaker model parallelism library v1.15.0.) - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the transformer - layer. - - - **Arguments:** - - - If ``add_cross_attention=False``, ``inputs`` must be a - tuple ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, where ``N`` is the batch - size, and ``S`` is the sequence length. - - If ``add_cross_attention=True``, ``inputs`` must be a - tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is assumed to be a tensor of - dimensions ``[N, S_1, H]``, where ``N`` is batch size, - ``S_1`` is sequence length, and ``H`` is ``hidden_size``. - ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_1]``, where ``N`` is the batch - size, and ``S_1`` is the sequence length, and - ``cross_mask`` is assumed to be a tensor of size - ``[N, 1, 1, S_2]``. Keys and values for the attention - heads in the cross-attention layer (but not the - self-attention layer) are computed using - ``cross_states``, and ``cross_mask`` is applied as the - attention mask in the cross-attention layer (but not the - self-attention layer). 
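To make the shape conventions above concrete, the following is a hypothetical
self-attention-only call (``add_cross_attention=False``); the batch size,
sequence length, and layer configuration are arbitrary, and the snippet is
assumed to run inside a training job where the library has already been
initialized with ``smp.init()``.

.. code:: python

   import torch
   import smdistributed.modelparallel.torch as smp

   layer = smp.nn.DistributedTransformerLayer(
       num_attention_heads=16,
       attention_head_size=64,
       hidden_size=1024,
       intermediate_size=4096,
   )

   N, S, H = 8, 512, 1024                    # batch size, sequence length, hidden_size
   hidden_states = torch.rand(N, S, H)
   attention_mask = torch.zeros(N, 1, 1, S)  # an all-zero mask leaves every position attended

   # Returns a (hidden_states, attention_mask) tuple, as described next.
   out_states, out_mask = layer((hidden_states, attention_mask))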
- - - **Returns:** - - - If ``add_cross_attention=False``, a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is the output of the transformer, and - ``attention_mask`` is the same the ``attention_mask`` - argument. - - If ``add_cross_attention=True``, a tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is the output of the transformer, - and the next three tensors are the same as the input - arguments. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True) - - A distributed implementation for the attention block. Includes the - computation of the self- or cross-attention (context layer), - followed by a linear mapping and dropout, which is optionally - followed by the residual-connection and layer normalization. - - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - **Arguments:** - - - See :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` for descriptions of the - arguments. - - ``cross_attention``: If ``True``, it computes the attentions - with respect to the ``cross_states`` tensor of the ``forward`` - method input tuple. (Default: ``False``) - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the attention - layer. - - - **Arguments:** - - - If ``cross_attention=False``, ``inputs`` must be a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, where ``N`` is the - batch size, and ``S`` is the sequence length. - - If ``cross_attention=True``, ``inputs`` must be a tuple - ``(hidden_states, cross_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S_1, H]``, where ``N`` is batch size, ``S_1`` is - sequence length, and ``H`` is ``hidden_size``. - ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_2]``, where ``N`` is the batch - size, and ``S_2`` is the sequence length. Keys and values - for the attention heads are computed using - ``cross_states``. - - - **Returns:** - - - A single tensor that is the output of the attention - layer. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False) - - - Distributed implementation of a single transformer output layer. A - single :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` with - ``add_cross_attention=False`` consists of a single - ``DistributedAttentionLayer`` immediately followed by a single - ``DistributedTransformerOutputLayer``. 
The latter linearly maps - the last channel of the input tensor from ``hidden_size`` to - ``intermediate_size``, and then maps it back to ``hidden_size``. - - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - **Arguments:** - - - See :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` for descriptions of the - arguments. - - ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow - (NaN loss values) for large models with more than 100 billion parameters - when using FP16. (Default: False) - -.. class:: smdistributed.modelparallel.torch.nn.DistributedEmbedding(num_embeddings,embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False,_skip_scatter_and_merge=False,) - - - Distributed implementation of a single Embedding Layer. Currently - only supports splitting across the embedding_dim. - - **Arguments:** - - - See :class:`smdistributed.modelparallel.torch.nn.DistributedEmbedding` for descriptions of the - arguments. - -.. _enabling-tp: - -Enabling Tensor Parallelism -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -There are two ways tensor parallelism can be enabled. - -First, you can use -the distributed module implementations in ``smdistributed.modelparallel.torch.nn`` module directly in -your model definition. See :ref:`smdmp-supported-modules-for-tp` -for a complete list of built-in distributed modules. Here is an example -of how this can be done: - -.. code:: python - - import torch.nn as nn - import smdistributed.modelparallel.torch as smp - - class TransformerModel: - def __init__(self): - self.embedding = nn.Embedding(vocab_size, hidden_size) - - # directly instantiate smp.nn.DistributedTransformer and use it - self.encoder = smp.nn.DistributedTransformer(num_layers, hidden_size, **kwargs) - - self.pooler = nn.Linear(hidden_size, hidden_size) - - def forward(self, hidden_states): - emb_out = self.embedding(hidden_states) - enc_out = self.encoder(emb_out) - return self.pooler(enc_out) - -Second, you can enable tensor parallelism for specific modules or blocks -of code, which will automatically enable tensor parallelism for the -supported modules within that scope. To do this, you can use the -following API: - -.. decorator:: smdistributed.modelparallel.torch.tensor_parallelism(enabled=True, **kwargs) - - - A context manager that enables or disables tensor parallelism for - any supported module that is created inside. If there are nested - contexts, the innermost overrides the rest. If there are - multiple supported modules created within the context, where one - is the submodule of the other, only the outermost module will be - distributed. If a supported module shares weights with another - (supported or unsupported) module, or if its hyperparameters do - not support distribution (e.g., not divisible by the tensor - parallelism degree), tensor parallelism will **not** be enabled - for this module even if this API is used. - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - with smp.tensor_parallelism(): - self.m0 = nn.Linear(20, 20) # will be distributed - with smp.tensor_parallelism(enabled=False): - self.m1 = nn.Linear(20, 20) # will not be distributed - - - ``kwargs`` - Keyword arguments that can be used to modify the configurations of - the distributed modules created inside the context. 
- If a keyword argument provided through it matches any ``__init__`` method arguments - of a ``DistributedModule`` that substitutes a module created inside - the ``smdistributed.modelparallel.torch.tensor_parallelism`` context, this keyword will override - the value defined in the ``init_hook``. - - - (*For v1.7.0 and later*) Through the following additional keyword arguments, - the library supports `NVIDIA Megatron’s fused kernels - `_ - - - ``fused_softmax`` (bool) - Fusion of attention masking and softmax. - By default, it is set to ``True``. You can deactivate it by setting - ``fused_softmax=False`` in the ``smdistributed.modelparallel.torch.tensor_parallelism`` context manager. - - ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation. - By default, it is set to ``False``. You can activate it by setting - ``fused_bias_gelu=True`` in the ``smdistributed.modelparallel.torch.tensor_parallelism`` context manager. - - - -.. function:: smdistributed.modelparallel.torch.set_tensor_parallelism(module, enabled=True, **kwargs) - - - Enables or disables tensor parallelism for the supported - submodules of ``module``. If enabling, the outermost supported - modules will be distributed. If disabling, tensor parallelism will - be disabled for the entire module subtree of ``module``. Unlike - the context manager, this API can be used after the model creation - (but before wrapping with :class:`smdistributed.modelparallel.torch.DistributedModel`), so direct - access to model definition code is not required. If a supported - module shares weights with another (supported or unsupported) - module, or if its hyperparameters do not support distribution - (e.g., not divisible by the tensor parallelism degree), tensor - parallelism will **not** be enabled for this module. - - Keyword arguments ``kwargs`` can be used to modify the - configurations of the distributed modules created inside the - context. If a keyword argument provided here matches any - ``__init__`` method arguments of a :class:`smdistributed.modelparallel.torch.DistributedModel` that - substitutes a module created inside the ``smdistributed.modelparallel.torch.tensor_parallelism`` - context, this keyword will override the value defined in the - ``init_hook``. - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - model = MyModel() - smp.set_tensor_parallelism(model.encoder, True) - smp.set_tensor_parallelism(model.encoder.embedding, True) - - # outermost supported submodules in model.encoder will be distributed, except for - # model.encoder.embedding - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - -.. _activation-checkpointing-api: - -Activation Checkpointing APIs ------------------------------ - -``smdistributed.modelparallel`` provides three APIs to enable -activation checkpointing: one for checkpointing modules, -one for checkpointing sequential modules, and -one for checkpointing pretrained models. - -For a conceptual guide and examples, see -`Activation Checkpointing `_ -in the *SageMaker's Distributed Model Parallel developer guide*. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint(module, *args, preserve_rng_state=True) - - - Checkpoints the module passed. Throws error if, during manual - partitioning, all children of module are not on same rank as the - module itself, i.e. the module tree is split across multiple - partitions. 
During auto-partitioning, if the module is split - across multiple partitions, then this call is ignored(with a - warning). Note that this call applies to the module instance only, - not to the module class. - - - **Arguments:** - - - ``module (Instance of nn.Module)``: The module to be - checkpointed. Note that unlike native checkpointing in - PyTorch’s, activation checkpointing in - ``smdistributed.modelparallel`` is at the granularity of a - module. A generic function cannot be passed here. - - ``args``: Tuple containing inputs to the module. - - ``preserve_rng_state (bool, default=True)``: Omit stashing and - restoring the RNG state during each checkpoint. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint_sequential(sequential_module, input, strategy="each", preserve_rng_state=True, pack_args_as_tuple=False) - - - Checkpoints the modules inside - `nn.Sequential `__. - This can be used even if different layers that are part of the - sequential container lie on different partitions. Each layer part - of the sequential module that is checkpointed must lie completely - within one partition. If this is not the case during manual - partitioning, then an error will be thrown. If this is not the - case during auto partitioning, a warning will be raised and this - module will be run without checkpointing. - - - **Arguments** - - - ``sequential_module (nn.Sequential)``: the sequential module to - be checkpointed. - - ``input (torch.Tensor or a tuple of torch.Tensors)``: input to - the module, which can be a tensor or a tuple of tensors. If a - tuple is passed, then pack_args_as_tuple should be set to True. - - ``strategy (string, default=“each”)`` : Strategy determines how - many layers part of the sequential module need to be grouped - together for one checkpointing call. This determines how much - memory can be reduced. It can take the following values - - - ``each`` : The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example, if a sequential consists of - [a, b, c, d] where a,b are on pp_rank0 and c,d are on - pp_rank 1, then this strategy would checkpoint a,b together - and then c,d together. This means effectively, inputs of a, - outputs of b, inputs of c, and outputs of d are in memory; - the reamining activations are recomputed. - - ``group_2, group_3, group_4, etc:`` More generally, - ``group_x`` where x is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups x layers together on a best effort basis. - It can group x layers together if there are x layers - consecutively on the same partition. For example: - [a,b,c,d,e] where a,b are on pp_rank0 and c,d,e are on - pp_rank 1. If the strategy is ``group_3,`` then a,b are - checkpointed together on pp_rank0 and c,d,e are checkpointed - together on pp_rank1. - - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the checkpointed layer takes a tuple as - input, then this needs to be set to True. - -.. 
class:: smdistributed.modelparallel.torch.set_activation_checkpointing(module, preserve_rng_state=True, pack_args_as_tuple=False, strategy="each") - - - This API is recommended when importing pretrained models from - libraries, such as PyTorch and Hugging Face Transformers. This is - particularly useful when you don’t have access to the model - definition code and not be able to replace a module call with - checkpoint. - - - **Arguments**: - - - ``module (Instance of nn.Module or nn.Sequential)``: The module - to checkpoint. - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: *Can only be - passed when module is a sequential module.* To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the layer checkpointed takes a tuple as - input, then this needs to be set to True. - - ``strategy: (string, default=“each”)``: *Can only be passed - when module is a sequential module.* Strategy determines how - many layers part of the sequential module need to be grouped - together for one checkpointing call. - - This determines how much memory can be reduced. It can take the - following values - - - ``each`` : The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example if a sequential consists of - ``[a, b, c, d]`` where ``a, b`` are on ``pp_rank0`` and ``c, d`` are on - ``pp_rank 1``, then this strategy would checkpoint a,b together - and then ``c, d`` together. This means effectively, the inputs of - ``a``, outputs of ``b``, inputs of ``c``, and outputs of ``d`` are in - memory, and the rest of the activations are recomputed. - - ``group_2, group_3, group_4, etc:`` More generally, - ``group_x`` where x is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups x number of layers together on a best - effort basis if there are x layers consecutively in the same - partition. **Example**: Assume a module with layers ``[a, b, - c, d, e]``. The layers a and b are on pp_rank0, and ``c``, ``d``, and - ``e`` are on ``pp_rank 1``. If the strategy is ``group_3,`` then ``a``, - ``b`` are checkpointed together on ``pp_rank0``, and ``c``, ``d``, ``e`` are - checkpointed together on ``pp_rank1``. - -.. _smdmp-tp-appendix: - -Appendix: Reference Implementations for Modules ------------------------------------------------ - -The following are reference implementations for transformer-related -modules. Note that this is not the actual ``smdistributed`` source code, -but the distributed implementations provided in the library are the -distributed versions of these reference implementations, and can be used -to determine whether the distributed modules perform the same operations -as the custom modules in your script. - -To keep the implementations simple, we only assume keyword arguments, -and assume the existence of a method ``parse_args(kwargs)``, which -parses the arguments to ``__init__`` methods and sets the relevant -attributes of the module, such as ``hidden_size`` and -``num_attention_heads``. - -``smdistributed.modelparallel.torch.nn.DistributedTransformer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: python - - class Transformer(nn.Module): - def __init__(self, **kwargs): - super(Transformer, self).__init__() - self.parse_args(kwargs) - - self.layers = [] - for l in range(self.num_layers): - self.layers.append(TransformerLayer(**kwargs)) - - self.seq_layers = nn.Sequential(*self.layers) - - def forward(self, inp): - return self.seq_layers(inp) - -``smdistributed.modelparallel.torch.nn.DistributedTransformerLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class TransformerLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerLayer, self).__init__() - self.parse_args(kwargs) - - self.attention = AttentionLayer(**kwargs) - self.output = TransformerOutputLayer(**kwargs) - - if self.add_cross_attention: - self.cross_attention = AttentionLayer(cross_attention=True, **kwargs) - - def forward(self, inp): - if self.add_cross_attention: - hidden_states, cross_states, attention_mask, cross_mask = inp - else: - hidden_states, attention_mask = inp - - attention_output = self.attention((hidden_states, attention_mask)) - if self.add_cross_attention: - attention_output = self.cross_attention((attention_output, - cross_states, - cross_mask)) - - output = self.output(attention_output) - - if self.add_cross_attention: - return output, cross_states, attention_mask, cross_mask - else: - return output, attention_mask - -``smdistributed.modelparallel.torch.nn.DistributedAttentionLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class AttentionLayer(nn.Module): - def __init__(self, **kwargs): - super(AttentionLayer, self).__init__() - self.parse_args(kwargs) - self.attention_head_size = self.hidden_size // self.num_attention_heads - - self.query = nn.Linear(self.hidden_size, self.hidden_size) - self.key = nn.Linear(self.hidden_size, self.hidden_size) - self.value = nn.Linear(self.hidden_size, self.hidden_size) - self.dense = nn.Linear(self.hidden_size, self.hidden_size) - - self.dropout1 = nn.Dropout(self.attention_dropout_prob) - self.dropout2 = nn.Dropout(self.hidden_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def transpose(self, tensor, key=False): - shape = tensor.size()[:-1] + - (self.num_attention_heads, self.attention_head_size) - tensor = torch.reshape(tensor, shape) - if key: - return tensor.permute(0, 2, 3, 1) - else: - return tensor.permute(0, 2, 1, 3) - - def forward(self, inp): - if self.cross_attention: - hidden_states, cross_states, attention_mask = inp - else: - hidden_states, attention_mask = inp - - if self.pre_layernorm: - norm_states = self.pre_layernorm(hidden_states) - else: - norm_states = hidden_states - - query_layer = self.query(norm_states) - - if self.cross_attention: - key_layer = self.key(cross_states) - value_layer = self.value(cross_states) - else: - key_layer = self.key(norm_states) - value_layer = self.value(norm_states) - - query_layer = self.transpose(query_layer) - key_layer = self.transpose(key_layer, key=True) - value_layer = self.transpose(value_layer) - - attention_scores = torch.matmul(query_layer, key_layer) - attention_scores = attention_scores / math.sqrt(self.attention_head_size) - - if not self.cross_attention and self.causal_mask is not None: - attention_scores = self.apply_causal_mask(attention_scores) - - attention_scores = attention_scores + 
attention_mask - - attention_probs = F.softmax(attention_scores, dim=-1) - attention_probs = self.dropout1(attention_probs) - - context_layer = torch.matmul(attention_probs, value_layer) - context_layer = context_layer.permute(0, 2, 1, 3) - new_context_layer_shape = context_layer.size()[:-2] + \ - (self.local_attention_size,) - context_layer = torch.reshape(context_layer, new_context_layer_shape) - - self_attention = self.dense(context_layer) - self_attention = self.dropout2(self_attention) - - if self.post_layernorm: - return self.layernorm(self_attention + hidden_states) - else: - return self_attention - -``smdistributed.modelparallel.torch.nn.DistributedTransformerOutputLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class TransformerOutputLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerOutputLayer, self).__init__() - self.parse_args(kwargs) - - self.dense1 = nn.Linear(self.hidden_size, self.intermediate_size) - self.dense2 = nn.Linear(self.intermediate_size, self.hidden_size) - - self.dropout = nn.Dropout(self.attention_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def forward(self, inp): - if self.pre_layernorm: - norm_inp = self.pre_layernorm(inp) - else: - norm_inp = inp - - dense1_output = self.dense1(norm_inp) - if self.activation == "gelu": - act_output = F.gelu(dense1_output) - else: - act_output = F.relu(dense1_output) - - dense2_output = self.dense2(act_output) - output = self.dropout(dense2_output) - - if self.post_layernorm: - return self.layernorm(inp + output) - else: - return output diff --git a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 7f21f7a557..0000000000 --- a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,165 +0,0 @@ -TensorFlow API -============== - -To use the TensorFlow-specific APIs for SageMaker distributed model parallism, -you need to add the following import statement at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following APIs in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. 
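As an illustration of this pattern, the following sketch subclasses
``smp.DistributedModel`` and calls the model only inside an
``smp.step``-decorated function. The layer sizes, the loss object, and the
``train_step`` signature are placeholders, not prescribed by the library.

.. code:: python

   import tensorflow as tf
   import smdistributed.modelparallel.tensorflow as smp

   smp.init()

   class MyModel(smp.DistributedModel):
       def __init__(self):
           super().__init__()
           self.dense1 = tf.keras.layers.Dense(256, activation="relu")
           self.dense2 = tf.keras.layers.Dense(10)

       def call(self, x):
           # every operation in call() is subject to partitioning
           return self.dense2(self.dense1(x))

   model = MyModel()
   loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

   @smp.step
   def train_step(images, labels):
       with tf.GradientTape() as tape:
           predictions = model(images)          # allowed: inside smp.step
           loss = loss_obj(labels, predictions)
       grads = tape.gradient(loss, model.trainable_variables)
       return grads, loss                       # reduced across microbatches outside smp.step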
- - **Methods:** - - .. function:: save_model(save_path="/opt/ml/model") - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] - x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - -.. function:: register_post_partition_hook(hook) - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. - - .. code:: python - - @smp.register_post_partition_hook - def test_eager(): - # All statements here will be executed right after partition but before the first forward pass - tf.print("Entered hook through eager context") - -.. class:: smp.CheckpointManager - - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. 
code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. code:: python - - for step, inputs in enumerate(train_ds): -     if step == 0: -         ckpt_manager.restore() -     loss = train_step(inputs) diff --git a/doc/api/training/smp_versions/model-data-parallel.png b/doc/api/training/smp_versions/model-data-parallel.png deleted file mode 100644 index 089b84673a..0000000000 Binary files a/doc/api/training/smp_versions/model-data-parallel.png and /dev/null differ diff --git a/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_common_api.rst deleted file mode 100644 index 8a8e87252e..0000000000 --- a/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,485 +0,0 @@ -.. admonition:: Contents - - - :ref:`communication_api` - - :ref:`mpi_basics` - -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -**Important**: This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - :noindex: - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - :noindex: - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. - - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. 
A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). - - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. - - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. 
Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. 
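For orientation, here is a sketch of how these reductions are typically applied
outside the ``smp.step``-decorated function (TensorFlow shown, since
``accumulate()`` is TensorFlow-only); ``train_step``, ``images``, ``labels``,
``model``, and ``optimizer`` are assumed to be defined elsewhere in the script.

.. code:: python

   # train_step is an smp.step-decorated function that returns (gradients, loss)
   gradients, loss = train_step(images, labels)

   # loss is a StepOutput holding one tensor per microbatch
   mean_loss = loss.reduce_mean()

   # gradients is a list of StepOutput objects, one per trainable variable;
   # accumulate() averages them without holding all microbatch copies at once
   gradients = [g.accumulate() for g in gradients]
   optimizer.apply_gradients(zip(gradients, model.trainable_variables))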
- - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. - Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics -^^^^^^^^^^ - -The library exposes the following basic MPI primitives to its Python API: - -- ``smp.rank()``: The rank of the current process. -- ``smp.size()``: The total number of processes. -- ``smp.mp_rank()``: The rank of the process among the processes that - hold the current model replica. -- ``smp.dp_rank()``: The rank of the process among the processes that - hold different replicas of the same model partition. -- ``smp.dp_size()``: The total number of model replicas. -- ``smp.local_rank()``: The rank among the processes on the current - instance. -- ``smp.local_size()``: The total number of processes on the current - instance. -- ``smp.get_mp_group()``: The list of ranks over which the current - model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different - replicas of the same model partition. - -.. _communication_api: - :noindex: - -Communication API -^^^^^^^^^^^^^^^^^ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. 
- - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. - - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. 
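-
-   For instance, a minimal sketch (assuming the PyTorch import
-   ``import smdistributed.modelparallel.torch as smp`` and that the library has been
-   initialized with ``smp.init()``):
-
-   .. code:: python
-
-      # Wait for every process in the training job.
-      smp.barrier()
-
-      # Wait only for the processes in the mp_group of the current process.
-      smp.barrier(group=smp.MP_GROUP)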
- - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index 3b822d79e9..0000000000 --- a/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,521 +0,0 @@ -.. admonition:: Contents - - - :ref:`pytorch_saving_loading` - - :ref:`pytorch_saving_loading_instructions` - -PyTorch API -=========== - -**Supported versions: 1.6.0** - -This API document assumes you use the following import statements in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smp.DistributedModel``. - - - **Example:** - - .. code:: python - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smp.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smp.step``-decorated - function. - - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. 
code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smp.step``-decorated function. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. - - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. - - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. 
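-
-   For illustration, a minimal sketch that passes a few of the keyword arguments above
-   (``MyModel`` is a placeholder for your own ``torch.nn.Module``; the values shown are
-   examples, not recommendations):
-
-   .. code:: python
-
-      import smdistributed.modelparallel.torch as smp
-
-      model = MyModel()
-      model = smp.DistributedModel(
-          model,
-          trace_device="cpu",                      # trace on CPU if the model does not fit on a single GPU
-          average_grads_across_microbatches=True,  # average, rather than sum, gradients over microbatches
-      )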
- - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smp.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. function:: backward(tensors, grad_tensors) - :noindex: - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - :noindex: - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. function:: local_named_buffers( ) - :noindex: - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - :noindex: - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - :noindex: - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. function:: local_named_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains a key \ ``_smp_is_partial`` to indicate this is a - partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains parameters - for the entire model. It first collects the \ ``local_state_dict``  and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step`` but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. 
function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - -.. class:: smp.DistributedOptimizer - :noindex: - - **Parameters** - - ``optimizer`` - - An optimizer wrapper for saving/loading optimizer states. This wrapper - returns ``optimizer`` with the following methods overridden: - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``s to create a full - ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is only called on all such ranks, it can hang. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - ​ -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (int) - The index of the partition. - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smp.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smp.partition`` contexts are ignored. Any module that is not placed in - any ``smp.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smp.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smp.partition`` context. - - Example: - - .. code:: python - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. function:: smp.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. 
function:: smp.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.is_initialized( ) - :noindex: - - Returns ``True`` if ``smp.init`` has already been called for the - process, and ``False`` otherwise. - -.. function::smp.is_tracing( ) - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smp.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smp.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smp.optimizers.FusedNovoGrad - :noindex: - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smp.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - -.. data:: smp.optimizers.FusedLamb - :noindex: - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smp.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. data:: smp.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smp.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. _pytorch_saving_loading: - :noindex: - -APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smp.save( ) - :noindex: - - Saves an object. This operation is similar to ``torch.save()``, except - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): A saved object. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``):  When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smp.save()``, you must - set ``partial=True``. - - ``pickle_module`` (picklemodule, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol``  (int, default=2): Can be specified to - override the defaultprotocol. - -.. function:: smp.load( ) - :noindex: - - Loads an object saved with ``smp.save()`` from a file. - - Similar to, `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize file). 
- - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. _pytorch_saving_loading_instructions: - :noindex: - -General Instruction For Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The library can save partial or full checkpoints. - -- For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. -- For full checkpoints, the library saves a single checkpoint that contains - entire model parameters. - -When **saving** using ``smp.save()``, each rank only holds its own -parameters. If you want to save the full model, there will be some -communication between the ranks to create the full model. If you save -checkpoints often, you should save partial checkpoints for best -performance. - -When **loading** using ``smp.load()``, the library can load either partial or | -full checkpoints or full checkpoints saved by a non-model-parallel model. If you -want to resume training with a non-model-parallel model or do inference, you need -a full checkpoint. - -The following is an example of how you can save and load a checkpoint: - -.. code:: python - - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...) - - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data racing - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. 
- # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 252c60d16b..0000000000 --- a/doc/api/training/smp_versions/v1.1.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,164 +0,0 @@ -TensorFlow API -============== - -**Supported version: 2.3.1** - -**Important**: This API document assumes you use the following import statement in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following API in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] 
- x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - ​ - -.. class:: smp.CheckpointManager - :noindex: - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - - **Important:** ``smp.CheckpointManager.restore()`` must be called after - the first training step. This is because the first call of the - ``smp.step`` function constructs and partitions the model, which must - take place before the checkpoint restore. Calling it before the first - ``smp.step`` call might result in hangs or unexpected behavior. - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. code:: python - - for step, inputs in enumerate(train_ds): -     if step == 1:                    # NOTE: restore occurs on the second step -         ckpt_manager.restore() -     loss = train_step(inputs) - diff --git a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_common_api.rst deleted file mode 100644 index b4713b2707..0000000000 --- a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,538 +0,0 @@ -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -.. contents:: Table of Contents - :depth: 3 - :local: - -The Library's Core APIs ------------------------ - -This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. 
function:: smp.init( )
-   :noindex:
-
-   Initialize the library. Must be called at the beginning of the training script.
-
-.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs])
-   :noindex:
-
-   A decorator that must be placed over a function that represents a single
-   forward and backward pass (for training use cases), or a single forward
-   pass (for evaluation use cases). Any computation that is defined inside
-   the ``smp.step``-decorated function is executed in a pipelined manner.
-
-   By default, every tensor input to the function is split across its batch
-   dimension into a number of microbatches specified while launching the
-   training job. This behavior can be customized through the arguments to
-   ``smp.step``, described below. The library then orchestrates the execution of
-   each microbatch across all partitions, based on the chosen pipeline
-   type.
-
-   In a typical use case, the forward pass and back-propagation are executed
-   inside an \ ``smp.step``-decorated function, and gradients, loss, and
-   other relevant metrics (such as accuracy) are returned from the
-   ``smp.step``-decorated function.
-
-   Any gradient post-processing operation, such as gradient clipping and
-   allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or
-   ``optimizer.step`` (for PT), should be applied to the gradients returned
-   from the ``smp.step`` function, and not inside the ``smp.step``
-   function. This is because every operation inside ``smp.step`` is
-   executed once per microbatch, so having these operations inside
-   ``smp.step`` can either be inefficient (in the case of allreduce), or
-   lead to wrong results (in the case of ``apply_gradients`` /
-   ``optimizer.step``).
-
-   If the objects returned from the ``smp.step``-decorated function contain
-   ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to
-   ``StepOutput`` objects. A ``StepOutput`` object encapsulates all
-   versions of the tensor across different microbatches
-   (see the ``StepOutput`` entry for more information).
-
-   Each argument to the ``smp.step``-decorated function should be a tensor
-   or an instance of ``list``, ``tuple``, ``dict``, or ``set`` for it to be split
-   across microbatches. If your object does not fall into one of these categories,
-   you can make the library split it by implementing an ``smp_slice`` method.
-
-   Below is an example of how to use it with PyTorch.
-
-   .. code:: python
-
-      class CustomType:
-          def __init__(self, tensor):
-              self.data = tensor
-
-          # The library calls this to slice the object, passing in the total number
-          # of microbatches (num_mb) and the current microbatch index (mb).
-          def smp_slice(self, num_mb, mb, axis):
-              dim_size = list(self.data.size())[axis]
-
-              split_size = dim_size // num_mb
-              sliced_tensor = self.data.narrow(axis, mb * split_size, split_size)
-              return CustomType(sliced_tensor)
-
-      custom_obj = CustomType(torch.ones(4,))
-
-      @smp.step()
-      def step(custom_obj):
-          loss = model(custom_obj)   # model is assumed to be an smp.DistributedModel
-          model.backward(loss)
-          return loss
-
-
-   **Important:** ``smp.step`` splits the batch into microbatches, and
-   executes everything inside the decorated function once per microbatch.
-   This might affect the behavior of batch normalization, any operation
-   that explicitly uses the batch size information, or any other Python
-   code that is expected to run once.
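-
-   For reference, a minimal PyTorch sketch of the typical pattern described above
-   (assuming ``model`` has been wrapped with ``smp.DistributedModel``, ``optimizer``
-   with ``smp.DistributedOptimizer``, and ``train_loader`` yields ``(data, target)``
-   batches):
-
-   .. code:: python
-
-      import torch.nn.functional as F
-      import smdistributed.modelparallel.torch as smp
-
-      @smp.step()
-      def train_step(model, data, target):
-          output = model(data)                  # forward pass, executed once per microbatch
-          loss = F.nll_loss(output, target)
-          model.backward(loss)                  # distributed backward pass
-          return loss
-
-      for data, target in train_loader:
-          optimizer.zero_grad()
-          loss_mb = train_step(model, data, target)   # returns a StepOutput
-          loss = loss_mb.reduce_mean()                # average the loss across microbatches
-          optimizer.step()                            # apply gradients outside smp.step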
- - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. 
The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. - - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. - Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics ----------- - -The library exposes the following basic MPI primitives to its Python API: - -**Global** - -- ``smp.rank()`` : The global rank of the current process. -- ``smp.size()`` : The total number of processes. -- ``smp.get_world_process_group()`` : - ``torch.distributed.ProcessGroup`` that contains all processes. -- ``smp.CommGroup.WORLD``: The communication group corresponding to all processes. -- ``smp.local_rank()``: The rank among the processes on the current instance. -- ``smp.local_size()``: The total number of processes on the current instance. -- ``smp.get_mp_group()``: The list of ranks over which the current model replica is partitioned. 
-- ``smp.get_dp_group()``: The list of ranks that hold different replicas of the same model partition. - -**Tensor Parallelism** - -- ``smp.tp_rank()`` : The rank of the process within its - tensor-parallelism group. -- ``smp.tp_size()`` : The size of the tensor-parallelism group. -- ``smp.get_tp_process_group()`` : Equivalent to - ``torch.distributed.ProcessGroup`` that contains the processes in the - current tensor-parallelism group. -- ``smp.CommGroup.TP_GROUP`` : The communication group corresponding to - the current tensor parallelism group. - -**Pipeline Parallelism** - -- ``smp.pp_rank()`` : The rank of the process within its - pipeline-parallelism group. -- ``smp.pp_size()`` : The size of the pipeline-parallelism group. -- ``smp.get_pp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current pipeline-parallelism group. -- ``smp.CommGroup.PP_GROUP`` : The communication group corresponding to - the current pipeline parallelism group. - -**Reduced-Data Parallelism** - -- ``smp.rdp_rank()`` : The rank of the process within its - reduced-data-parallelism group. -- ``smp.rdp_size()`` : The size of the reduced-data-parallelism group. -- ``smp.get_rdp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current reduced data parallelism - group. -- ``smp.CommGroup.RDP_GROUP`` : The communication group corresponding - to the current reduced data parallelism group. - -**Model Parallelism** - -- ``smp.mp_rank()`` : The rank of the process within its model-parallelism - group. -- ``smp.mp_size()`` : The size of the model-parallelism group. -- ``smp.get_mp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current model-parallelism group. -- ``smp.CommGroup.MP_GROUP`` : The communication group corresponding to - the current model parallelism group. - -**Data Parallelism** - -- ``smp.dp_rank()`` : The rank of the process within its data-parallelism - group. -- ``smp.dp_size()`` : The size of the data-parallelism group. -- ``smp.get_dp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current data-parallelism group. -- ``smp.CommGroup.DP_GROUP`` : The communication group corresponding to - the current data-parallelism group. - -.. _communication_api: - :noindex: - -Communication API ------------------ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. 
data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. - - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. 
- - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. - - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index 7a81e6ddfe..0000000000 --- a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,883 +0,0 @@ -PyTorch API -=========== - -To use the PyTorch-specific APIs for SageMaker distributed model parallism, -import the ``smdistributed.modelparallel.torch`` package at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. contents:: Topics - :depth: 1 - :local: - -smdistributed.modelparallel.torch.DistributedModel -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. class:: smdistributed.modelparallel.torch.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smdistributed.modelparallel.torch.DistributedModel``. - - **Example:** - - .. 
code:: python - - import smdistributed.modelparallel.torch as smp - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smdistributed.modelparallel.torch.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smdistributed.modelparallel.torch.step``-decorated - function. - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smdistributed.modelparallel.torch.step``-decorated function. - - **Using DDP** - - If DDP is enabled with the SageMaker model parallel library, do not not place a PyTorch - ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because - the ``DistributedModel`` wrapper will also handle data parallelism. - - Unlike the original DDP wrapper, when you use ``DistributedModel``, - model parameters and buffers are not immediately broadcast across - processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the - ``smdistributed.modelparallel.torch.step``-decorated function when the partition is done. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. 
- - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. - - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. - - - ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``. - This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper. - Please see: `broadcast_buffer `__. - - - ``gradient_as_bucket_view`` (default: False): To be - used with ``ddp=True``. This parameter is forwarded to the underlying - ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view `__. - - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smdistributed.modelparallel.torch.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. function:: backward(tensors, grad_tensors) - :noindex: - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - :noindex: - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. function:: local_named_buffers( ) - :noindex: - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. 
This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - :noindex: - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - :noindex: - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. function:: local_named_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains the key ``_smp_is_partial``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition or to the entire model. - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains parameters - for the entire model. It first collects the \ ``local_state_dict`` and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is called on only a subset of such ranks, it can hang. - - .. function:: load_state_dict( ) - :noindex: - - Same as ``torch.module.load_state_dict()``, - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smdistributed.modelparallel.torch.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. function:: join( ) - :noindex: - - A context manager to be used in conjunction with an instance of - ``smdistributed.modelparallel.torch.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True``. It uses ``join`` on the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - .. function:: register_comm_hook( state, callable ) - :noindex: - - **Available for PyTorch 1.8.1 only** - Registers a communication hook, which provides a flexible ``callable`` - in which users can specify how gradients are aggregated across multiple - workers. This method is called on the wrapped ``DistributedDataParallel`` instance. 
- - Please note that when you register a comm hook, you have full control of how the gradients are processed. - When using only data parallelism with Torch DDP, you are expected to average grads across data parallel replicas within the hook. - Similarly, when using ``DistributedModel``, you have to average grads across data parallel replicas within the hook. - In addition, you also have to average grads across microbatches within the hook, unless, based on your loss function, you explicitly do not want them averaged. - See ``average_grads_across_microbatches`` for more information about averaging grads across microbatches. - - This is only supported when ``ddp=True`` and ``overlapping_allreduce=True`` (default). - For more information, see: - `register_comm_hook `__ - in the PyTorch documentation. - - **Behavior of** ``smdistributed.modelparallel.torch.DistributedModel`` **with Tensor Parallelism** - - When a model is wrapped by ``smdistributed.modelparallel.torch.DistributedModel``, the library - immediately traverses the modules of the model object, and replaces the - modules that are supported for tensor parallelism with their distributed - counterparts. This replacement happens in place. If there are no other - references to the original modules in the script, they are - garbage-collected. The module attributes that previously referred to the - original submodules now refer to the distributed versions of those - submodules. - - **Example:** - - .. code:: python - - # register DistributedSubmodule as the distributed version of Submodule - # (note this is a hypothetical example, smp.nn.DistributedSubmodule does not exist) - import smdistributed.modelparallel.torch as smp - - smp.tp_register_with_module(Submodule, smp.nn.DistributedSubmodule) - - class MyModule(nn.Module): - def __init__(self): - ... - - self.submodule = Submodule() - ... - - # enabling tensor parallelism for the entire model - with smp.tensor_parallelism(): - model = MyModule() - - # here model.submodule is still a Submodule object - assert isinstance(model.submodule, Submodule) - - model = smp.DistributedModel(model) - - # now model.submodule is replaced with an equivalent instance - # of smp.nn.DistributedSubmodule - assert isinstance(model.module.submodule, smp.nn.DistributedSubmodule) - - If ``pipeline_parallel_degree`` (equivalently, ``partitions``) is 1, the - placement of model partitions into GPUs and the initial broadcast of - model parameters and buffers across data-parallel ranks take place - immediately. This is because it does not need to wait for the model - partition when the ``smdistributed.modelparallel.torch.DistributedModel`` wrapper is called. For other - cases with ``pipeline_parallel_degree`` greater than 1, the broadcast - and device placement will be deferred until the first call of an - ``smdistributed.modelparallel.torch.step``-decorated function. This is because the first - ``smdistributed.modelparallel.torch.step``-decorated function call is when the model partitioning - happens if pipeline parallelism is enabled. - - Because of the module replacement during the ``smdistributed.modelparallel.torch.DistributedModel`` - call, any ``load_state_dict`` calls on the model, as well as any direct - access to model parameters, such as during the optimizer creation, - should be done **after** the ``smdistributed.modelparallel.torch.DistributedModel`` call, as shown in the sketch that follows. 
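To make this ordering concrete, the following is a minimal sketch (the model, loss, and data loader are illustrative placeholders, not part of the library): the model is wrapped first, the optimizer is then created from the wrapped model's parameters, and all ``model(...)`` and ``model.backward(...)`` calls stay inside a ``smdistributed.modelparallel.torch.step``-decorated function.

   .. code:: python

      import torch
      import smdistributed.modelparallel.torch as smp

      smp.init()

      # Placeholder model; any torch.nn.Module works here.
      model = torch.nn.Sequential(
          torch.nn.Linear(784, 256),
          torch.nn.ReLU(),
          torch.nn.Linear(256, 10),
      )
      model = smp.DistributedModel(model)

      # Create the optimizer only after the smp.DistributedModel call so that
      # it sees the (possibly replaced) parameters of the wrapped model.
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
      optimizer = smp.DistributedOptimizer(optimizer)

      @smp.step
      def train_step(model, inputs, targets):
          outputs = model(inputs)                      # __call__ inside smp.step
          loss = torch.nn.functional.cross_entropy(outputs, targets)
          model.backward(loss)                         # replaces loss.backward()
          return loss

      # data_loader is a placeholder for your own input pipeline; device
      # placement and microbatch reduction are omitted for brevity.
      for inputs, targets in data_loader:
          optimizer.zero_grad()
          loss_mb = train_step(model, inputs, targets)
          optimizer.step()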
- - Since the broadcast of the model parameters and buffers happens - immediately during ``smdistributed.modelparallel.torch.DistributedModel`` call when the degree of - pipeline parallelism is 1, using ``@smp.step`` decorators is not - required when tensor parallelism is used by itself (without pipeline - parallelism). - - For more information about the library's tensor parallelism APIs for PyTorch, - see :ref:`smdmp-pytorch-tensor-parallel`. - - **Additional Methods of** ``smdistributed.modelparallel.torch.DistributedModel`` **for Tensor Parallelism** - - The following are the new methods of ``smdistributed.modelparallel.torch.DistributedModel``, in - addition to the ones listed in the - `documentation `__. - - .. function:: distributed_modules() - :noindex: - - - An iterator that runs over the set of distributed - (tensor-parallelized) modules in the model - - .. function:: is_distributed_parameter(param) - :noindex: - - - Returns ``True`` if the given ``nn.Parameter`` is distributed over - tensor-parallel ranks. - - .. function:: is_distributed_buffer(buf) - :noindex: - - - Returns ``True`` if the given buffer is distributed over - tensor-parallel ranks. - - .. function:: is_scaled_batch_parameter(param) - :noindex: - - - Returns ``True`` if the given ``nn.Parameter`` is operates on the - scaled batch (batch over the entire ``TP_GROUP``, and not only the - local batch). - - .. function:: is_scaled_batch_buffer(buf) - :noindex: - - - Returns ``True`` if the parameter corresponding to the given - buffer operates on the scaled batch (batch over the entire - ``TP_GROUP``, and not only the local batch). - - .. function:: default_reducer_named_parameters() - :noindex: - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``DP_GROUP``. - - .. function:: scaled_batch_reducer_named_parameters() - :noindex: - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``RDP_GROUP``. - -smdistributed.modelparallel.torch.DistributedOptimizer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. class:: smdistributed.modelparallel.torch.DistributedOptimizer(optimizer, static_loss_scale=1.0, dynamic_loss_scale=False, **dynamic_loss_args) - :noindex: - - An optimizer wrapper for saving and loading optimizer states. - - :param optimizer: An optimizer object. - :type optimizer: object - :param static_loss_scale: Effective only for FP16 training. The default value is ``1.0``. - :type static_loss_scale: float - :param dynamic_loss_scale: Effective only for FP16 training. Set to ``True`` to - use dynamic loss scale. The default value is ``False``. - :type dynamic_loss_scale: boolean - :param dynamic_loss_args: Effective only for FP16 training. - If ``dynamic_loss_scale=True``, you can configure additional scale - parameters for dynamic loss scale. - The following list shows available parameters. - - * ``"init_scale"``: Default is ``2**32`` - * ``"scale_factor"``: Default is ``2.`` - * ``"scale_window"``: Default is ``1000`` - * ``"min_scale"``: Default is ``1`` - * ``"delayed_shift"``: Default is ``1`` - * ``"consecutive_hysteresis"``: Default is ``False`` - :type dynamic_loss_args: dict - - **Example usage of an FP32 Optimizer:** - - .. code:: python - - optimizer = torch.optim.AdaDelta(...) - optimizer = smdistributed.modelparallel.torch.DistributedOptimizer(optimizer) - - **Example usage of an FP16 Optimizer with static loss scale:** - - .. 
code:: python - - optimizer = torch.optim.AdaDelta(...) - optimizer = smdistributed.modelparallel.torch.DistributedOptimizer( - optimizer, - static_loss_scale=1.0 - ) - - **Example usage of an FP16 Optimizer with dynamic loss scale:** - - .. code:: python - - optimizer = torch.optim.AdaDelta(...) - optimizer = smdistributed.modelparallel.torch.DistributedOptimizer( - optimizer, - static_loss_scale=None, - dynamic_loss_scale=True, - dynamic_loss_args={ - "scale_window": 1000, - "min_scale": 1, - "delayed_shift": 2 - } - ) - - .. tip:: - - After you modify training scripts with - :class:`smdistributed.modelparallel.torch.DistributedModel` and - :class:`smdistributed.modelparallel.torch.DistributedOptimizer`, - use the SageMaker PyTorch estimator's distribution configuration to enable FP16 training. - You simply need to add ``"fp16": True`` to the ``smp_options`` config dictionary's - ``"parameters"`` key as shown in - `Using the SageMaker TensorFlow and PyTorch Estimators - `_. - For more information about available parameters for the ``smp_options`` config, - see :ref:`sm-sdk-modelparallel-general`. - - This wrapper returns an ``optimizer`` object with the following methods overridden: - - .. method:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``\ s to create a full - ``state_dict``. - - .. method:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. method:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - -smdistributed.modelparallel.torch Context Managers and Util Functions -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smdistributed.modelparallel.torch.model_creation(tensor_parallelism=False, dtype=None, **tensor_parallel_config) - :noindex: - - Context manager to create a ``torch`` model. This API combines both the - :class:`smdistributed.modelparallel.torch.tensor_parallelism` and - :class:`smdistributed.modelparallel.torch.delay_param_initialization` decorators, - so you can simply use this single context when creating the torch model. - - :param tensor_parallelism: Whether to enable tensor parallelism during model creation. - :type tensor_parallelism: boolean - :param dtype: The dtype to use when creating the model. It has the following rules. - - * If dtype is specified, it will be used during model creation. - * If dtype is not specified, the default dtype will be used during model creation, - which is usually FP32. This is for the best performance on CPU. - * Any model that causes out-of-memory problems with FP32 initialization - is recommended to be created with - :class:`smdistributed.modelparallel.torch.delayed_parameter_initialization`. - * ``FP16_Module`` casts the model back to FP16 if FP16 training is enabled - with the ``smp`` config. 
For more information about FP16 training - in SageMaker with the model parallel library, see `FP16 Training - `_ - in the *Amazon SageMaker Developer Guide*. - - :type dtype: ``torch.dtype`` - :param tensor_parallel_config: kwargs to specify other tensor parallel configs. - This is not used if ``tensor_parallelism`` is ``False``. - :type tensor_parallel_config: dict - - **Example Usage:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - with smp.model_creation( - tensor_parallelism=smp.tp_size() > 1, - dtype=torch.float16 if args.fp16 else torch.get_default_dtype() - ): - model = MyModel(...) - - .. function:: smdistributed.modelparallel.torch.partition(index) - :noindex: - - :param index: The index of the partition. - :type index: int - - A context manager that places all modules defined inside into the - partition with ID ``index``. The ``index`` argument must be less than - the number of partitions. - - Use ``smdistributed.modelparallel.torch.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smdistributed.modelparallel.torch.partition`` contexts are ignored. Any module that is not placed in - any ``smdistributed.modelparallel.torch.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smdistributed.modelparallel.torch.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smdistributed.modelparallel.torch.partition`` context. - - Example: - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - - .. data:: smdistributed.modelparallel.torch.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn't work with the library. ``smdistributed.modelparallel.torch.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - - .. function:: smdistributed.modelparallel.torch.delay_param_initialization(enabled=True) - :noindex: - - If enabled, it delays the initialization of parameters - to save CPU memory. That is, parameter initialization takes place - after the model is partitioned on GPUs. - - .. function:: smdistributed.modelparallel.torch.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - - .. function:: smdistributed.modelparallel.torch.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - - .. 
function:: smdistributed.modelparallel.torch.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - - .. function:: smdistributed.modelparallel.torch.is_initialized( ) - :noindex: - - Returns ``True`` if ``smdistributed.modelparallel.torch.init`` has already been called for the - process, and ``False`` otherwise. - - .. function:: smdistributed.modelparallel.torch.is_tracing( ) - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - - .. data:: smdistributed.modelparallel.torch.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smdistributed.modelparallel.torch.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - - .. data:: smdistributed.modelparallel.torch.optimizers.FusedNovoGrad - :noindex: - - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smdistributed.modelparallel.torch.optimizers.FusedNovoGrad`` replaces the ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - - .. data:: smdistributed.modelparallel.torch.optimizers.FusedLamb - :noindex: - - - `FusedLamb optimizer `__ - currently doesn't work with the library. ``smdistributed.modelparallel.torch.optimizers.FusedLamb`` replaces the - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - - .. _pytorch_saving_loading: - :noindex: - - smdistributed.modelparallel.torch APIs for Saving and Loading - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - .. function:: smdistributed.modelparallel.torch.save(obj, f, partial=True, pickle_module=picklemodule, pickle_protocol=2) - :noindex: - - Saves an object. This operation is similar to `torch.save() - `_, except that - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): The object to save. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smdistributed.modelparallel.torch.save()``, you must - set ``partial=True``. - - ``pickle_module`` (picklemodule, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol`` (int, default=2): Can be specified to - override the default protocol. - - .. function:: smdistributed.modelparallel.torch.load(f, map_location, pickle_module, pickle_load_args, partial=True) - :noindex: - - Loads an object saved with ``smdistributed.modelparallel.torch.save()`` from a file. - - Similar to `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). 
If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize file). - - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. function:: smdistributed.modelparallel.torch.save_checkpoint(path, tag, partial=True, model=None, optimizer=None, user_content=None, translate_if_full=True, num_kept_partial_checkpoints=None) - :noindex: - - Saves a checkpoint. While :class:`smdistributed.modelparallel.torch.save` saves - model and optimizer objects, - this function checkpoints model and optimizer and saves the checkpoints as separate files. - It creates checkpoint folders in the following structure. - - .. code:: text - - - path - - ${tag}_partial (folder for partial checkpoint) - - model_rankinfo.pt - - optimizer_rankinfo.pt - - fp16_states_rankinfo.pt - - user_content.pt - - $tag (checkpoint file for full checkpoint) - - user_content_$tag (user_content file for full checkpoint) - - newest (a file that indicates the newest checkpoint) - - **Parameters** - - * ``path`` (str) (required): Path to save the checkpoint. The library creates - the directory if it does not already exist. - For example, ``/opt/ml/checkpoint/model_parallel``. - * ``tag`` (str) (required): A tag for the current checkpoint, usually the train - steps. Note: tag needs to be the same across all ranks (GPU workers). - When ``partial=False`` this will be the checkpoint file name. - * ``partial`` (boolean) (default: True): Whether to save the partial checkpoint. - * ``model`` (:class:`smdistributed.modelparallel.torch.DistributedModel`) - (default: None): The model to save. It needs to an ``smp.DistributedModel`` object. - * ``optimizer`` (:class:`smdistributed.modelparallel.torch.DistributedOptimizer`) - (default: None): The optimizer to save. It needs to be an ``smp.DistributedOptimizer`` object. - * ``user_content`` (any) (default: None): User-defined content to save. - * ``translate_if_full`` (boolean) (default: True): Whether to translate the - full ``state_dict`` to HF ``state_dict`` if possible. - * ``num_kept_partial_checkpoints`` (int) (default: None): The maximum number - of partial checkpoints to keep on disk. - -.. function:: smdistributed.modelparallel.torch.resume_from_checkpoint(path, tag=None, partial=True, strict=True, load_optimizer_states=True, translate_function=None) - :noindex: - - While :class:`smdistributed.modelparallel.torch.load` loads saved - model and optimizer objects, this function resumes from a saved checkpoint file. - - **Parameters** - - * ``path`` (str) (required): Path to load the checkpoint. - * ``tag`` (str) (default: None): Tag of the checkpoint to resume. If not provided, - the library tries to locate the newest checkpoint from the saved newest file. - * ``partial`` (boolean) (default: True): Whether to load the partial checkpoint. - * ``strict`` (boolean) (default: True): Load with strict load, no extra key or - missing key is allowed. 
- * ``load_optimizer_states`` (boolean) (default: True): Whether to load ``optimizer_states``. - * ``translate_function`` (function) (default: None): Function to translate the full - checkpoint into smdistributed.modelparallel format. - For supported models, this is not required. - - **Example usage** - - .. code:: python - - # Save - smp.save_checkpoint( - checkpoint_dir, - tag=f"total_steps{total_steps}", - partial=True, - model=model, - optimizer=optimizer, - user_content=user_content, - num_kept_partial_checkpoints=args.num_kept_checkpoints) - - # Load: this will automatically load the newest checkpoint - user_content = smp.resume_from_checkpoint(path, partial=partial) - - .. _pytorch_saving_loading_instructions: - :noindex: - - General instruction on saving and loading - ----------------------------------------- - - The library can save partial or full checkpoints. - - - For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. - - For full checkpoints, the library saves a single checkpoint that contains - the entire model's parameters. - - When **saving** using ``smdistributed.modelparallel.torch.save()``, each rank only holds its own - parameters. If you want to save the full model, there will be some - communication between the ranks to create the full model. If you save - checkpoints often, you should save partial checkpoints for best - performance. - - When **loading** using ``smdistributed.modelparallel.torch.load()``, the library can load either partial or - full checkpoints, or full checkpoints saved by a non-model-parallel model. If you - want to resume training with a non-model-parallel model or do inference, you need - a full checkpoint. - - The following is an example of how you can save and load a checkpoint: - - .. code:: python - - import smdistributed.modelparallel.torch as smp - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...) - - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data races - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. 
- # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_pytorch_tensor_parallel.rst b/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_pytorch_tensor_parallel.rst deleted file mode 100644 index 96231b55fe..0000000000 --- a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_pytorch_tensor_parallel.rst +++ /dev/null @@ -1,903 +0,0 @@ -.. _smdmp-pytorch-tensor-parallel: - :noindex: - -PyTorch API for Tensor Parallelism -================================== - -SageMaker distributed tensor parallelism works by replacing specific submodules -in the model with their distributed implementations. The distributed modules -have their parameters and optimizer states partitioned across tensor-parallel -ranks. This is to compute the same output as it would have been computed by -the original modules. Since tensor parallelism occurs across data-parallel -ranks, a rank might collect slices of the activations corresponding to the -data shards on other devices that are part of the same tensor parallelism group. - -You can enable or disable tensor parallelism for specific parts of the model. -Within the enabled parts, the replacements with distributed modules will take -place on a best-effort basis for those module supported for tensor parallelism. -Alternatively, you can directly import and use the library’s distributed -modules in the model definition. - -Some of the supported modules (such as ``smdistributed.modelparallel.torch.nn.Transformer``) are high-level -blocks that contain many operations. Because custom implementations -(as opposed to the built-in PyTorch modules) are typically used for these -high-level blocks, the library offers an API that you can use to register -specific distributed versions with such custom modules (provided that they -are functionally equivalent). This allows the library to automatically replace -the occurrences of such PyTorch modules with their distributed counterparts -provided by the library. -For more information, see the following topics. - -.. contents:: Topics - :depth: 3 - :local: - -.. _registering-tp-modules: - :noindex: - -Registering Tensor Parallelism Distributed Modules --------------------------------------------------- - -Although PyTorch natively provides some of the commonly used (and -tensor-parallelizable) building blocks such as Transformer, users often -use custom implementations for such higher-level modules. To distribute -such modules with tensor parallelism, you need to register the -distributed modules to the custom module implementation in your class, -so that the library knows how to distribute the custom module. When you -register the distributed modules, make sure the custom module that you -use is functionally equivalent to the distributed module. You can verify -this by taking a look at the equivalent reference implementations in the -:ref:`smdmp-tp-appendix`. 
-These implementations are functionally equivalent to their distributed -versions in ``smdistributed.modelparallel.torch.nn`` module. - -.. class:: smdistributed.modelparallel.torch.tp_register(dist_module, init_hook=None, forward_hook=None, return_hook=None) - :noindex: - - - A decorator class that registers the ``dist_module`` class with - the module class that it is attached to. The hooks can be used to - adapt to different interfaces used with ``__init__`` and - ``forward`` methods. - - **Arguments:** - - - ``dist_module``: A subclass of ``smdistributed.modelparallel.torch.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smdistributed.modelparallel.torch.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. - - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. - - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer - # as the distributed version of MyTransformer - @smp.tp_register(smp.nn.DistributedTransformer, init_hook=init_hook) - class MyTransformer(nn.Module): - def __init__(self, config): - ... - - def forward(self, hidden_states, attention_mask): - ... - -.. function:: smdistributed.modelparallel.torch.tp_register_with_module(module_cls, dist_module, init_hook=None, forward_hook=None, return_hook=None) - :noindex: - - - When you do not have direct access to model definition code, you - can use this API to similarly register a distributed module with - an existing module class. - - - **Arguments:** - - - ``module_cls``: The existing module class that will be - distributed. - - ``dist_module``: A subclass of ``smdistributed.modelparallel.torch.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smdistributed.modelparallel.torch.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. 
Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. - - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. - - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - from somelibrary import MyTransformer - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer as the distributed version of MyTransformer - smp.tp_register_with_module(MyTransformer, - smp.nn.DistributedTransformer, - init_hook=init_hook) - -.. _smdmp-supported-modules-for-tp: - :noindex: - -Supported Modules for Tensor Parallelism ----------------------------------------- - -The following modules are supported for tensor parallelism. - -.. contents:: Topics - :depth: 3 - :local: - -.. _tp-module-api: - :noindex: - -Tensor Parallelism Module APIs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- :class:`smdistributed.modelparallel.torch.nn.DistributedLinear` (implements ``nn.Linear``) -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLMHead` -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedAttentionLayer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerOutputLayer` -- :class:`smdistributed.modelparallel.torch.nn.DistributedEmbedding` - -.. class:: smdistributed.modelparallel.torch.nn.DistributedLinear(in_features, out_features) - :noindex: - - Tensor-parallel implementation of the ``nn.Linear`` class. - Functionally equivalent to an ``nn.Linear`` module with the same - ``in_features`` and ``out_features``. In other words, - ``in_features`` and ``out_features`` are the number of *global* - channels across tensor-parallel ranks. - - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - - **Arguments:** - - - ``in_features``: The total number of input channels for the - linear layer across all tensor-parallel ranks. - - ``out_features``: The total number of output channels for the - linear layer across all tensor-parallel ranks. - -.. 
class:: smdistributed.modelparallel.torch.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - Constructs a distributed transformer model, including embeddings - and a single LM head. A word embedding of size - ``(vocab_size, hidden_size)`` is created, as well as a positional - embedding of size ``(num_positions, hidden_size)``, and the - embeddings are added together. If ``num_token_types`` is larger - than 0, a separate embedding of size - ``(num_token_types, hidden_size)`` is created, and further added - on top. - - - The embeddings are fed through a ``DistributedTransformer``, and - if ``add_lm_head`` is ``True``, the output passes through a single - LM head, which is a linear module without bias whose weight is - tied to the word embeddings. - - See :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` for descriptions of the rest - of the arguments. - - **Methods:** - - - ``forward(self, inputs)`` - - - If ``add_cross_attention`` is ``True``, ``inputs`` must be a - tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, cross_states, cross_states, cross_mask, labels)``. - - Otherwise, ``inputs`` must be a tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, labels)``. - - If ``token_type_ids`` is ``None``, token type embedding will - not be used. - - ``input_ids`` is assumed to be of shape ``[N, S]``, where - ``N`` is the batch size and ``S`` is sequence length. - - ``attention_mask`` is assumed to be a 0-1 tensor of shape - ``[N, S]``, where 1 represents a masked position. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - A sequence of :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer`\ s, whose - number is given by ``num_layers`` argument. For the other - arguments and methods, refer to - :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer`. - - If both ``pre_layernorm`` and ``post_layernorm`` are ``True``, - layer normalization is applied to both the input and the output of - the ``DistributedTransformer``, in addition to the intermediate - attention and transformer-output layers. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - Tensor-parallel implementation of a single transformer layer. - Number of attention heads, hidden size, and intermediate size - refer to the global quantities across all tensor-parallel ranks. 
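As a quick, hypothetical usage sketch (it assumes the library has already been initialized with a tensor-parallel configuration, and the shapes follow the conventions documented under **Arguments** and **Methods** below):

   .. code:: python

      import torch
      import smdistributed.modelparallel.torch as smp

      # Assumes smp.init() has been called with tensor parallelism enabled.
      layer = smp.nn.DistributedTransformerLayer(
          num_attention_heads=32,   # global count across tensor-parallel ranks
          attention_head_size=32,
          hidden_size=1024,         # global hidden dimension
          intermediate_size=4096,
      )

      N, S, H = 8, 128, 1024
      hidden_states = torch.randn(N, S, H)
      attention_mask = torch.zeros(N, 1, 1, S)

      # With add_cross_attention=False, forward() takes and returns a
      # (hidden_states, attention_mask) tuple.
      hidden_states, attention_mask = layer((hidden_states, attention_mask))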
- - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - **Arguments:** - - - ``num_attention_heads``: The total number of attention heads - across tensor-parallel ranks - - ``attention_head_size``: The number of channels of a single - attention head. - - ``hidden_size``: The hidden dimension of the transformer. The - input tensor ``hidden_states`` is assumed to have its last - dimension size equal to ``hidden_size``. - - ``intermediate_size``: The number of output channels in the - first linear transformation of the transformer output layer. - ``DistributedTransformerOutputLayer`` first maps - ``hidden_size`` dimensions of its input tensor into - ``intermediate_size`` dimensions, and then maps it back into - ``hidden_size`` dimensions. - - ``attention_dropout_prob``: The dropout probability applied to - the attention probabilities. - - ``hidden_dropout_prob``: The dropout probability used in - dropout layers other than the one applied to the attention - probabilities. - - ``activation``: Choice of activation function to use at the - output layer. Must be ``"gelu"`` or ``"relu"``. - - ``layernorm_epsilon``: The epsilon added to the denominator of - layer normalization for numerical stability. - - ``initializer_range``: If ``use_normal_initialization`` is - ``True``, the standard deviation of the normal random variable - to initialize the weights with. - - ``use_normal_initialization``: If ``True``, the weights are - initialized with normal distribution with standard deviation - given by ``initializer_range``. Otherwise, default PyTorch - initialization is used. - - ``causal_mask_size``: If ``None``, no causal mask is used on - attentions. Otherwise, should be set to maximum sequence length - to apply a causal mask to the attention scores. This is used, - for instance, in GPT-2. - - ``add_cross_attention``: If ``True``, a cross-attention layer - will be added after the self-attention block. The - cross-attention layer computes the attention keys and values - based on the ``cross_states`` input (instead of - ``hidden_states`` input, as in self-attention. This is used in - the decoder block of encoder-decoder architectures. For - encoder-only architectures that only use self-attention, this - should be kept ``False``. - - ``pre_layernorm``: If ``True``, inserts layer normalization at - the input. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - ``post_layernorm``: If ``True``, inserts layer normalization at - the output. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the transformer - layer. - - - **Arguments:** - - - If ``add_cross_attention=False``, ``inputs`` must be a - tuple ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, where ``N`` is the batch - size, and ``S`` is the sequence length. - - If ``add_cross_attention=True``, ``inputs`` must be a - tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is assumed to be a tensor of - dimensions ``[N, S_1, H]``, where ``N`` is batch size, - ``S_1`` is sequence length, and ``H`` is ``hidden_size``. 
- ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_1]``, where ``N`` is the batch - size, and ``S_1`` is the sequence length, and - ``cross_mask`` is assumed to be a tensor of size - ``[N, 1, 1, S_2]``. Keys and values for the attention - heads in the cross-attention layer (but not the - self-attention layer) are computed using - ``cross_states``, and ``cross_mask`` is applied as the - attention mask in the cross-attention layer (but not the - self-attention layer). - - - **Returns:** - - - If ``add_cross_attention=False``, a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is the output of the transformer, and - ``attention_mask`` is the same the ``attention_mask`` - argument. - - If ``add_cross_attention=True``, a tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is the output of the transformer, - and the next three tensors are the same as the input - arguments. - -.. class:: smdistributed.modelparallel.torch.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True) - :noindex: - - A distributed implementation for the attention block. Includes the - computation of the self- or cross-attention (context layer), - followed by a linear mapping and dropout, which is optionally - followed by the residual-connection and layer normalization. - - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - **Arguments:** - - - See :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` for descriptions of the - arguments. - - ``cross_attention``: If ``True``, it computes the attentions - with respect to the ``cross_states`` tensor of the ``forward`` - method input tuple. (Default: ``False``) - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the attention - layer. - - - **Arguments:** - - - If ``cross_attention=False``, ``inputs`` must be a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, where ``N`` is the - batch size, and ``S`` is the sequence length. - - If ``cross_attention=True``, ``inputs`` must be a tuple - ``(hidden_states, cross_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S_1, H]``, where ``N`` is batch size, ``S_1`` is - sequence length, and ``H`` is ``hidden_size``. - ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_2]``, where ``N`` is the batch - size, and ``S_2`` is the sequence length. Keys and values - for the attention heads are computed using - ``cross_states``. - - - **Returns:** - - - A single tensor that is the output of the attention - layer. - -.. 
class:: smdistributed.modelparallel.torch.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False) - :noindex: - - - Distributed implementation of a single transformer output layer. A - single :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` with - ``add_cross_attention=False`` consists of a single - ``DistributedAttentionLayer`` immediately followed by a single - ``DistributedTransformerOutputLayer``. The latter linearly maps - the last channel of the input tensor from ``hidden_size`` to - ``intermediate_size``, and then maps it back to ``hidden_size``. - - For more information about what's the reference implementation of this module, - see :ref:`smdmp-tp-appendix`. - - - **Arguments:** - - - See :class:`smdistributed.modelparallel.torch.nn.DistributedTransformerLayer` for descriptions of the - arguments. - - ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow - (NaN loss values) for large models with more than 100 billion parameters - when using FP16. (Default: False) - -.. class:: smdistributed.modelparallel.torch.nn.DistributedEmbedding(num_embeddings,embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False,_skip_scatter_and_merge=False,) - :noindex: - - - Distributed implementation of a single Embedding Layer. Currently - only supports splitting across the embedding_dim. - - **Arguments:** - - - See :class:`smdistributed.modelparallel.torch.nn.DistributedEmbedding` for descriptions of the - arguments. - -.. _enabling-tp: - :noindex: - -Enabling Tensor Parallelism -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -There are two ways tensor parallelism can be enabled. - -First, you can use -the distributed module implementations in ``smdistributed.modelparallel.torch.nn`` module directly in -your model definition. See :ref:`smdmp-supported-modules-for-tp` -for a complete list of built-in distributed modules. Here is an example -of how this can be done: - -.. code:: python - - import torch.nn as nn - import smdistributed.modelparallel.torch as smp - - class TransformerModel: - def __init__(self): - self.embedding = nn.Embedding(vocab_size, hidden_size) - - # directly instantiate smp.nn.DistributedTransformer and use it - self.encoder = smp.nn.DistributedTransformer(num_layers, hidden_size, **kwargs) - - self.pooler = nn.Linear(hidden_size, hidden_size) - - def forward(self, hidden_states): - emb_out = self.embedding(hidden_states) - enc_out = self.encoder(emb_out) - return self.pooler(enc_out) - -Second, you can enable tensor parallelism for specific modules or blocks -of code, which will automatically enable tensor parallelism for the -supported modules within that scope. To do this, you can use the -following API: - -.. decorator:: smdistributed.modelparallel.torch.tensor_parallelism(enabled=True, **kwargs) - :noindex: - - - A context manager that enables or disables tensor parallelism for - any supported module that is created inside. If there are nested - contexts, the innermost overrides the rest. If there are - multiple supported modules created within the context, where one - is the submodule of the other, only the outermost module will be - distributed. 
If a supported module shares weights with another - (supported or unsupported) module, or if its hyperparameters do - not support distribution (e.g., not divisible by the tensor - parallelism degree), tensor parallelism will **not** be enabled - for this module even if this API is used. - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - with smp.tensor_parallelism(): - self.m0 = nn.Linear(20, 20) # will be distributed - with smp.tensor_parallelism(enabled=False): - self.m1 = nn.Linear(20, 20) # will not be distributed - - - ``kwargs`` - Keyword arguments that can be used to modify the configurations of - the distributed modules created inside the context. - If a keyword argument provided through it matches any ``__init__`` method arguments - of a ``DistributedModule`` that substitutes a module created inside - the ``smdistributed.modelparallel.torch.tensor_parallelism`` context, this keyword will override - the value defined in the ``init_hook``. - - - (*For v1.7.0 and later*) Through the following additional keyword arguments, - the library supports `NVIDIA Megatron’s fused kernels - `_ - - - ``fused_softmax`` (bool) - Fusion of attention masking and softmax. - By default, it is set to ``True``. You can deactivate it by setting - ``fused_softmax=False`` in the ``smdistributed.modelparallel.torch.tensor_parallelism`` context manager. - - ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation. - By default, it is set to ``False``. You can activate it by setting - ``fused_bias_gelu=True`` in the ``smdistributed.modelparallel.torch.tensor_parallelism`` context manager. - - - -.. function:: smdistributed.modelparallel.torch.set_tensor_parallelism(module, enabled=True, **kwargs) - :noindex: - - - Enables or disables tensor parallelism for the supported - submodules of ``module``. If enabling, the outermost supported - modules will be distributed. If disabling, tensor parallelism will - be disabled for the entire module subtree of ``module``. Unlike - the context manager, this API can be used after the model creation - (but before wrapping with :class:`smdistributed.modelparallel.torch.DistributedModel`), so direct - access to model definition code is not required. If a supported - module shares weights with another (supported or unsupported) - module, or if its hyperparameters do not support distribution - (e.g., not divisible by the tensor parallelism degree), tensor - parallelism will **not** be enabled for this module. - - Keyword arguments ``kwargs`` can be used to modify the - configurations of the distributed modules created inside the - context. If a keyword argument provided here matches any - ``__init__`` method arguments of a :class:`smdistributed.modelparallel.torch.DistributedModel` that - substitutes a module created inside the ``smdistributed.modelparallel.torch.tensor_parallelism`` - context, this keyword will override the value defined in the - ``init_hook``. - - **Example:** - - .. code:: python - - import smdistributed.modelparallel.torch as smp - - model = MyModel() - smp.set_tensor_parallelism(model.encoder, True) - smp.set_tensor_parallelism(model.encoder.embedding, True) - - # outermost supported submodules in model.encoder will be distributed, except for - # model.encoder.embedding - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - -.. 
_activation-checkpointing-api: - :noindex: - -Activation Checkpointing APIs ------------------------------ - -``smdistributed.modelparallel`` provides three APIs to enable -activation checkpointing: one for checkpointing modules, -one for checkpointing sequential modules, and -one for checkpointing pretrained models. - -For a conceptual guide and examples, see -`Activation Checkpointing `_ -in the *SageMaker's Distributed Model Parallel developer guide*. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint(module, *args, preserve_rng_state=True) - :noindex: - - - Checkpoints the module passed. Throws error if, during manual - partitioning, all children of module are not on same rank as the - module itself, i.e. the module tree is split across multiple - partitions. During auto-partitioning, if the module is split - across multiple partitions, then this call is ignored(with a - warning). Note that this call applies to the module instance only, - not to the module class. - - - **Arguments:** - - - ``module (Instance of nn.Module)``: The module to be - checkpointed. Note that unlike native checkpointing in - PyTorch’s, activation checkpointing in - ``smdistributed.modelparallel`` is at the granularity of a - module. A generic function cannot be passed here. - - ``args``: Tuple containing inputs to the module. - - ``preserve_rng_state (bool, default=True)``: Omit stashing and - restoring the RNG state during each checkpoint. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint_sequential(sequential_module, input, strategy="each", preserve_rng_state=True, pack_args_as_tuple=False) - :noindex: - - - Checkpoints the modules inside - `nn.Sequential `__. - This can be used even if different layers that are part of the - sequential container lie on different partitions. Each layer part - of the sequential module that is checkpointed must lie completely - within one partition. If this is not the case during manual - partitioning, then an error will be thrown. If this is not the - case during auto partitioning, a warning will be raised and this - module will be run without checkpointing. - - - **Arguments** - - - ``sequential_module (nn.Sequential)``: the sequential module to - be checkpointed. - - ``input (torch.Tensor or a tuple of torch.Tensors)``: input to - the module, which can be a tensor or a tuple of tensors. If a - tuple is passed, then pack_args_as_tuple should be set to True. - - ``strategy (string, default=“each”)`` : Strategy determines how - many layers part of the sequential module need to be grouped - together for one checkpointing call. This determines how much - memory can be reduced. It can take the following values - - - ``each`` : The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example, if a sequential consists of - [a, b, c, d] where a,b are on pp_rank0 and c,d are on - pp_rank 1, then this strategy would checkpoint a,b together - and then c,d together. This means effectively, inputs of a, - outputs of b, inputs of c, and outputs of d are in memory; - the reamining activations are recomputed. - - ``group_2, group_3, group_4, etc:`` More generally, - ``group_x`` where x is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups x layers together on a best effort basis. - It can group x layers together if there are x layers - consecutively on the same partition. 
For example: - [a,b,c,d,e] where a,b are on pp_rank0 and c,d,e are on - pp_rank 1. If the strategy is ``group_3,`` then a,b are - checkpointed together on pp_rank0 and c,d,e are checkpointed - together on pp_rank1. - - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the checkpointed layer takes a tuple as - input, then this needs to be set to True. - -.. class:: smdistributed.modelparallel.torch.set_activation_checkpointing(module, preserve_rng_state=True, pack_args_as_tuple=False, strategy="each") - :noindex: - - - This API is recommended when importing pretrained models from - libraries, such as PyTorch and Hugging Face Transformers. This is - particularly useful when you don’t have access to the model - definition code and not be able to replace a module call with - checkpoint. - - - **Arguments**: - - - ``module (Instance of nn.Module or nn.Sequential)``: The module - to checkpoint. - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: *Can only be - passed when module is a sequential module.* To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the layer checkpointed takes a tuple as - input, then this needs to be set to True. - - ``strategy: (string, default=“each”)``: *Can only be passed - when module is a sequential module.* Strategy determines how - many layers part of the sequential module need to be grouped - together for one checkpointing call. - - This determines how much memory can be reduced. It can take the - following values - - - ``each`` : The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example if a sequential consists of - ``[a, b, c, d]`` where ``a, b`` are on ``pp_rank0`` and ``c, d`` are on - ``pp_rank 1``, then this strategy would checkpoint a,b together - and then ``c, d`` together. This means effectively, the inputs of - ``a``, outputs of ``b``, inputs of ``c``, and outputs of ``d`` are in - memory, and the rest of the activations are recomputed. - - ``group_2, group_3, group_4, etc:`` More generally, - ``group_x`` where x is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups x number of layers together on a best - effort basis if there are x layers consecutively in the same - partition. **Example**: Assume a module with layers ``[a, b, - c, d, e]``. The layers a and b are on pp_rank0, and ``c``, ``d``, and - ``e`` are on ``pp_rank 1``. If the strategy is ``group_3,`` then ``a``, - ``b`` are checkpointed together on ``pp_rank0``, and ``c``, ``d``, ``e`` are - checkpointed together on ``pp_rank1``. - -.. _smdmp-tp-appendix: - :noindex: - -Appendix: Reference Implementations for Modules ------------------------------------------------ - -The following are reference implementations for transformer-related -modules. 
Note that this is not the actual ``smdistributed`` source code, -but the distributed implementations provided in the library are the -distributed versions of these reference implementations, and can be used -to determine whether the distributed modules perform the same operations -as the custom modules in your script. - -To keep the implementations simple, we only assume keyword arguments, -and assume the existence of a method ``parse_args(kwargs)``, which -parses the arguments to ``__init__`` methods and sets the relevant -attributes of the module, such as ``hidden_size`` and -``num_attention_heads``. - -``smdistributed.modelparallel.torch.nn.DistributedTransformer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class Transformer(nn.Module): - def __init__(self, **kwargs): - super(Transformer, self).__init__() - self.parse_args(kwargs) - - self.layers = [] - for l in range(self.num_layers): - self.layers.append(TransformerLayer(**kwargs)) - - self.seq_layers = nn.Sequential(*self.layers) - - def forward(self, inp): - return self.seq_layers(inp) - -``smdistributed.modelparallel.torch.nn.DistributedTransformerLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class TransformerLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerLayer, self).__init__() - self.parse_args(kwargs) - - self.attention = AttentionLayer(**kwargs) - self.output = TransformerOutputLayer(**kwargs) - - if self.add_cross_attention: - self.cross_attention = AttentionLayer(cross_attention=True, **kwargs) - - def forward(self, inp): - if self.add_cross_attention: - hidden_states, cross_states, attention_mask, cross_mask = inp - else: - hidden_states, attention_mask = inp - - attention_output = self.attention((hidden_states, attention_mask)) - if self.add_cross_attention: - attention_output = self.cross_attention((attention_output, - cross_states, - cross_mask)) - - output = self.output(attention_output) - - if self.add_cross_attention: - return output, cross_states, attention_mask, cross_mask - else: - return output, attention_mask - -``smdistributed.modelparallel.torch.nn.DistributedAttentionLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: python - - class AttentionLayer(nn.Module): - def __init__(self, **kwargs): - super(AttentionLayer, self).__init__() - self.parse_args(kwargs) - self.attention_head_size = self.hidden_size // self.num_attention_heads - - self.query = nn.Linear(self.hidden_size, self.hidden_size) - self.key = nn.Linear(self.hidden_size, self.hidden_size) - self.value = nn.Linear(self.hidden_size, self.hidden_size) - self.dense = nn.Linear(self.hidden_size, self.hidden_size) - - self.dropout1 = nn.Dropout(self.attention_dropout_prob) - self.dropout2 = nn.Dropout(self.hidden_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def transpose(self, tensor, key=False): - shape = tensor.size()[:-1] + - (self.num_attention_heads, self.attention_head_size) - tensor = torch.reshape(tensor, shape) - if key: - return tensor.permute(0, 2, 3, 1) - else: - return tensor.permute(0, 2, 1, 3) - - def forward(self, inp): - if self.cross_attention: - hidden_states, cross_states, attention_mask = inp - else: - hidden_states, attention_mask = inp - - if self.pre_layernorm: - norm_states = self.pre_layernorm(hidden_states) - else: - norm_states = hidden_states - - query_layer = self.query(norm_states) - - if self.cross_attention: - key_layer = self.key(cross_states) - value_layer = self.value(cross_states) - else: - key_layer = self.key(norm_states) - value_layer = self.value(norm_states) - - query_layer = self.transpose(query_layer) - key_layer = self.transpose(key_layer, key=True) - value_layer = self.transpose(value_layer) - - attention_scores = torch.matmul(query_layer, key_layer) - attention_scores = attention_scores / math.sqrt(self.attention_head_size) - - if not self.cross_attention and self.causal_mask is not None: - attention_scores = self.apply_causal_mask(attention_scores) - - attention_scores = attention_scores + attention_mask - - attention_probs = F.softmax(attention_scores, dim=-1) - attention_probs = self.dropout1(attention_probs) - - context_layer = torch.matmul(attention_probs, value_layer) - context_layer = context_layer.permute(0, 2, 1, 3) - new_context_layer_shape = context_layer.size()[:-2] + \ - (self.local_attention_size,) - context_layer = torch.reshape(context_layer, new_context_layer_shape) - - self_attention = self.dense(context_layer) - self_attention = self.dropout2(self_attention) - - if self.post_layernorm: - return self.layernorm(self_attention + hidden_states) - else: - return self_attention - -``smdistributed.modelparallel.torch.nn.DistributedTransformerOutputLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: python - - class TransformerOutputLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerOutputLayer, self).__init__() - self.parse_args(kwargs) - - self.dense1 = nn.Linear(self.hidden_size, self.intermediate_size) - self.dense2 = nn.Linear(self.intermediate_size, self.hidden_size) - - self.dropout = nn.Dropout(self.attention_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def forward(self, inp): - if self.pre_layernorm: - norm_inp = self.pre_layernorm(inp) - else: - norm_inp = inp - - dense1_output = self.dense1(norm_inp) - if self.activation == "gelu": - act_output = F.gelu(dense1_output) - else: - act_output = F.relu(dense1_output) - - dense2_output = self.dense2(act_output) - output = self.dropout(dense2_output) - - if self.post_layernorm: - return self.layernorm(inp + output) - else: - return output diff --git a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 6630371b94..0000000000 --- a/doc/api/training/smp_versions/v1.10.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,171 +0,0 @@ -TensorFlow API -============== - -To use the TensorFlow-specific APIs for SageMaker distributed model parallism, -you need to add the following import statement at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following APIs in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. 
The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] - x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - -.. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. - - .. code:: python - - @smp.register_post_partition_hook - def test_eager(): - # All statements here will be executed right after partition but before the first forward pass - tf.print("Entered hook through eager context") - -.. class:: smp.CheckpointManager - :noindex: - - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. 
code:: python - - for step, inputs in enumerate(train_ds): -     if step == 0: -         ckpt_manager.restore() -     loss = train_step(inputs) diff --git a/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst deleted file mode 100644 index 533611ef5e..0000000000 --- a/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,487 +0,0 @@ -.. admonition:: Contents - - - :ref:`communication_api` - - :ref:`mpi_basics` - -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -**Important**: This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - :noindex: - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - :noindex: - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. - - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). - - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. 
code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. - - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. 
All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. - - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. 
- Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics -^^^^^^^^^^ - -The library exposes the following basic MPI primitives to its Python API: - -- ``smp.rank()``: The rank of the current process. -- ``smp.size()``: The total number of processes. -- ``smp.mp_rank()``: The rank of the process among the processes that - hold the current model replica. -- ``smp.dp_rank()``: The rank of the process among the processes that - hold different replicas of the same model partition. -- ``smp.dp_size()``: The total number of model replicas. -- ``smp.local_rank()``: The rank among the processes on the current - instance. -- ``smp.local_size()``: The total number of processes on the current - instance. -- ``smp.get_mp_group()``: The list of ranks over which the current - model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different - replicas of the same model partition. - -.. _communication_api: - :noindex: - -Communication API -^^^^^^^^^^^^^^^^^ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. - - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. 
- - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. - - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. - - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. 
If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index 7e09d64262..0000000000 --- a/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,553 +0,0 @@ -.. admonition:: Contents - - - :ref:`pytorch_saving_loading` - - :ref:`pytorch_saving_loading_instructions` - -PyTorch API -=========== - -**Supported versions: 1.7.1, 1.6.0** - -This API document assumes you use the following import statements in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smp.DistributedModel``. - - **Example:** - - .. code:: python - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smp.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smp.step``-decorated - function. - - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smp.step``-decorated function. 
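For reference, the following is a minimal sketch of how these pieces typically fit
together in a PyTorch training script. The model, optimizer, loss, and data loader
below are illustrative placeholders, not library requirements; only the ``smp`` calls
already described in this document (``smp.init``, ``smp.DistributedModel``,
``smp.DistributedOptimizer``, ``@smp.step``, ``model.backward``, and
``StepOutput.reduce_mean``) are assumed.

.. code:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F
   import smdistributed.modelparallel.torch as smp

   smp.init()

   # Placeholder model, optimizer, and data loader; replace with your own.
   model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

   model = smp.DistributedModel(model)            # handles model (and data) parallelism
   optimizer = smp.DistributedOptimizer(optimizer)

   @smp.step()
   def train_step(inputs, targets):
       # Executed once per microbatch; model() and model.backward() are inside smp.step.
       outputs = model(inputs)                    # forward pass on the DistributedModel
       loss = F.cross_entropy(outputs, targets)
       model.backward(loss)                       # replaces loss.backward()
       return loss

   for inputs, targets in data_loader:            # data_loader is a placeholder
       optimizer.zero_grad()
       loss_mb = train_step(inputs, targets)      # returns a StepOutput across microbatches
       loss = loss_mb.reduce_mean()               # average the per-microbatch losses
       optimizer.step()                           # optimizer update happens outside smp.step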
   **Using DDP**

   If DDP is enabled, do not place a PyTorch
   ``DistributedDataParallel`` wrapper around the ``DistributedModel``, because
   the ``DistributedModel`` wrapper also handles data parallelism.

   Unlike the original DDP wrapper, when you use ``DistributedModel``,
   model parameters and buffers are not immediately broadcast across
   processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
   ``smp.step``-decorated function, after the model is partitioned.

   **Parameters**

   - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism).

   - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``):
     Whether to perform the tracing step on the GPU or CPU. The tracing step gathers
     information on the order of execution of modules, the shapes of
     intermediate outputs, and execution times, to be used by the
     partitioning algorithm. If ``trace_device`` is set to GPU, accurate
     module execution times can be gathered during tracing for a potentially
     improved partitioning decision. However, if the model is too large to
     fit in a single GPU, then ``trace_device`` should be set to CPU.

   - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``,
     the library profiles the execution time of each module during tracing, and uses
     it in the partitioning decision. This improves the partitioning
     decision, but it might make the tracing slower. It may also introduce
     some degree of non-determinism in partitioning results, because of the
     inherent randomness in module execution times. Must be ``False`` if
     ``trace_device`` is ``"cpu"``.

   - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only
     applicable for hybrid data parallelism/model parallelism use cases (when
     ``ddp`` is set to ``True`` while launching training). The library uses this flag
     to decide whether to perform an overlapping allreduce whenever parameter
     gradients are ready. This overlaps communication with
     computation and can improve performance. If this is set to ``False``,
     allreduce is performed at the end of the step.

   - ``backward_passes_per_step`` (``int``) (default: 1): This is only
     applicable for hybrid data parallelism/model parallelism use cases (when
     ``ddp`` is set to ``True`` in config). This parameter indicates the
     number of backward passes to perform before calling allreduce on DDP.
     This allows accumulating updates over multiple mini-batches before
     reducing and applying them.

   - ``average_grads_across_microbatches`` (``bool``) (default: ``True``):
     Whether or not the computed gradients should be averaged across
     microbatches. If ``False``, the computed gradients are summed across
     microbatches, but not divided by the number of microbatches. In the typical
     use case where the computed loss is averaged over the mini-batch, this
     should be left as ``True``. If you use a loss function that only sums
     the per-sample loss across the batch (and does not divide by the batch size),
     then this must be set to ``False`` for correctness.

   - ``bucket_cap_mb`` (default: 25): ``DistributedDataParallel`` buckets
     parameters into multiple buckets so that gradient reduction of each
     bucket can potentially overlap with backward
     computation. ``bucket_cap_mb`` controls the bucket size in megabytes
     (MB).

   - ``trace_memory_usage`` (default: ``False``): When set to ``True``, the library attempts
     to measure memory usage per module during tracing. If this is disabled,
     memory usage is estimated through the sizes of tensors returned from
     the module.

   - ``broadcast_buffers`` (default: ``True``): Flag to be used with ``ddp=True``.
     This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper.
     See `broadcast_buffer `__.

   - ``gradient_as_bucket_view (PyTorch 1.7.1 only)`` (default: ``False``): To be
     used with ``ddp=True``. This parameter is forwarded to the underlying
     ``DistributedDataParallel`` wrapper. See `gradient_as_bucket_view `__.

   **Properties**

   - ``partitioned``: Is ``True`` if the model is partitioned, ``False``
     otherwise. Initialized to ``False`` when ``DistributedModel`` is first
     created. It becomes ``True`` during the first call
     to the ``smp.step``-decorated function. Once the model is partitioned, the
     local parameters or the local ``state_dict`` can be fetched using the
     following methods.

   **Methods**

   .. function:: backward(tensors, grad_tensors)
      :noindex:

      Triggers a distributed backward
      pass across model partitions. Example usage is provided in the previous
      section. The API is very similar
      to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward.
      The ``retain_grad`` and ``create_graph`` flags are not supported.

   .. function:: local_buffers( )
      :noindex:

      Returns an iterator over buffers for the modules in
      the partitioned model that have been assigned to the current process.

   .. function:: local_named_buffers( )
      :noindex:

      Returns an iterator over buffers for the
      modules in the partitioned model that have been assigned to the current
      process. This yields both the name of the buffer as well as the buffer
      itself.

   .. function:: local_parameters( )
      :noindex:

      Returns an iterator over parameters for the
      modules in the partitioned model that have been assigned to the current
      process.

   .. function:: local_named_parameters( )
      :noindex:

      Returns an iterator over parameters for
      the modules in the partitioned model that have been assigned to the
      current process. This yields both the name of the parameter as well as
      the parameter itself.

   .. function:: local_modules( )
      :noindex:

      Returns an iterator over the modules in the
      partitioned model that have been assigned to the current process.

   .. function:: local_named_modules( )
      :noindex:

      Returns an iterator over the modules in the
      partitioned model that have been assigned to the current process. This
      yields both the name of the module as well as the module itself.

   .. function:: local_state_dict( )
      :noindex:

      Returns the ``state_dict`` that contains the local
      parameters that belong to the current ``mp_rank``. This ``state_dict``
      contains a key ``_smp_is_partial``, which indicates whether the
      ``state_dict`` contains elements corresponding to only the current
      partition or to the entire model.

   .. function:: state_dict( )
      :noindex:

      Returns the ``state_dict`` that contains parameters
      for the entire model. It first collects the ``local_state_dict`` and
      gathers and merges the ``local_state_dict`` from all ``mp_rank``\ s to
      create a full ``state_dict``. Note that this needs to be called on all ranks with
      ``dp_rank()==0`` to ensure the gather happens properly.
      If it is called on only a subset of those ranks, the call can hang.

   ..
function:: load_state_dict( ) - :noindex: - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. function:: join( ) - :noindex: - - **Available for PyTorch 1.7.1 only** - - A context manager to be used in conjunction with an instance of - ``smp.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True`` for - ``smp.DistributedModel``. This will use the join with the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - -.. class:: smp.DistributedOptimizer - :noindex: - - **Parameters** - - ``optimizer`` - - An optimizer wrapper for saving/loading optimizer states. This wrapper - returns ``optimizer`` with the following methods overridden: - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``s to create a full - ``state_dict``. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - ​ -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (int) - The index of the partition. - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smp.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smp.partition`` contexts are ignored. Any module that is not placed in - any ``smp.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smp.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smp.partition`` context. - - Example: - - .. 
code:: python - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. function:: smp.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.is_initialized( ) - :noindex: - - Returns ``True`` if ``smp.init`` has already been called for the - process, and ``False`` otherwise. - -.. function::smp.is_tracing( ) - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smp.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smp.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smp.optimizers.FusedNovoGrad - :noindex: - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smp.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - -.. data:: smp.optimizers.FusedLamb - :noindex: - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smp.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. data:: smp.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smp.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. _pytorch_saving_loading: - :noindex: - -APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smp.save( ) - :noindex: - - Saves an object. This operation is similar to ``torch.save()``, except - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): A saved object. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``):  When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. 
     If you want to be able to load
     and further train a model that you save with ``smp.save()``, you must
     set ``partial=True``.
   - ``pickle_module`` (pickle module, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``):
     A module used for pickling metadata and objects.
   - ``pickle_protocol`` (int, default=2): Can be specified to
     override the default protocol.

.. function:: smp.load( )
   :noindex:

   Loads an object saved with ``smp.save()`` from a file.

   Similar to `torch.load() `__,
   except it has an additional keyword argument, ``partial``, and accepts
   only string type for the argument ``f`` (file). If ``partial=True``,
   then each ``mp_rank`` loads a separate checkpoint file.

   **Parameters**

   - ``f`` (string): A string containing a file name.
   - ``map_location`` (function): A function, a
     `torch.device `__,
     a string, or a dict specifying how to remap storage locations.
   - ``pickle_module`` (pickle module): A module used for unpickling
     metadata and objects (has to match the ``pickle_module`` used to
     serialize the file).
   - ``pickle_load_args`` (Python 3 only): Optional keyword arguments
     passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``.
   - ``partial`` (bool, default= ``True``): When set to ``True``, each
     ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``.
     Should be used when loading a model trained with the library.

.. _pytorch_saving_loading_instructions:
   :noindex:

General Instructions for Saving and Loading
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The library can save partial or full checkpoints.

- For partial checkpoints, each ``mp_rank`` saves its own checkpoint
  file with only the parameters that belong to that rank.
- For full checkpoints, the library saves a single checkpoint that contains
  the entire model's parameters.

When **saving** using ``smp.save()``, each rank only holds its own
parameters. If you want to save the full model, there will be some
communication between the ranks to create the full model. If you save
checkpoints often, you should save partial checkpoints for best
performance.

When **loading** using ``smp.load()``, the library can load partial checkpoints,
full checkpoints, or full checkpoints saved by a non-model-parallel model. If you
want to resume training with a non-model-parallel model or do inference, you need
a full checkpoint.

The following is an example of how you can save and load a checkpoint:

.. code:: python

   # Original model and optimizer
   model = MyModel(...)
   optimizer = MyOpt(...)
- - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data racing - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. - # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index e47d313a4c..0000000000 --- a/doc/api/training/smp_versions/v1.2.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,164 +0,0 @@ -TensorFlow API -============== - -**Supported version: 2.4.1, 2.3.1** - -**Important**: This API document assumes you use the following import statement in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following API in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. 
function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] - x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - ​ - -.. class:: smp.CheckpointManager - :noindex: - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - - **Important:** ``smp.CheckpointManager.restore()`` must be called after - the first training step. This is because the first call of the - ``smp.step`` function constructs and partitions the model, which must - take place before the checkpoint restore. Calling it before the first - ``smp.step`` call might result in hangs or unexpected behavior. - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. 
code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. code:: python - - for step, inputs in enumerate(train_ds): -     if step == 1:                    # NOTE: restore occurs on the second step -         ckpt_manager.restore() -     loss = train_step(inputs) - diff --git a/doc/api/training/smp_versions/v1.3.0/add_smd_version.sh b/doc/api/training/smp_versions/v1.3.0/add_smd_version.sh deleted file mode 100755 index 92d99ca43c..0000000000 --- a/doc/api/training/smp_versions/v1.3.0/add_smd_version.sh +++ /dev/null @@ -1,10 +0,0 @@ -#!/usr/bin/env python -# add_no_index2.py -import fileinput -import sys - -for line in fileinput.input(inplace=True): - if '.. class::' in line or '.. function::' in line or '.. data::' in line or '.. _' in line: - sys.stdout.write(line + ' :noindex:\n') - else: - sys.stdout.write(line) diff --git a/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_common_api.rst deleted file mode 100644 index 625a7fcbf1..0000000000 --- a/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,488 +0,0 @@ -.. admonition:: Contents - - - :ref:`communication_api` - - :ref:`mpi_basics` - -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -**Important**: This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - :noindex: - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - :noindex: - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. - - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. 
This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). - - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. - - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. 
The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. 
function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. - - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. - Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics -^^^^^^^^^^ - -The library exposes the following basic MPI primitives to its Python API: - -- ``smp.rank()``: The rank of the current process. -- ``smp.size()``: The total number of processes. -- ``smp.mp_rank()``: The rank of the process among the processes that - hold the current model replica. -- ``smp.dp_rank()``: The rank of the process among the processes that - hold different replicas of the same model partition. -- ``smp.dp_size()``: The total number of model replicas. -- ``smp.local_rank()``: The rank among the processes on the current - instance. -- ``smp.local_size()``: The total number of processes on the current - instance. -- ``smp.get_mp_group()``: The list of ranks over which the current - model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different - replicas of the same model partition. - - .. _communication_api: - :noindex: - -Communication API -^^^^^^^^^^^^^^^^^ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. 
data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. - - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. 
- - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. - - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index d2fcb95954..0000000000 --- a/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,572 +0,0 @@ -.. admonition:: Contents - - - :ref:`pytorch_saving_loading` - - :ref:`pytorch_saving_loading_instructions` - -PyTorch API -=========== - -**Supported versions: 1.7.1, 1.8.1** - -This API document assumes you use the following import statements in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smp.DistributedModel``. - - **Example:** - - .. code:: python - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smp.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smp.step``-decorated - function. - - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. 
code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smp.step``-decorated function. - - **Using DDP** - - If DDP is enabled, do not not place a PyTorch - ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because - the ``DistributedModel`` wrapper will also handle data parallelism. - - Unlike the original DDP wrapper, when you use ``DistributedModel``, - model parameters and buffers are not immediately broadcast across - processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the - ``smp.step``-decorated function when the partition is done. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. - - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). 
This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. - - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. - - - ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``. - This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper. - Please see: `broadcast_buffer `__. - - - ``gradient_as_bucket_view`` (default: False): To be - used with ``ddp=True``. This parameter is forwarded to the underlying - ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view `__. - - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smp.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. function:: backward(tensors, grad_tensors) - :noindex: - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - :noindex: - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. function:: local_named_buffers( ) - :noindex: - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - :noindex: - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - :noindex: - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. 
function:: local_named_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains a key \ ``_smp_is_partial`` to indicate this is a - partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains parameters - for the entire model. It first collects the \ ``local_state_dict``  and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is only called on all such ranks, it can hang. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. function:: join( ) - :noindex: - - A context manager to be used in conjunction with an instance of - ``smp.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True``. This will use the join with the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - .. function:: register_comm_hook( state, callable ) - :noindex: - - **Available for PyTorch 1.8.1 only** - Registers a communication hook which is an enhancement that provides - a flexible hook ``callable`` to users where they can specify how - gradients are aggregated across multiple workers. This method will be called on the wrapped ``DistributedDataParallel`` instance. - - Please note that when you register a comm hook you have full control of how the gradients are processed. - When using only data parallelism with Torch DDP you are expected to average grads across data parallel replicas within the hook. - Similarly, when using DistributedModel you have to averaging grads across data parallel replicas within the hook. - In addition to that, you also have to average grads across microbatches within the hook unless you explicitly desire to not average based on your loss function. - See ``average_grads_across_microbatches`` for more information about averaging grads across microbatches. - - This is only supported when ``ddp=True`` and ``overlapping_allreduce=True`` (default). 
- For more information, see: - `register_comm_hook `__ - in the PyTorch documentation. - - - -.. class:: smp.DistributedOptimizer - :noindex: - - **Parameters** - - ``optimizer`` - - An optimizer wrapper for saving/loading optimizer states. This wrapper - returns ``optimizer`` with the following methods overridden: - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``s to create a full - ``state_dict``. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - ​ -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (int) - The index of the partition. - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smp.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smp.partition`` contexts are ignored. Any module that is not placed in - any ``smp.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smp.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smp.partition`` context. - - Example: - - .. code:: python - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. function:: smp.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. 
Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.is_initialized( ) - :noindex: - - Returns ``True`` if ``smp.init`` has already been called for the - process, and ``False`` otherwise. - -.. function::smp.is_tracing( ) - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smp.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smp.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smp.optimizers.FusedNovoGrad - :noindex: - - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smp.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - -.. data:: smp.optimizers.FusedLamb - :noindex: - - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smp.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. data:: smp.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smp.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. _pytorch_saving_loading: - :noindex: - -APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smp.save( ) - :noindex: - - Saves an object. This operation is similar to ``torch.save()``, except - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): A saved object. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``):  When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smp.save()``, you must - set ``partial=True``. - - ``pickle_module`` (picklemodule, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol``  (int, default=2): Can be specified to - override the defaultprotocol. - -.. function:: smp.load( ) - :noindex: - - Loads an object saved with ``smp.save()`` from a file. - - Similar to, `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize file). - - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. 
- - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. _pytorch_saving_loading_instructions: - :noindex: - -General Instruction For Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The library can save partial or full checkpoints. - -- For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. -- For full checkpoints, the library saves a single checkpoint that contains - entire model parameters. - -When **saving** using ``smp.save()``, each rank only holds its own -parameters. If you want to save the full model, there will be some -communication between the ranks to create the full model. If you save -checkpoints often, you should save partial checkpoints for best -performance. - -When **loading** using ``smp.load()``, the library can load either partial or | -full checkpoints or full checkpoints saved by a non-model-parallel model. If you -want to resume training with a non-model-parallel model or do inference, you need -a full checkpoint. - -The following is an example of how you can save and load a checkpoint: - -.. code:: python - - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...) - - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data racing - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. 
- # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 8dc0b56b1f..0000000000 --- a/doc/api/training/smp_versions/v1.3.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,172 +0,0 @@ -TensorFlow API -============== - -**Supported version: 2.3.1, 2.4.1** - -**Important**: This API document assumes you use the following import statement in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following API in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] 
- x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - -.. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. - - .. code:: python - - @smp.register_post_partition_hook - def test_eager(): - # All statements here will be executed right after partition but before the first forward pass - tf.print("Entered hook through eager context") - -.. class:: smp.CheckpointManager - :noindex: - - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. code:: python - - for step, inputs in enumerate(train_ds): -     if step == 0: -         ckpt_manager.restore() -     loss = train_step(inputs) diff --git a/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_common_api.rst deleted file mode 100644 index 625a7fcbf1..0000000000 --- a/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,488 +0,0 @@ -.. admonition:: Contents - - - :ref:`communication_api` - - :ref:`mpi_basics` - -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -**Important**: This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. 
code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - :noindex: - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - :noindex: - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. - - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). - - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. 
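-
-   For reference, the following is a minimal, illustrative sketch (shown for
-   PyTorch) of how the pieces described above typically fit together: the
-   forward and backward passes run inside the ``smp.step``-decorated
-   function, while loss reduction and the optimizer update happen outside of
-   it. The loss function, the ``train_loader``, and the variable names are
-   placeholders, and ``model`` and ``optimizer`` are assumed to be already
-   wrapped with ``smp.DistributedModel`` and ``smp.DistributedOptimizer``.
-
-   .. code:: python
-
-      import torch.nn.functional as F
-      import smdistributed.modelparallel.torch as smp
-
-      @smp.step()
-      def train_step(model, data, target):
-          output = model(data)
-          loss = F.nll_loss(output, target)   # placeholder loss function
-          model.backward(loss)                # replaces loss.backward()
-          return loss
-
-      for data, target in train_loader:       # placeholder data loader
-          optimizer.zero_grad()
-          loss_mb = train_step(model, data, target)
-          loss = loss_mb.reduce_mean()         # StepOutput -> loss averaged over microbatches
-          optimizer.step()                     # optimizer update stays outside smp.step
-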
- - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. 
The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. - - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. - Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics -^^^^^^^^^^ - -The library exposes the following basic MPI primitives to its Python API: - -- ``smp.rank()``: The rank of the current process. -- ``smp.size()``: The total number of processes. -- ``smp.mp_rank()``: The rank of the process among the processes that - hold the current model replica. -- ``smp.dp_rank()``: The rank of the process among the processes that - hold different replicas of the same model partition. -- ``smp.dp_size()``: The total number of model replicas. -- ``smp.local_rank()``: The rank among the processes on the current - instance. -- ``smp.local_size()``: The total number of processes on the current - instance. 
-- ``smp.get_mp_group()``: The list of ranks over which the current - model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different - replicas of the same model partition. - - .. _communication_api: - :noindex: - -Communication API -^^^^^^^^^^^^^^^^^ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. - - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. 
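   **Example**

   A minimal sketch (assuming ``mp_size() == 2``; ``result`` is a placeholder
   object, not part of the library):

   .. code:: python

      if smp.mp_rank() == 0:
          # send a picklable object to mp_rank 1 within the current MP_GROUP
          smp.send(result, 1, smp.RankType.MP_RANK)
      elif smp.mp_rank() == 1:
          # receive the object sent by mp_rank 0
          result = smp.recv_from(0, rank_type=smp.RankType.MP_RANK)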
- - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. - - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. - - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index d2fcb95954..0000000000 --- a/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,572 +0,0 @@ -.. 
admonition:: Contents - - - :ref:`pytorch_saving_loading` - - :ref:`pytorch_saving_loading_instructions` - -PyTorch API -=========== - -**Supported versions: 1.7.1, 1.8.1** - -This API document assumes you use the following import statements in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smp.DistributedModel``. - - **Example:** - - .. code:: python - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smp.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smp.step``-decorated - function. - - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smp.step``-decorated function. - - **Using DDP** - - If DDP is enabled, do not not place a PyTorch - ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because - the ``DistributedModel`` wrapper will also handle data parallelism. - - Unlike the original DDP wrapper, when you use ``DistributedModel``, - model parameters and buffers are not immediately broadcast across - processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the - ``smp.step``-decorated function when the partition is done. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. 
If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. - - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. - - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. - - - ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``. - This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper. - Please see: `broadcast_buffer `__. - - - ``gradient_as_bucket_view`` (default: False): To be - used with ``ddp=True``. This parameter is forwarded to the underlying - ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view `__. - - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smp.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. 
function:: backward(tensors, grad_tensors) - :noindex: - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - :noindex: - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. function:: local_named_buffers( ) - :noindex: - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - :noindex: - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - :noindex: - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. function:: local_named_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains a key \ ``_smp_is_partial`` to indicate this is a - partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains parameters - for the entire model. It first collects the \ ``local_state_dict``  and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is only called on all such ranks, it can hang. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. 
function:: join( ) - :noindex: - - A context manager to be used in conjunction with an instance of - ``smp.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True``. This will use the join with the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - .. function:: register_comm_hook( state, callable ) - :noindex: - - **Available for PyTorch 1.8.1 only** - Registers a communication hook which is an enhancement that provides - a flexible hook ``callable`` to users where they can specify how - gradients are aggregated across multiple workers. This method will be called on the wrapped ``DistributedDataParallel`` instance. - - Please note that when you register a comm hook you have full control of how the gradients are processed. - When using only data parallelism with Torch DDP you are expected to average grads across data parallel replicas within the hook. - Similarly, when using DistributedModel you have to averaging grads across data parallel replicas within the hook. - In addition to that, you also have to average grads across microbatches within the hook unless you explicitly desire to not average based on your loss function. - See ``average_grads_across_microbatches`` for more information about averaging grads across microbatches. - - This is only supported when ``ddp=True`` and ``overlapping_allreduce=True`` (default). - For more information, see: - `register_comm_hook `__ - in the PyTorch documentation. - - - -.. class:: smp.DistributedOptimizer - :noindex: - - **Parameters** - - ``optimizer`` - - An optimizer wrapper for saving/loading optimizer states. This wrapper - returns ``optimizer`` with the following methods overridden: - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``s to create a full - ``state_dict``. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - ​ -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (int) - The index of the partition. - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smp.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smp.partition`` contexts are ignored. Any module that is not placed in - any ``smp.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smp.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). 
In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smp.partition`` context. - - Example: - - .. code:: python - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. function:: smp.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.is_initialized( ) - :noindex: - - Returns ``True`` if ``smp.init`` has already been called for the - process, and ``False`` otherwise. - -.. function::smp.is_tracing( ) - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smp.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smp.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smp.optimizers.FusedNovoGrad - :noindex: - - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smp.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - -.. data:: smp.optimizers.FusedLamb - :noindex: - - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smp.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. data:: smp.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smp.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. _pytorch_saving_loading: - :noindex: - -APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smp.save( ) - :noindex: - - Saves an object. This operation is similar to ``torch.save()``, except - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): A saved object. - - ``f`` (str): A string containing a file name. 
- - ``partial`` (bool, default= ``True``):  When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smp.save()``, you must - set ``partial=True``. - - ``pickle_module`` (picklemodule, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol``  (int, default=2): Can be specified to - override the defaultprotocol. - -.. function:: smp.load( ) - :noindex: - - Loads an object saved with ``smp.save()`` from a file. - - Similar to, `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize file). - - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. _pytorch_saving_loading_instructions: - :noindex: - -General Instruction For Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The library can save partial or full checkpoints. - -- For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. -- For full checkpoints, the library saves a single checkpoint that contains - entire model parameters. - -When **saving** using ``smp.save()``, each rank only holds its own -parameters. If you want to save the full model, there will be some -communication between the ranks to create the full model. If you save -checkpoints often, you should save partial checkpoints for best -performance. - -When **loading** using ``smp.load()``, the library can load either partial or | -full checkpoints or full checkpoints saved by a non-model-parallel model. If you -want to resume training with a non-model-parallel model or do inference, you need -a full checkpoint. - -The following is an example of how you can save and load a checkpoint: - -.. code:: python - - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...) 
- - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data racing - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. - # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 131fc327ac..0000000000 --- a/doc/api/training/smp_versions/v1.4.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,172 +0,0 @@ -TensorFlow API -============== - -**Supported version: 2.3.1, 2.4.1, 2.5.0** - -**Important**: This API document assumes you use the following import statement in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following API in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. 
function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] - x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - -.. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. - - .. code:: python - - @smp.register_post_partition_hook - def test_eager(): - # All statements here will be executed right after partition but before the first forward pass - tf.print("Entered hook through eager context") - -.. class:: smp.CheckpointManager - :noindex: - - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. 
- Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. code:: python - - for step, inputs in enumerate(train_ds): -     if step == 0: -         ckpt_manager.restore() -     loss = train_step(inputs) diff --git a/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_common_api.rst deleted file mode 100644 index 625a7fcbf1..0000000000 --- a/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,488 +0,0 @@ -.. admonition:: Contents - - - :ref:`communication_api` - - :ref:`mpi_basics` - -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -**Important**: This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - :noindex: - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - :noindex: - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. - - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). 
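   As a rough illustration, the following is a minimal PyTorch sketch of this
   flow (assuming ``model`` has already been wrapped with ``smp.DistributedModel``
   and ``optimizer`` with ``smp.DistributedOptimizer``; ``data`` and ``target``
   are placeholder tensors, not part of the library):

   .. code:: python

      import torch.nn.functional as F
      import smdistributed.modelparallel.torch as smp

      @smp.step()
      def train_step(model, data, target):
          output = model(data)             # forward pass on the DistributedModel
          loss = F.nll_loss(output, target)
          model.backward(loss)             # distributed backward pass
          return loss                      # becomes a StepOutput outside smp.step

      loss_mb = train_step(model, data, target)   # StepOutput with one loss per microbatch
      loss = loss_mb.reduce_mean()                # average across microbatches
      optimizer.step()                            # apply gradients outside smp.step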
- - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. - - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. 
Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. 
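   For example, gradients returned from an ``smp.step``-decorated function can be
   combined across microbatches with ``accumulate`` before they are applied
   (a hedged sketch; ``get_grads``, ``images``, ``labels``, ``model``, and
   ``optimizer`` are placeholder names for an ``smp.step``-decorated function and
   Keras objects defined elsewhere, not part of the library):

   .. code:: python

      # get_grads is assumed to be an smp.step-decorated function that returns
      # (gradients, loss); both come back wrapped as StepOutput objects.
      gradients, loss = get_grads(images, labels)
      gradients = [g.accumulate() for g in gradients]   # combine each gradient across microbatches
      optimizer.apply_gradients(zip(gradients, model.trainable_variables))
      loss = loss.reduce_mean()                         # average the loss across microbatches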
- - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. - Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics -^^^^^^^^^^ - -The library exposes the following basic MPI primitives to its Python API: - -- ``smp.rank()``: The rank of the current process. -- ``smp.size()``: The total number of processes. -- ``smp.mp_rank()``: The rank of the process among the processes that - hold the current model replica. -- ``smp.dp_rank()``: The rank of the process among the processes that - hold different replicas of the same model partition. -- ``smp.dp_size()``: The total number of model replicas. -- ``smp.local_rank()``: The rank among the processes on the current - instance. -- ``smp.local_size()``: The total number of processes on the current - instance. -- ``smp.get_mp_group()``: The list of ranks over which the current - model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different - replicas of the same model partition. - - .. _communication_api: - :noindex: - -Communication API -^^^^^^^^^^^^^^^^^ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. 
- - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. - - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. 
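   For instance, a barrier can be used to make sure every process has finished a
   piece of work before one rank aggregates the results (a minimal sketch;
   ``write_local_shard`` and ``merge_shards`` are hypothetical helpers, not part
   of the library):

   .. code:: python

      write_local_shard()     # hypothetical: each process writes its own output
      smp.barrier()           # wait for every process in smp.WORLD
      if smp.rank() == 0:
          merge_shards()      # hypothetical: rank 0 aggregates after all writes finish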
- - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index d2fcb95954..0000000000 --- a/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,572 +0,0 @@ -.. admonition:: Contents - - - :ref:`pytorch_saving_loading` - - :ref:`pytorch_saving_loading_instructions` - -PyTorch API -=========== - -**Supported versions: 1.7.1, 1.8.1** - -This API document assumes you use the following import statements in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smp.DistributedModel``. - - **Example:** - - .. code:: python - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smp.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smp.step``-decorated - function. - - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. 
code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smp.step``-decorated function. - - **Using DDP** - - If DDP is enabled, do not not place a PyTorch - ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because - the ``DistributedModel`` wrapper will also handle data parallelism. - - Unlike the original DDP wrapper, when you use ``DistributedModel``, - model parameters and buffers are not immediately broadcast across - processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the - ``smp.step``-decorated function when the partition is done. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. - - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. 
- - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. - - - ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``. - This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper. - Please see: `broadcast_buffer `__. - - - ``gradient_as_bucket_view`` (default: False): To be - used with ``ddp=True``. This parameter is forwarded to the underlying - ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view `__. - - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smp.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. function:: backward(tensors, grad_tensors) - :noindex: - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - :noindex: - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. function:: local_named_buffers( ) - :noindex: - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - :noindex: - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - :noindex: - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. function:: local_named_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains a key \ ``_smp_is_partial`` to indicate this is a - partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains parameters - for the entire model. 
It first collects the \ ``local_state_dict``  and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is only called on all such ranks, it can hang. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. function:: join( ) - :noindex: - - A context manager to be used in conjunction with an instance of - ``smp.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True``. This will use the join with the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - .. function:: register_comm_hook( state, callable ) - :noindex: - - **Available for PyTorch 1.8.1 only** - Registers a communication hook which is an enhancement that provides - a flexible hook ``callable`` to users where they can specify how - gradients are aggregated across multiple workers. This method will be called on the wrapped ``DistributedDataParallel`` instance. - - Please note that when you register a comm hook you have full control of how the gradients are processed. - When using only data parallelism with Torch DDP you are expected to average grads across data parallel replicas within the hook. - Similarly, when using DistributedModel you have to averaging grads across data parallel replicas within the hook. - In addition to that, you also have to average grads across microbatches within the hook unless you explicitly desire to not average based on your loss function. - See ``average_grads_across_microbatches`` for more information about averaging grads across microbatches. - - This is only supported when ``ddp=True`` and ``overlapping_allreduce=True`` (default). - For more information, see: - `register_comm_hook `__ - in the PyTorch documentation. - - - -.. class:: smp.DistributedOptimizer - :noindex: - - **Parameters** - - ``optimizer`` - - An optimizer wrapper for saving/loading optimizer states. This wrapper - returns ``optimizer`` with the following methods overridden: - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``s to create a full - ``state_dict``. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. 
- - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - ​ -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (int) - The index of the partition. - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smp.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smp.partition`` contexts are ignored. Any module that is not placed in - any ``smp.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smp.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smp.partition`` context. - - Example: - - .. code:: python - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. function:: smp.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.is_initialized( ) - :noindex: - - Returns ``True`` if ``smp.init`` has already been called for the - process, and ``False`` otherwise. - -.. function::smp.is_tracing( ) - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smp.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smp.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smp.optimizers.FusedNovoGrad - :noindex: - - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. 
``smp.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - -.. data:: smp.optimizers.FusedLamb - :noindex: - - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smp.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. data:: smp.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smp.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. _pytorch_saving_loading: - :noindex: - -APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smp.save( ) - :noindex: - - Saves an object. This operation is similar to ``torch.save()``, except - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): A saved object. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``):  When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smp.save()``, you must - set ``partial=True``. - - ``pickle_module`` (picklemodule, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol``  (int, default=2): Can be specified to - override the defaultprotocol. - -.. function:: smp.load( ) - :noindex: - - Loads an object saved with ``smp.save()`` from a file. - - Similar to, `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize file). - - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. _pytorch_saving_loading_instructions: - :noindex: - -General Instruction For Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The library can save partial or full checkpoints. - -- For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. -- For full checkpoints, the library saves a single checkpoint that contains - entire model parameters. - -When **saving** using ``smp.save()``, each rank only holds its own -parameters. If you want to save the full model, there will be some -communication between the ranks to create the full model. 
If you save -checkpoints often, you should save partial checkpoints for best -performance. - -When **loading** using ``smp.load()``, the library can load either partial or | -full checkpoints or full checkpoints saved by a non-model-parallel model. If you -want to resume training with a non-model-parallel model or do inference, you need -a full checkpoint. - -The following is an example of how you can save and load a checkpoint: - -.. code:: python - - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...) - - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data racing - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. - # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 131fc327ac..0000000000 --- a/doc/api/training/smp_versions/v1.5.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,172 +0,0 @@ -TensorFlow API -============== - -**Supported version: 2.3.1, 2.4.1, 2.5.0** - -**Important**: This API document assumes you use the following import statement in your training scripts. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following API in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. 
Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] - x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - -.. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. - - .. code:: python - - @smp.register_post_partition_hook - def test_eager(): - # All statements here will be executed right after partition but before the first forward pass - tf.print("Entered hook through eager context") - -.. class:: smp.CheckpointManager - :noindex: - - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. 
A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. code:: python - - for step, inputs in enumerate(train_ds): -     if step == 0: -         ckpt_manager.restore() -     loss = train_step(inputs) diff --git a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_common_api.rst deleted file mode 100644 index b4713b2707..0000000000 --- a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,538 +0,0 @@ -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -.. contents:: Table of Contents - :depth: 3 - :local: - -The Library's Core APIs ------------------------ - -This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - :noindex: - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - :noindex: - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. 
- - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). - - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. - - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. 
Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. 
function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. - - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. - Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics ----------- - -The library exposes the following basic MPI primitives to its Python API: - -**Global** - -- ``smp.rank()`` : The global rank of the current process. -- ``smp.size()`` : The total number of processes. -- ``smp.get_world_process_group()`` : - ``torch.distributed.ProcessGroup`` that contains all processes. -- ``smp.CommGroup.WORLD``: The communication group corresponding to all processes. -- ``smp.local_rank()``: The rank among the processes on the current instance. -- ``smp.local_size()``: The total number of processes on the current instance. -- ``smp.get_mp_group()``: The list of ranks over which the current model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different replicas of the same model partition. - -**Tensor Parallelism** - -- ``smp.tp_rank()`` : The rank of the process within its - tensor-parallelism group. -- ``smp.tp_size()`` : The size of the tensor-parallelism group. -- ``smp.get_tp_process_group()`` : Equivalent to - ``torch.distributed.ProcessGroup`` that contains the processes in the - current tensor-parallelism group. -- ``smp.CommGroup.TP_GROUP`` : The communication group corresponding to - the current tensor parallelism group. - -**Pipeline Parallelism** - -- ``smp.pp_rank()`` : The rank of the process within its - pipeline-parallelism group. -- ``smp.pp_size()`` : The size of the pipeline-parallelism group. -- ``smp.get_pp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current pipeline-parallelism group. -- ``smp.CommGroup.PP_GROUP`` : The communication group corresponding to - the current pipeline parallelism group. - -**Reduced-Data Parallelism** - -- ``smp.rdp_rank()`` : The rank of the process within its - reduced-data-parallelism group. -- ``smp.rdp_size()`` : The size of the reduced-data-parallelism group. 
-- ``smp.get_rdp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current reduced data parallelism - group. -- ``smp.CommGroup.RDP_GROUP`` : The communication group corresponding - to the current reduced data parallelism group. - -**Model Parallelism** - -- ``smp.mp_rank()`` : The rank of the process within its model-parallelism - group. -- ``smp.mp_size()`` : The size of the model-parallelism group. -- ``smp.get_mp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current model-parallelism group. -- ``smp.CommGroup.MP_GROUP`` : The communication group corresponding to - the current model parallelism group. - -**Data Parallelism** - -- ``smp.dp_rank()`` : The rank of the process within its data-parallelism - group. -- ``smp.dp_size()`` : The size of the data-parallelism group. -- ``smp.get_dp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current data-parallelism group. -- ``smp.CommGroup.DP_GROUP`` : The communication group corresponding to - the current data-parallelism group. - -.. _communication_api: - :noindex: - -Communication API ------------------ - -The library provides a few communication primitives which can be helpful while -developing the training script. These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. - - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. 
code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. - - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that hangs until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. - - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. 
function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index e549559b6b..0000000000 --- a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,678 +0,0 @@ -PyTorch API -=========== - -To use the PyTorch-specific APIs for SageMaker distributed model parallism, -you need to add the following import statement at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smp.DistributedModel``. - - **Example:** - - .. code:: python - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smp.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smp.step``-decorated - function. - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smp.step``-decorated function. - - **Using DDP** - - If DDP is enabled with the SageMaker model parallel library, do not not place a PyTorch - ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because - the ``DistributedModel`` wrapper will also handle data parallelism. 
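-
-   As a rough sketch (``MyModel`` is a hypothetical placeholder module used
-   only for illustration), the wrapping looks as follows when DDP is enabled
-   through the library:
-
-   .. code:: python
-
-      import smdistributed.modelparallel.torch as smp
-
-      # A single smp.DistributedModel wrapper handles both model parallelism
-      # and data parallelism when ddp=True.
-      model = smp.DistributedModel(MyModel())
-
-      # Do not additionally wrap the result in
-      # torch.nn.parallel.DistributedDataParallel.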
- - Unlike the original DDP wrapper, when you use ``DistributedModel``, - model parameters and buffers are not immediately broadcast across - processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the - ``smp.step``-decorated function when the partition is done. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. - - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. - - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. - - - ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``. 
- This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper. - Please see: `broadcast_buffer `__. - - - ``gradient_as_bucket_view`` (default: False): To be - used with ``ddp=True``. This parameter is forwarded to the underlying - ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view `__. - - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smp.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. function:: backward(tensors, grad_tensors) - :noindex: - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - :noindex: - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. function:: local_named_buffers( ) - :noindex: - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - :noindex: - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - :noindex: - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. function:: local_named_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains a key \ ``_smp_is_partial`` to indicate this is a - partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains parameters - for the entire model. It first collects the \ ``local_state_dict``  and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is only called on all such ranks, it can hang. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. 
function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. function:: join( ) - :noindex: - - A context manager to be used in conjunction with an instance of - ``smp.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True``. This will use the join with the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - .. function:: register_comm_hook( state, callable ) - :noindex: - - **Available for PyTorch 1.8.1 only** - Registers a communication hook which is an enhancement that provides - a flexible hook ``callable`` to users where they can specify how - gradients are aggregated across multiple workers. This method will be called on the wrapped ``DistributedDataParallel`` instance. - - Please note that when you register a comm hook you have full control of how the gradients are processed. - When using only data parallelism with Torch DDP you are expected to average grads across data parallel replicas within the hook. - Similarly, when using DistributedModel you have to averaging grads across data parallel replicas within the hook. - In addition to that, you also have to average grads across microbatches within the hook unless you explicitly desire to not average based on your loss function. - See ``average_grads_across_microbatches`` for more information about averaging grads across microbatches. - - This is only supported when ``ddp=True`` and ``overlapping_allreduce=True`` (default). - For more information, see: - `register_comm_hook `__ - in the PyTorch documentation. - - **Behavior of** ``smp.DistributedModel`` **with Tensor Parallelism** - - When a model is wrapped by ``smp.DistributedModel``, the library - immediately traverses the modules of the model object, and replaces the - modules that are supported for tensor parallelism with their distributed - counterparts. This replacement happens in place. If there are no other - references to the original modules in the script, they are - garbage-collected. The module attributes that previously referred to the - original submodules now refer to the distributed versions of those - submodules. - - **Example:** - - .. code:: python - - # register DistributedSubmodule as the distributed version of Submodule - # (note this is a hypothetical example, smp.nn.DistributedSubmodule does not exist) - smp.tp_register_with_module(Submodule, smp.nn.DistributedSubmodule) - - class MyModule(nn.Module): - def __init__(self): - ... - - self.submodule = Submodule() - ... 
- - # enabling tensor parallelism for the entire model - with smp.tensor_parallelism(): - model = MyModule() - - # here model.submodule is still a Submodule object - assert isinstance(model.submodule, Submodule) - - model = smp.DistributedModel(model) - - # now model.submodule is replaced with an equivalent instance - # of smp.nn.DistributedSubmodule - assert isinstance(model.module.submodule, smp.nn.DistributedSubmodule) - - If ``pipeline_parallel_degree`` (equivalently, ``partitions``) is 1, the - placement of model partitions into GPUs and the initial broadcast of - model parameters and buffers across data-parallel ranks take place - immediately. This is because it does not need to wait for the model - partition when ``smp.DistributedModel`` wrapper is called. For other - cases with ``pipeline_parallel_degree`` greater than 1, the broadcast - and device placement will be deferred until the first call of an - ``smp.step``-decorated function happens. This is because the first - ``smp.step``-decorated function call is when the model partitioning - happens if pipeline parallelism is enabled. - - Because of the module replacement during the ``smp.DistributedModel`` - call, any ``load_state_dict`` calls on the model, as well as any direct - access to model parameters, such as during the optimizer creation, - should be done **after** the ``smp.DistributedModel`` call. - - Since the broadcast of the model parameters and buffers happens - immediately during ``smp.DistributedModel`` call when the degree of - pipeline parallelism is 1, using ``@smp.step`` decorators is not - required when tensor parallelism is used by itself (without pipeline - parallelism). - - For more information about the library's tensor parallelism APIs for PyTorch, - see :ref:`smdmp-pytorch-tensor-parallel`. - - **Additional Methods of** ``smp.DistributedModel`` **for Tensor Parallelism** - - The following are the new methods of ``smp.DistributedModel``, in - addition to the ones listed in the - `documentation `__. - - .. function:: distributed_modules() - :noindex: - - - An iterator that runs over the set of distributed - (tensor-parallelized) modules in the model - - .. function:: is_distributed_parameter(param) - :noindex: - - - Returns ``True`` if the given ``nn.Parameter`` is distributed over - tensor-parallel ranks. - - .. function:: is_distributed_buffer(buf) - :noindex: - - - Returns ``True`` if the given buffer is distributed over - tensor-parallel ranks. - - .. function:: is_scaled_batch_parameter(param) - :noindex: - - - Returns ``True`` if the given ``nn.Parameter`` is operates on the - scaled batch (batch over the entire ``TP_GROUP``, and not only the - local batch). - - .. function:: is_scaled_batch_buffer(buf) - :noindex: - - - Returns ``True`` if the parameter corresponding to the given - buffer operates on the scaled batch (batch over the entire - ``TP_GROUP``, and not only the local batch). - - .. function:: default_reducer_named_parameters() - :noindex: - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``DP_GROUP``. - - .. function:: scaled_batch_reducer_named_parameters() - :noindex: - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``RDP_GROUP``. - - - -.. class:: smp.DistributedOptimizer - :noindex: - - **Parameters** - - ``optimizer`` - - An optimizer wrapper for saving/loading optimizer states. This wrapper - returns ``optimizer`` with the following methods overridden: - - .. 
function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``s to create a full - ``state_dict``. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - ​ -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (int) - The index of the partition. - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smp.partition`` to implement manual partitioning. - If ``"auto_partition"`` is ``True``, then the - ``smp.partition`` contexts are ignored. Any module that is not placed in - any ``smp.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smp.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smp.partition`` context. - - Example: - - .. code:: python - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. function:: smp.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.is_initialized( ) - :noindex: - - Returns ``True`` if ``smp.init`` has already been called for the - process, and ``False`` otherwise. - -.. 
function:: smp.is_tracing( ) - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smp.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smp.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smp.optimizers.FusedNovoGrad - :noindex: - - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smp.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - -.. data:: smp.optimizers.FusedLamb - :noindex: - - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smp.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. data:: smp.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smp.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. _pytorch_saving_loading: - :noindex: - -APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smp.save( ) - :noindex: - - Saves an object. This operation is similar to ``torch.save()``, except - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): A saved object. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``):  When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smp.save()``, you must - set ``partial=True``. - - ``pickle_module`` (picklemodule, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol``  (int, default=2): Can be specified to - override the defaultprotocol. - -.. function:: smp.load( ) - :noindex: - - Loads an object saved with ``smp.save()`` from a file. - - Similar to, `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize file). - - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. 
_pytorch_saving_loading_instructions: - :noindex: - -General Instructions for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The library can save partial or full checkpoints. - -- For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. -- For full checkpoints, the library saves a single checkpoint that contains - the entire model parameters. - -When **saving** using ``smp.save()``, each rank only holds its own -parameters. If you want to save the full model, there will be some -communication between the ranks to create the full model. If you save -checkpoints often, you should save partial checkpoints for best -performance. - -When **loading** using ``smp.load()``, the library can load either partial or -full checkpoints, or full checkpoints saved by a non-model-parallel model. If you -want to resume training with a non-model-parallel model or do inference, you need -a full checkpoint. - -The following is an example of how you can save and load a checkpoint: - -.. code:: python - - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...) - - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data races - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. - # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_pytorch_tensor_parallel.rst b/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_pytorch_tensor_parallel.rst deleted file mode 100644 index d481d32c15..0000000000 --- a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_pytorch_tensor_parallel.rst +++ /dev/null @@ -1,855 +0,0 @@ -.. _smdmp-pytorch-tensor-parallel: - :noindex: - -PyTorch API for Tensor Parallelism -================================== - -SageMaker distributed tensor parallelism works by replacing specific submodules -in the model with their distributed implementations. 
The distributed modules -have their parameters and optimizer states partitioned across tensor-parallel -ranks. This is to compute the same output as it would have been computed by -the original modules. Since tensor parallelism occurs across data-parallel -ranks, a rank might collect slices of the activations corresponding to the -data shards on other devices that are part of the same tensor parallelism group. - -You can enable or disable tensor parallelism for specific parts of the model. -Within the enabled parts, the replacements with distributed modules will take -place on a best-effort basis for those module supported for tensor parallelism. -Alternatively, you can directly import and use the library’s distributed -modules in the model definition. - -Some of the supported modules (such as ``smp.nn.Transformer``) are high-level -blocks that contain many operations. Because custom implementations -(as opposed to the built-in PyTorch modules) are typically used for these -high-level blocks, the library offers an API that you can use to register -specific distributed versions with such custom modules (provided that they -are functionally equivalent). This allows the library to automatically replace -the occurrences of such PyTorch modules with their distributed counterparts -provided by the library. -For more information, see the following topics. - -.. contents:: Topics - :depth: 3 - :local: - -.. _registering-tp-modules: - :noindex: - -Registering Tensor Parallelism Distributed Modules --------------------------------------------------- - -Although PyTorch natively provides some of the commonly used (and -tensor-parallelizable) building blocks such as Transformer, users often -use custom implementations for such higher-level modules. To distribute -such modules with tensor parallelism, you need to register the -distributed modules to the custom module implementation in your class, -so that the library knows how to distribute the custom module. When you -register the distributed modules, make sure the custom module that you -use is functionally equivalent to the distributed module. You can verify -this by taking a look at the equivalent reference implementations in the -:ref:`smdmp-tp-appendix`. -These implementations are functionally equivalent to their distributed -versions in ``smp.nn`` module. - -.. decorator:: @smp.tp_register(dist_module, init_hook=None, forward_hook=None, return_hook=None) - - - A class decorator that registers the ``dist_module`` class with - the module class that it is attached to. The hooks can be used to - adapt to different interfaces used with ``__init__`` and - ``forward`` methods. - - **Arguments:** - - - ``dist_module``: A subclass of ``smp.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smp.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. 
- - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. - - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer - # as the distributed version of MyTransformer - @smp.tp_register(smp.nn.DistributedTransformer, init_hook=init_hook) - class MyTransformer(nn.Module): - def __init__(self, config): - ... - - def forward(self, hidden_states, attention_mask): - ... - -.. function:: smp.tp_register_with_module(module_cls, dist_module, init_hook=None, forward_hook=None, return_hook=None) - :noindex: - - - When you do not have direct access to model definition code, you - can use this API to similarly register a distributed module with - an existing module class. - - - **Arguments:** - - - ``module_cls``: The existing module class that will be - distributed. - - ``dist_module``: A subclass of ``smp.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smp.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. - - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. - - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - from somelibrary import MyTransformer - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer as the distributed version of MyTransformer - smp.tp_register_with_module(MyTransformer, - smp.nn.DistributedTransformer, - init_hook=init_hook) - -.. 
_smdmp-supported-modules-for-tp: - :noindex: - -Supported Modules for Tensor Parallelism ----------------------------------------- - -The following modules are supported for tensor -parallelism. - -- ``smp.nn.DistributedLinear`` (implements ``nn.Linear``) -- ``smp.nn.DistributedTransformerLMHead`` -- ``smp.nn.DistributedTransformer`` -- ``smp.nn.DistributedTransformerLayer`` -- ``smp.nn.DistributedAttentionLayer`` -- ``smp.nn.DistributedTransformerOutputLayer`` -- ``smp.nn.DistributedEmbedding`` - -.. contents:: Topics - :depth: 3 - :local: - -.. _tp-module-api: - :noindex: - -Tensor Parallelism Module APIs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. class:: smp.nn.DistributedLinear(in_features, out_features) - :noindex: - - - Tensor-parallel implementation of the ``nn.Linear`` class. - Functionally equivalent to an ``nn.Linear`` module with the same - ``in_features`` and ``out_features``. In other words, - ``in_features`` and ``out_features`` are the number of *global* - channels across tensor-parallel ranks. - - **Arguments:** - - - ``in_features``: The total number of input channels for the - linear layer across all tensor-parallel ranks. - - ``out_features``: The total number of output channels for the - linear layer across all tensor-parallel ranks. - -.. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - - Constructs a distributed transformer model, including embeddings - and a single LM head. A word embedding of size - ``(vocab_size, hidden_size)`` is created, as well as a positional - embedding of size ``(num_positions, hidden_size)``, and the - embeddings are added together. If ``num_token_types`` is larger - than 0, a separate embedding of size - ``(num_token_types, hidden_size)`` is created, and further added - on top. - - The embeddings are fed through a ``DistributedTransformer``, and - if ``add_lm_head`` is ``True``, the output passes through a single - LM head, which is a linear module without bias whose weight is - tied to the word embeddings. - - See ``DistributedTransformerLayer`` for a description of the rest - of the arguments. - - **Methods:** - - - ``forward(self, inputs)`` - - - If ``add_cross_attention`` is ``True``, ``inputs`` must be a - tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, cross_states, cross_states, cross_mask, labels)``. - - Otherwise, ``inputs`` must be a tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, labels)``. - - If ``token_type_ids`` is ``None``, token type embedding will - not be used. - - ``input_ids`` is assumed to be of shape ``[N, S]``, where - ``N`` is the batch size and ``S`` is sequence length. - - ``attention_mask`` is assumed to be a 0-1 tensor of shape - ``[N, S]``, where 1 represents a masked position. - -.. 
class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - - A sequence of ``smp.nn.DistributedTransformerLayer``\ s, whose - number is given by ``num_layers`` argument. For the other - arguments and methods, refer to - ``smp.nn.DistributedTransformerLayer``. - - If both ``pre_layernorm`` and ``post_layernorm`` are ``True``, - layer normalization is applied to both the input and the output of - the ``DistributedTransformer``, in addition to the intermediate - attention and transformer-output layers. - -.. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - - Tensor-parallel implementation of a single transformer layer. - Number of attention heads, hidden size, and intermediate size - refer to the global quantities across all tensor-parallel ranks. - - **Arguments:** - - - ``num_attention_heads``: The total number of attention heads - across tensor-parallel ranks - - ``attention_head_size``: The number of channels of a single - attention head. - - ``hidden_size``: The hidden dimension of the transformer. The - input tensor ``hidden_states`` is assumed to have its last - dimension size equal to ``hidden_size``. - - ``intermediate_size``: The number of output channels in the - first linear transformation of the transformer output layer. - ``DistributedTransformerOutputLayer`` first maps - ``hidden_size`` dimensions of its input tensor into - ``intermediate_size`` dimensions, and then maps it back into - ``hidden_size`` dimensions. - - ``attention_dropout_prob``: The dropout probability applied to - the attention probabilities. - - ``hidden_dropout_prob``: The dropout probability used in - dropout layers other than the one applied to the attention - probabilities. - - ``activation``: Choice of activation function to use at the - output layer. Must be ``"gelu"`` or ``"relu"``. - - ``layernorm_epsilon``: The epsilon added to the denominator of - layer normalization for numerical stability. - - ``initializer_range``: If ``use_normal_initialization`` is - ``True``, the standard deviation of the normal random variable - to initialize the weights with. - - ``use_normal_initialization``: If ``True``, the weights are - initialized with normal distribution with standard deviation - given by ``initializer_range``. Otherwise, default PyTorch - initialization is used. - - ``causal_mask_size``: If ``None``, no causal mask is used on - attentions. Otherwise, should be set to maximum sequence length - to apply a causal mask to the attention scores. This is used, - for instance, in GPT-2. - - ``add_cross_attention``: If ``True``, a cross-attention layer - will be added after the self-attention block. The - cross-attention layer computes the attention keys and values - based on the ``cross_states`` input (instead of - ``hidden_states`` input, as in self-attention. This is used in - the decoder block of encoder-decoder architectures. 
For - encoder-only architectures that only use self-attention, this - should be kept ``False``. - - ``pre_layernorm``: If ``True``, inserts layer normalization at - the input. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - ``post_layernorm``: If ``True``, inserts layer normalization at - the output. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the transformer - layer. - - - **Arguments:** - - - If ``add_cross_attention=False``, ``inputs`` must be a - tuple ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, where ``N`` is the batch - size, and ``S`` is the sequence length. - - If ``add_cross_attention=True``, ``inputs`` must be a - tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is assumed to be a tensor of - dimensions ``[N, S_1, H]``, where ``N`` is batch size, - ``S_1`` is sequence length, and ``H`` is ``hidden_size``. - ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_1]``, where ``N`` is the batch - size, and ``S_1`` is the sequence length, and - ``cross_mask`` is assumed to be a tensor of size - ``[N, 1, 1, S_2]``. Keys and values for the attention - heads in the cross-attention layer (but not the - self-attention layer) are computed using - ``cross_states``, and ``cross_mask`` is applied as the - attention mask in the cross-attention layer (but not the - self-attention layer). - - - **Returns:** - - - If ``add_cross_attention=False``, a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is the output of the transformer, and - ``attention_mask`` is the same the ``attention_mask`` - argument. - - If ``add_cross_attention=True``, a tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is the output of the transformer, - and the next three tensors are the same as the input - arguments. - -.. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True) - :noindex: - - - A distributed implementation for the attention block. Includes the - computation of the self- or cross-attention (context layer), - followed by a linear mapping and dropout, which is optionally - followed by the residual-connection and layer normalization. - - **Arguments:** - - - See ``DistributedTransformerLayer`` for a description of the - arguments. - - If ``cross_attention`` is ``True``, computes the attentions - with respect to the ``cross_states`` tensor of the ``forward`` - method input tuple. - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the attention - layer. 
- - - **Arguments:** - - - If ``cross_attention=False``, ``inputs`` must be a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, \***\* where ``N`` is the - batch size, and ``S`` is the sequence length. - - If ``cross_attention=True``, ``inputs`` must be a tuple - ``(hidden_states, cross_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S_1, H]``, where ``N`` is batch size, ``S_1`` is - sequence length, and ``H`` is ``hidden_size``. - ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_2]``, where ``N`` is the batch - size, and ``S_2`` is the sequence length. Keys and values - for the attention heads are computed using - ``cross_states``. - - - **Returns:** - - - A single tensor that is the output of the attention - layer. - -.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - - Distributed implementation of a single transformer output layer. A - single ``DistributedTransformerLayer`` with - ``add_cross_attention=False`` consists of a single - ``DistributedAttentionLayer`` immediately followed by a single - ``DistributedTransformerOutputLayer``. The latter linearly maps - the last channel of the input tensor from ``hidden_size`` to - ``intermediate_size``, and then maps it back to ``hidden_size``. - - **Arguments:** - - - See ``DistributedTransformerLayer`` for a description of the - arguments. - -.. class:: smp.nn.DistributedEmbedding(num_embeddings,embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False,_skip_scatter_and_merge=False,) - :noindex: - - - Distributed implementation of a single Embedding Layer. Currently - only supports splitting across the embedding_dim. - - **Arguments:** - - - See ``DistributedEmbedding`` for a description of the - arguments. - -.. _enabling-tp: - :noindex: - -Enabling Tensor Parallelism -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -There are two ways tensor parallelism can be enabled. - -First, you can use -the distributed module implementations in ``smp.nn`` module directly in -your model definition. See :ref:`smdmp-supported-modules-for-tp` -for a complete list of built-in distributed modules. Here is an example -of how this can be done: - -.. 
code:: python - - import torch.nn as nn - import smdistributed.modelparallel.torch as smp - - class TransformerModel: - def __init__(self): - self.embedding = nn.Embedding(vocab_size, hidden_size) - - # directly instantiate smp.nn.DistributedTransformer and use it - self.encoder = smp.nn.DistributedTransformer(num_layers, hidden_size, **kwargs) - - self.pooler = nn.Linear(hidden_size, hidden_size) - - def forward(self, hidden_states): - emb_out = self.embedding(hidden_states) - enc_out = self.encoder(emb_out) - return self.pooler(enc_out) - -Second, you can enable tensor parallelism for specific modules or blocks -of code, which will automatically enable tensor parallelism for the -supported modules within that scope. To do this, you can use the -following API: - -.. decorator:: smp.tensor_parallelism(enabled=True, **kwargs) - :noindex: - - - A context manager that enables or disables tensor parallelism for - any supported module that is created inside. If there are nested - contexts, the innermost will override the rest. If there are - multiple supported modules created within the context, where one - is the submodule of the other, only the outermost module will be - distributed. If a supported module shares weights with another - (supported or unsupported) module, or if its hyperparameters do - not support distribution (e.g., not divisible by the tensor - parallelism degree), tensor parallelism will **not** be enabled - for this module even if this API is used. - - **Example:** - - .. code:: python - - with smp.tensor_parallelism(): - self.m0 = nn.Linear(20, 20) # will be distributed - with smp.tensor_parallelism(enabled=False): - self.m1 = nn.Linear(20, 20) # will not be distributed - - - Keyword arguments `kwargs` can be used to modify the configurations of the distributed modules created inside the context. If a keyword argument provided here matches any `__init__` method arguments of a `DistributedModule` that substitutes a module created inside the `smp.tensor_parallelism` context, this keyword will override the value defined in the `init_hook`. - -.. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs) - :noindex: - - - Enables or disables tensor parallelism for the supported - submodules of ``module``. If enabling, the outermost supported - modules will be distributed. If disabling, tensor parallelism will - be disabled for the entire module subtree of ``module``. Unlike - the context manager, this API can be used after the model creation - (but before wrapping with :class:`smp.DistributedModel`), so direct - access to model definition code is not required. If a supported - module shares weights with another (supported or unsupported) - module, or if its hyperparameters do not support distribution - (e.g., not divisible by the tensor parallelism degree), tensor - parallelism will **not** be enabled for this module. - - Keyword arguments ``kwargs`` can be used to modify the - configurations of the distributed modules created inside the - context. If a keyword argument provided here matches any - ``__init__`` method arguments of a :class:`smp.DistributedModel` that - substitutes a module created inside the ``smp.tensor_parallelism`` - context, this keyword will override the value defined in the - ``init_hook``. - - **Example:** - - .. 
code:: python - - model = MyModel() - smp.set_tensor_parallelism(model.encoder, True) - smp.set_tensor_parallelism(model.encoder.embedding, True) - - # outermost supported submodules in model.encoder will be distributed, except for - # model.encoder.embedding - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - -.. _activation-checkpointing-api: - :noindex: - -Activation Checkpointing APIs ------------------------------ - -``smdistributed.modelparallel`` provides three APIs to enable -activation checkpointing: one for checkpointing modules, -one for checkpointing sequential modules, and -one for checkpointing pretrained models. - -For a conceptual guide and examples, see -`Activation Checkpointing `_ -in the *SageMaker's Distributed Model Parallel developer guide*. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint(module, *args, preserve_rng_state=True) - :noindex: - - - Checkpoints the module passed. Throws error if, during manual - partitioning, all children of module are not on same rank as the - module itself, i.e. the module tree is split across multiple - partitions. During auto-partitioning, if the module is split - across multiple partitions, then this call is ignored(with a - warning). Note that this call applies to the module instance only, - not to the module class. - - - **Arguments:** - - - ``module (Instance of nn.Module)``: The module to be - checkpointed. Note that unlike native checkpointing in - PyTorch’s, activation checkpointing in - ``smdistributed.modelparallel`` is at the granularity of a - module. A generic function cannot be passed here. - - ``args``: Tuple containing inputs to the module. - - ``preserve_rng_state (bool, default=True)``: Omit stashing and - restoring the RNG state during each checkpoint. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint_sequential(sequential_module, input, strategy="each", preserve_rng_state=True, pack_args_as_tuple=False) - :noindex: - - - Checkpoints the modules inside - `nn.Sequential `__. - This can be used even if different layers that are part of the - sequential container lie on different partitions. Each layer part - of the sequential module that is checkpointed must lie completely - within one partition. If this is not the case during manual - partitioning, then an error will be thrown. If this is not the - case during auto partitioning, a warning will be raised and this - module will be run without checkpointing. - - - **Arguments** - - - ``sequential_module (nn.Sequential)``: the sequential module to - be checkpointed. - - ``input (torch.Tensor or a tuple of torch.Tensors)``: input to - the module, which can be a tensor or a tuple of tensors. If a - tuple is passed, then pack_args_as_tuple should be set to True. - - ``strategy (string, default=“each”)`` : Strategy determines how - many layers part of the sequential module need to be grouped - together for one checkpointing call. This determines how much - memory can be reduced. It can take the following values - - - ``each`` : The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example, if a sequential consists of - [a, b, c, d] where a,b are on pp_rank0 and c,d are on - pp_rank 1, then this strategy would checkpoint a,b together - and then c,d together. 
This means effectively, inputs of a, - outputs of b, inputs of c, and outputs of d are in memory; - the reamining activations are recomputed. - - ``group_2, group_3, group_4, etc:`` More generally, - ``group_x`` where x is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups x layers together on a best effort basis. - It can group x layers together if there are x layers - consecutively on the same partition. For example: - [a,b,c,d,e] where a,b are on pp_rank0 and c,d,e are on - pp_rank 1. If the strategy is ``group_3,`` then a,b are - checkpointed together on pp_rank0 and c,d,e are checkpointed - together on pp_rank1. - - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the checkpointed layer takes a tuple as - input, then this needs to be set to True. - -.. class:: smp.set_activation_checkpointing(module, preserve_rng_state=True, pack_args_as_tuple=False, strategy="each") - :noindex: - - - This API is recommended when importing pretrained models from - libraries, such as PyTorch and Hugging Face Transformers. This is - particularly useful when you don’t have access to the model - definition code and not be able to replace a module call with - checkpoint. - - - **Arguments**: - - - ``module (Instance of nn.Module or nn.Sequential)``: The module - to checkpoint. - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: *Can only be - passed when module is a sequential module.* To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the layer checkpointed takes a tuple as - input, then this needs to be set to True. - - ``strategy: (string, default=“each”)``: *Can only be passed - when module is a sequential module.* Strategy determines how - many layers part of the sequential module need to be grouped - together for one checkpointing call. - - This determines how much memory can be reduced. It can take the - following values - - - ``each`` : The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example if a sequential consists of - ``[a, b, c, d]`` where ``a, b`` are on ``pp_rank0`` and ``c, d`` are on - ``pp_rank 1``, then this strategy would checkpoint a,b together - and then ``c, d`` together. This means effectively, the inputs of - ``a``, outputs of ``b``, inputs of ``c``, and outputs of ``d`` are in - memory, and the rest of the activations are recomputed. - - ``group_2, group_3, group_4, etc:`` More generally, - ``group_x`` where x is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups x number of layers together on a best - effort basis if there are x layers consecutively in the same - partition. **Example**: Assume a module with layers ``[a, b, - c, d, e]``. The layers a and b are on pp_rank0, and ``c``, ``d``, and - ``e`` are on ``pp_rank 1``. If the strategy is ``group_3,`` then ``a``, - ``b`` are checkpointed together on ``pp_rank0``, and ``c``, ``d``, ``e`` are - checkpointed together on ``pp_rank1``. - -.. 
_smdmp-tp-appendix: - :noindex: - -Appendix: Reference Implementations for Modules ------------------------------------------------ - -The following are reference implementations for transformer-related -modules. Note that this is not the actual ``smdistributed`` source code, -but the distributed implementations provided in the library are the -distributed versions of these reference implementations, and can be used -to determine whether the distributed modules perform the same operations -as the custom modules in your script. - -To keep the implementations simple, we only assume keyword arguments, -and assume the existence of a method ``parse_args(kwargs)``, which -parses the arguments to ``__init__`` methods and sets the relevant -attributes of the module, such as ``hidden_size`` and -``num_attention_heads``. - -``smp.nn.DistributedTransformer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class Transformer(nn.Module): - def __init__(self, **kwargs): - super(Transformer, self).__init__() - self.parse_args(kwargs) - - self.layers = [] - for l in range(self.num_layers): - self.layers.append(TransformerLayer(**kwargs)) - - self.seq_layers = nn.Sequential(*self.layers) - - def forward(self, inp): - return self.seq_layers(inp) - -``smp.nn.DistributedTransformerLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class TransformerLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerLayer, self).__init__() - self.parse_args(kwargs) - - self.attention = AttentionLayer(**kwargs) - self.output = TransformerOutputLayer(**kwargs) - - if self.add_cross_attention: - self.cross_attention = AttentionLayer(cross_attention=True, **kwargs) - - def forward(self, inp): - if self.add_cross_attention: - hidden_states, cross_states, attention_mask, cross_mask = inp - else: - hidden_states, attention_mask = inp - - attention_output = self.attention((hidden_states, attention_mask)) - if self.add_cross_attention: - attention_output = self.cross_attention((attention_output, - cross_states, - cross_mask)) - - output = self.output(attention_output) - - if self.add_cross_attention: - return output, cross_states, attention_mask, cross_mask - else: - return output, attention_mask - -``smp.nn.DistributedAttentionLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: python - - class AttentionLayer(nn.Module): - def __init__(self, **kwargs): - super(AttentionLayer, self).__init__() - self.parse_args(kwargs) - self.attention_head_size = self.hidden_size // self.num_attention_heads - - self.query = nn.Linear(self.hidden_size, self.hidden_size) - self.key = nn.Linear(self.hidden_size, self.hidden_size) - self.value = nn.Linear(self.hidden_size, self.hidden_size) - self.dense = nn.Linear(self.hidden_size, self.hidden_size) - - self.dropout1 = nn.Dropout(self.attention_dropout_prob) - self.dropout2 = nn.Dropout(self.hidden_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def transpose(self, tensor, key=False): - shape = tensor.size()[:-1] + - (self.num_attention_heads, self.attention_head_size) - tensor = torch.reshape(tensor, shape) - if key: - return tensor.permute(0, 2, 3, 1) - else: - return tensor.permute(0, 2, 1, 3) - - def forward(self, inp): - if self.cross_attention: - hidden_states, cross_states, attention_mask = inp - else: - hidden_states, attention_mask = inp - - if self.pre_layernorm: - norm_states = self.pre_layernorm(hidden_states) - else: - norm_states = hidden_states - - query_layer = self.query(norm_states) - - if self.cross_attention: - key_layer = self.key(cross_states) - value_layer = self.value(cross_states) - else: - key_layer = self.key(norm_states) - value_layer = self.value(norm_states) - - query_layer = self.transpose(query_layer) - key_layer = self.transpose(key_layer, key=True) - value_layer = self.transpose(value_layer) - - attention_scores = torch.matmul(query_layer, key_layer) - attention_scores = attention_scores / math.sqrt(self.attention_head_size) - - if not self.cross_attention and self.causal_mask is not None: - attention_scores = self.apply_causal_mask(attention_scores) - - attention_scores = attention_scores + attention_mask - - attention_probs = F.softmax(attention_scores, dim=-1) - attention_probs = self.dropout1(attention_probs) - - context_layer = torch.matmul(attention_probs, value_layer) - context_layer = context_layer.permute(0, 2, 1, 3) - new_context_layer_shape = context_layer.size()[:-2] + \ - (self.local_attention_size,) - context_layer = torch.reshape(context_layer, new_context_layer_shape) - - self_attention = self.dense(context_layer) - self_attention = self.dropout2(self_attention) - - if self.post_layernorm: - return self.layernorm(self_attention + hidden_states) - else: - return self_attention - -``smp.nn.DistributedTransformerOutputLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: python - - class TransformerOutputLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerOutputLayer, self).__init__() - self.parse_args(kwargs) - - self.dense1 = nn.Linear(self.hidden_size, self.intermediate_size) - self.dense2 = nn.Linear(self.intermediate_size, self.hidden_size) - - self.dropout = nn.Dropout(self.attention_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def forward(self, inp): - if self.pre_layernorm: - norm_inp = self.pre_layernorm(inp) - else: - norm_inp = inp - - dense1_output = self.dense1(norm_inp) - if self.activation == "gelu": - act_output = F.gelu(dense1_output) - else: - act_output = F.relu(dense1_output) - - dense2_output = self.dense2(act_output) - output = self.dropout(dense2_output) - - if self.post_layernorm: - return self.layernorm(inp + output) - else: - return output diff --git a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 6630371b94..0000000000 --- a/doc/api/training/smp_versions/v1.6.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,171 +0,0 @@ -TensorFlow API -============== - -To use the TensorFlow-specific APIs for SageMaker distributed model parallism, -you need to add the following import statement at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following APIs in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. - - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. 
The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] - x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - -.. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. - - .. code:: python - - @smp.register_post_partition_hook - def test_eager(): - # All statements here will be executed right after partition but before the first forward pass - tf.print("Entered hook through eager context") - -.. class:: smp.CheckpointManager - :noindex: - - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. 
code:: python - - for step, inputs in enumerate(train_ds): -     if step == 0: -         ckpt_manager.restore() -     loss = train_step(inputs) diff --git a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_common_api.rst b/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_common_api.rst deleted file mode 100644 index b4713b2707..0000000000 --- a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_common_api.rst +++ /dev/null @@ -1,538 +0,0 @@ -Common API -========== - -The following SageMaker distribute model parallel APIs are common across all frameworks. - -.. contents:: Table of Contents - :depth: 3 - :local: - -The Library's Core APIs ------------------------ - -This API document assumes you use the following import statement in your training scripts. - -**TensorFlow** - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -**PyTorch** - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. function:: smp.init( ) - :noindex: - - Initialize the library. Must be called at the beginning of training script. - -.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs]) - :noindex: - - A decorator that must be placed over a function that represents a single - forward and backward pass (for training use cases), or a single forward - pass (for evaluation use cases). Any computation that is defined inside - the ``smp.step``-decorated function is executed in a pipelined manner. - - By default, every tensor input to the function is split across its batch - dimension into a number of microbatches specified while launching the - training job. This behavior can be customized through the arguments to - ``smp.step``, described below. The library then orchestrates the execution of - each microbatch across all partitions, based on the chosen pipeline - type. - - In a typical use case, forward pass and back-propagation are executed - inside an \ ``smp.step``-decorated function and gradients, loss, and - other relevant metrics (such as accuracy, etc.) are returned from - ``smp.step``-decorated function. - - Any gradient post-processing operation, such as gradient clipping and - allreduce, as well as ``optimizer.apply_gradients`` calls (for TF) or - ``optimizer.step`` (for PT) should be applied on the gradients returned - from the ``smp.step`` function, and not inside the ``smp.step`` - function. This is because every operation inside ``smp.step`` is - executed once per microbatch, so having these operations inside - ``smp.step`` can either be inefficient (in the case of allreduce), or - lead to wrong results (in the case of ``apply_gradients`` / - ``optimizer.step``). - - If the objects returned from the ``smp.step``-decorated function contain - ``tf.Tensor``\ s / ``torch.Tensor``\ s, they are converted to - ``StepOutput`` objects. A ``StepOutput`` object encapsulates all - versions of the tensor across different microbatches - (see ``StepOutput`` entry for more information). - - The argument to ``smp.step`` decorated function should either be a tensor - or an instance of list, tuple, dict or set for it to be split across - microbatches. If your object doesn't fall into this category, you can make - the library split your object, by implementing ``smp_slice`` method. - - Below is an example of how to use it with PyTorch. - - .. 
code:: python - - class CustomType: - def __init__(self, tensor): - self.data = tensor - - # The library will call this to invoke slicing on the object passing in total microbatches (num_mb) - # and the current microbatch index (mb). - def smp_slice(self, num_mb, mb, axis): - dim_size = list(self.data.size())[axis] - - split_size = dim_size // num_mb - sliced_tensor = self.data.narrow(axis, mb * split_size, split_size) - return CustomType(sliced_tensor, self.other) - - custom_obj = CustomType(torch.ones(4,)) - - @smp.step() - def step(custom_obj): - loss = model(custom_obj) - model.backward(loss) - return loss - - - **Important:** ``smp.step`` splits the batch into microbatches, and - executes everything inside the decorated function once per microbatch. - This might affect the behavior of batch normalization, any operation - that explicitly uses the batch size information, or any other Python - code that is expected to run once. - - **TensorFlow-specific behavior** - - ``smp.step`` is a wrapper that - inherits from and extends the behavior of ``tf.function``, and as such, - all the caveats that apply to the use of ``tf.function``\ s also apply - to ``smp.step``. In particular, any operation that is inside - ``smp.step`` executes in graph mode, and not eager mode. - - In the first call, ``smp.step`` performs tracing of the wrapped function every time - one of the tensor arguments changes their shape or dtype, or for every - new value of a Python argument, if there is one. Tracing is expensive, - so such scenarios should be avoided as much as possible or, - alternatively, an ``input_signature`` argument must be provided. For - more information on the usage of ``tf.function``, refer to the - TensorFlow documentation: - - - https://www.tensorflow.org/api_docs/python/tf/function\ - - https://www.tensorflow.org/guide/function\ - - Each ``smp.step`` decorated function must have a return value that depends on the - output of ``smp.DistributedModel``. - - **Common parameters** - - - ``non_split_inputs`` (``list``): The list of arguments to the decorated function - that should not be split along the batch dimension. Should be used - for all input tensors that do not have a batch dimension. Should be a - list of argument names as ``str``, as they appear in the signature of - the ``smp.step``-decorated function. By default it is considered an - empty list. - - - ``input_split_axes`` (``dict``): A dict that maps the argument name to its batch - axis. The keys should be the argument names as ``str``, as they - appear in the signature of the ``smp.step``-decorated function.  By - default all batch axes are assumed to be the 0-axis. - - **TensorFlow-only parameters** - - - All arguments of ``tf.function``. Note: - The \ ``experimental_compile`` argument of ``tf.function`` may not - work as expected with ``smp.step``, since it interferes with - pipelining and model partitioning. To enable XLA with the library, you can - instead use \ ``tf.config.optimizer.set_jit(True)``. - - **PyTorch-only parameters** - - - ``detach_outputs`` (``bool``) : If ``True``, calls ``torch.Tensor.detach()`` on - all returned ``torch.Tensor`` outputs. Setting it to ``False`` - increases memory consumption, unless ``detach()`` is manually called - on the returned tensors, because the model graph is not cleared from - memory after the training step. Set to \ ``True`` by default. - - **Returns** - - - The same object(s) returned from the decorated function. 
All - returned \ ``tf.Tensor``, \ ``tf.Variable``  objects (for TF) or - ``torch.Tensor`` objects (for PT) are wrapped inside - a \ ``StepOutput`` object, even when they are inside a Python - ``list``, ``tuple``, or ``dict``. - - - -.. class:: StepOutput - :noindex: - - - A class that encapsulates all versions of a ``tf.Tensor`` - or \ ``torch.Tensor`` across all microbatches. - - When a particular ``tf.Tensor`` or ``torch.Tensor`` is computed inside - ``smp.step``, different versions of the tensor are computed for each - microbatch. - - When this tensor is returned from ``smp.step`` and is accessed outside - of the decorated function, it appears as a ``StepOutput`` object, which - contains all such versions. For example, - - - In the case of Tensorflow, the gradient for a particular - ``tf.Variable`` is computed on each microbatch individually, and if - this gradient is returned from ``smp.step``, all gradients for this - ``tf.Variable`` become part of the same ``StepOutput`` object. The - ``StepOutput`` class offers the following API for commonly-used - post-processing operations on such tensors. - - In the case of PyTorch, the loss for each microbatch is computed - individually and all the ``torch.Tensor``\ s that represent the loss - for different microbatches become part of same ``StepOutput`` object, - if loss is returned from the ``smp.step`` function. - - - The ``StepOutput`` class offers the following API for commonly-used - post-processing operations on tensors. - - .. data:: StepOutput.outputs - :noindex: - - Returns a list of the underlying tensors, indexed by microbatch. - - .. function:: StepOutput.reduce_mean( ) - :noindex: - - Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s - ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches. - - .. function:: StepOutput.reduce_sum( ) - :noindex: - - Returns a ``tf.Tensor`` / - ``torch.Tensor`` that sums the constituent - ``tf.Tensor``\ s/\ ``torch.Tensor``\ s. - - .. function:: StepOutput.concat( ) - :noindex: - - Returns a - ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the - batch dimension using ``tf.concat`` / ``torch.cat``. - - .. function:: StepOutput.stack( ) - :noindex: - - Applies ``tf.stack`` / ``torch.stack`` - operation to the list of constituent ``tf.Tensor``\ s / - ``torch.Tensor``\ s. - - **TensorFlow-only methods** - - .. function:: StepOutput.merge( ) - :noindex: - - Returns a ``tf.Tensor`` that - concatenates the constituent ``tf.Tensor``\ s along the batch - dimension. This is commonly used for merging the model predictions - across microbatches. - - .. function:: StepOutput.accumulate(method="variable", var=None) - :noindex: - - Functionally the same as ``StepOutput.reduce_mean()``. However, it is - more memory-efficient, especially for large numbers of microbatches, - since it does not wait for all constituent \ ``tf.Tensor``\ s to be - ready to start averaging them, thereby saving memory. - - In some cases (XLA for example) ``StepOutput.reduce_mean()`` might end - up being more memory-efficient than ``StepOutput.accumulate()``. - - **Parameters** - - - ``method`` (``"add_n"`` or ``"accumulate_n"`` or ``"variable"``): - If ``"add_n"`` or ``"accumulate_n"``, the library uses - ``tf.add_n`` and ``tf.accumulate_n``, respectively, to implement - accumulation. If ``"variable"``, the library uses an internal ``tf.Variable`` - into which to accumulate the tensors. Default is \ ``"variable"``. 
- Note: Memory usage behavior of these choices can depend on the model - and implementation. - - - ``var``: A ``tf.Variable`` into which, if provided, the library uses to - accumulate the tensors. If \ ``None``, the library internally creates a - variable. If ``method`` is not ``"variable"``, this argument is - ignored. - -.. _mpi_basics: - :noindex: - -MPI Basics ----------- - -The library exposes the following basic MPI primitives to its Python API: - -**Global** - -- ``smp.rank()`` : The global rank of the current process. -- ``smp.size()`` : The total number of processes. -- ``smp.get_world_process_group()`` : - ``torch.distributed.ProcessGroup`` that contains all processes. -- ``smp.CommGroup.WORLD``: The communication group corresponding to all processes. -- ``smp.local_rank()``: The rank among the processes on the current instance. -- ``smp.local_size()``: The total number of processes on the current instance. -- ``smp.get_mp_group()``: The list of ranks over which the current model replica is partitioned. -- ``smp.get_dp_group()``: The list of ranks that hold different replicas of the same model partition. - -**Tensor Parallelism** - -- ``smp.tp_rank()`` : The rank of the process within its - tensor-parallelism group. -- ``smp.tp_size()`` : The size of the tensor-parallelism group. -- ``smp.get_tp_process_group()`` : Equivalent to - ``torch.distributed.ProcessGroup`` that contains the processes in the - current tensor-parallelism group. -- ``smp.CommGroup.TP_GROUP`` : The communication group corresponding to - the current tensor parallelism group. - -**Pipeline Parallelism** - -- ``smp.pp_rank()`` : The rank of the process within its - pipeline-parallelism group. -- ``smp.pp_size()`` : The size of the pipeline-parallelism group. -- ``smp.get_pp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current pipeline-parallelism group. -- ``smp.CommGroup.PP_GROUP`` : The communication group corresponding to - the current pipeline parallelism group. - -**Reduced-Data Parallelism** - -- ``smp.rdp_rank()`` : The rank of the process within its - reduced-data-parallelism group. -- ``smp.rdp_size()`` : The size of the reduced-data-parallelism group. -- ``smp.get_rdp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current reduced data parallelism - group. -- ``smp.CommGroup.RDP_GROUP`` : The communication group corresponding - to the current reduced data parallelism group. - -**Model Parallelism** - -- ``smp.mp_rank()`` : The rank of the process within its model-parallelism - group. -- ``smp.mp_size()`` : The size of the model-parallelism group. -- ``smp.get_mp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current model-parallelism group. -- ``smp.CommGroup.MP_GROUP`` : The communication group corresponding to - the current model parallelism group. - -**Data Parallelism** - -- ``smp.dp_rank()`` : The rank of the process within its data-parallelism - group. -- ``smp.dp_size()`` : The size of the data-parallelism group. -- ``smp.get_dp_process_group()`` : ``torch.distributed.ProcessGroup`` - that contains the processes in the current data-parallelism group. -- ``smp.CommGroup.DP_GROUP`` : The communication group corresponding to - the current data-parallelism group. - -.. _communication_api: - :noindex: - -Communication API ------------------ - -The library provides a few communication primitives which can be helpful while -developing the training script. 
These primitives use the following -``enum`` s as arguments to specify which processes the communication -should involve. -​ - -**Helper structures** - -.. data:: smp.CommGroup - :noindex: - - An ``enum`` that takes the values - ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``. - These values can also be accessed as ``smp.WORLD``, ``smp.MP_GROUP``, - and ``smp.DP_GROUP`` respectively. - - - ``CommGroup.WORLD``: Represents the entire group of processes used in - training - - ``CommGroup.MP_GROUP``: Represents the group of processes that hold - the same model replica as the current process. The processes in a - single ``MP_GROUP`` collectively store an entire replica of the - model. - - ``CommGroup.DP_GROUP``: Represents the group of processes that hold - the same model partition as the current process. The processes in a - single ``DP_GROUP`` perform data parallelism/allreduce among - themselves. - -.. data:: smp.RankType - :noindex: - - An ``enum`` that takes the values - ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``. - - - ``RankType.WORLD_RANK``: The associated rank is to be interpreted as - the rank of the process across all processes used in training. - - ``RankType.MP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``MP_GROUP``. - - ``RankType.DP_RANK``: The associated rank is to be interpreted as the - rank of the process within the ``DP_GROUP``. - - -**Communication primitives:** - -.. function:: smp.broadcast(obj, group) - :noindex: - - Sends the object to all processes in the - group. The receiving process must call ``smp.recv_from`` to receive the - sent object. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be broadcast. - - - ``group``: A ``CommGroup`` argument that represents to which group of - processes the object will be sent. - - **Notes** - - - When you use ``broadcast`` on the sender process, there needs - to be an accompanying ``smp.recv_from()`` call on the receiver - processes. - - - This is a synchronous call; the ``broadcast`` statement - returns only after all ranks participating in the call have made a - matching ``recv_from`` call. - - **Example** - - .. code:: python - - if smp.rank() == 0: -     smp.broadcast(something, group=smp.CommGroup.WORLD) - else: -     smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK) - -.. function:: smp.send(obj, dest_rank, rank_type) - :noindex: - - Sends the object ``obj`` to - ``dest_rank``, which is of a type specified by ``rank_type``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be sent. - - - ``dest_rank`` (``int``): An integer denoting the rank of the receiving process. - - - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``dest_rank`` is to be interpreted. For example if ``dest_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then ``obj`` is sent to process - with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the current - process. - - **Notes** - - - Note: \ This is a synchronous call; the ``send`` statement returns - only after the destination rank has made a matching - ``recv_from`` call. - -.. function:: smp.recv_from(src_rank, rank_type) - :noindex: - - Receive an object from a peer process. Can be used with a matching - ``smp.send`` or a ``smp.broadcast`` call. - - **Inputs** - - - ``src_rank`` (``int``): An integer denoting rank of the sending process. 
- - ``rank_type`` (``enum``): A ``smp.RankType`` ``enum`` that determines how - ``src_rank`` is to be interpreted. For example, if ``src_rank`` is 1 - and ``rank_type`` is ``MP_RANK``, then the object is received from - the process with ``mp_rank`` 1 in the ``MP_GROUP`` which contains the - current process. - - **Returns** - - Returns the Python object that is sent by the peer process. - - **Notes** - - - Note: This is a synchronous call; the ``recv_from`` statement returns - only after the source rank has made a matching ``send`` or - ``broadcast`` call, and the object is received. - -.. function:: smp.allgather(obj, group) - :noindex: - - A collective call that gathers all the - submitted objects across all ranks in the specified ``group``. Returns a - list whose ``i``\ th index contains the object submitted by the - ``i``\ th rank in ``group``. - - **Inputs** - - - ``obj``: An arbitrary picklable Python object that will be - allgathered. - - - ``group`` : A ``CommGroup`` argument that represents which group of - processes participate in ``allgather``. - - **Notes** - - - Note: This is a synchronous call; the ``allgather`` statement returns - only after all ranks participating in the call have made a matching - ``allgather`` call, and all the objects are received at the current - rank. - - **Examples** - - .. code:: python - - # assuming mp_size() == 2 - - if smp.mp_rank() == 0: -     out = smp.allgather(obj1, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - else: -     out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2] - -.. function:: smp.barrier(group=smp.WORLD) - :noindex: - - A statement that blocks until all - processes in the specified group reach the barrier statement, similar to - ``MPI_Barrier()``. - - **Inputs** - - - ``group``: An ``smp.CommGroup`` ``enum`` that specifies the group of - processes participating in the barrier call. Defaults to - ``smp.WORLD``. - - **Examples** - - - Assume there are 8 processes and 2 model partitions, and - therefore 4 \ ``mp_group``\ s, and 2 ``dp_group``\ s. If - the \ ``barrier`` call is passed the value ``smp.MP_GROUP`` for its - group argument, then each process only waits until the other process - of its own ``mp_group`` reaches that point. It does not wait for - processes outside that ``mp_group``. - -.. function:: smp.dp_barrier() - :noindex: - - Same as passing ``smp.DP_GROUP`` to ``smp.barrier()``. - Waits for the processes in the same \ ``dp_group`` as - the current process to reach the same point in execution. - -.. function:: smp.mp_barrier() - :noindex: - - Same as passing ``smp.MP_GROUP`` to - ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as - the current process to reach the same point in execution. diff --git a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_pytorch.rst b/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_pytorch.rst deleted file mode 100644 index 88d1a42165..0000000000 --- a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_pytorch.rst +++ /dev/null @@ -1,677 +0,0 @@ -PyTorch API -=========== - -To use the PyTorch-specific APIs for SageMaker distributed model parallelism, -you need to add the following import statement at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.torch as smp - - -.. tip:: - - Refer to - `Modify a PyTorch Training Script - `_ - to learn how to use the following API in your PyTorch training script. - -..
class:: smp.DistributedModel - :noindex: - - A sub-class of ``torch.nn.Module`` which specifies the model to be - partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is - the model to be partitioned. The returned ``DistributedModel`` object - internally manages model parallelism and data parallelism. Only one - model in the training script can be wrapped with - ``smp.DistributedModel``. - - **Example:** - - .. code:: python - - model = smp.DistributedModel(model) - - **Important**: The ``__call__`` and  ``backward`` method calls on the - ``smp.DistributedModel`` object (in the following example, the object - is \ ``model``) can only be made inside a ``smp.step``-decorated - function. - - Since ``DistributedModel``  is a ``torch.nn.Module``, a forward pass can - be performed by calling the \ ``DistributedModel`` object on the input - tensors. - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - For a backward pass, one needs to call the backward function on - the \ ``DistributedModel`` object, with tensors and gradients as - arguments, replacing the PyTorch operations \ ``torch.Tensor.backward`` - or ``torch.autograd.backward``. - - The API for ``model.backward`` is very similar to - ``torch.autograd.backward``. For example, the following - ``backward`` calls: - - .. code:: python - - torch.autograd.backward(loss) or loss.backward() - - should be replaced with: - - .. code:: python - - model.backward(loss) # loss is a tensor with only one element as its data - - Similarly, for non-scalar tensors, replace the following - ``backward`` call containing incoming gradient arguments: - - .. code:: python - - torch.autograd.backward(outputs, out_grads) - - with the following line: - - .. code:: python - - model.backward(outputs, out_grads) - - In these examples, all ``__call__``  and ``backward`` method calls on - the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside - a ``smp.step``-decorated function. - - **Using DDP** - - If DDP is enabled with the SageMaker model parallel library, do not not place a PyTorch - ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because - the ``DistributedModel`` wrapper will also handle data parallelism. - - Unlike the original DDP wrapper, when you use ``DistributedModel``, - model parameters and buffers are not immediately broadcast across - processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the - ``smp.step``-decorated function when the partition is done. - - **Parameters** - - - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism). - - - ``trace_device`` (``"cpu"`` or ``"gpu"``) (default: ``"gpu"``) - Whether to perform the tracing step on the GPU or CPU. The tracing step gathers - information on the order of execution of modules, the shapes of - intermediate outputs, and execution times, to be used by the - partitioning algorithm. If ``trace_device`` is set to GPU, accurate - module execution times can be gathered during tracing for potentially - improved partitioning decision. However, if the model is too large to - fit in a single GPU, then ``trace_device`` should be set to CPU. - - - ``trace_execution_times`` (``bool``) (default: ``False``): If ``True``, - the library profiles the execution time of each module during tracing, and uses - it in the partitioning decision. This improves the partitioning - decision, but it might make the tracing slower. 
It may also introduce - some degree of non-determinism in partitioning results, because of the - inherent randomness in module execution times. Must be ``False`` if - ``trace_device`` is ``"cpu"``. - - - ``overlapping_allreduce`` (``bool``) (default: ``True``): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` while launching training). The library uses this flag - to decide whether to do overlapping allreduce whenever a parameter - gradients are ready. This leads to overlapping of communication and - computation and can improve performance. If this is set to ``False`` , - allreduce is performed at the end of the step. - - - ``backward_passes_per_step`` (``int``) (default: 1): This is only - applicable for hybrid data parallelism/model parallelism use cases (when - ``ddp`` is set to ``True`` in config). This parameter indicates the - number of backward passes to perform before calling allreduce on DDP. - This allows accumulating updates over multiple mini-batches before - reducing and applying them. - - - ``average_grads_across_microbatches`` (``bool``) (default: ``True``): - Whether or not the computed gradients should be averaged across - microbatches. If ``False``, the computed gradients will be summed across - microbatches, but not divided by the number of microbatches. In typical - use case where the computed loss is averaged over the mini-batch, this - should be left as ``True``. If you use a loss function that only sums - the per-sample loss across the batch (and not divide by the batch size), - then this must be set to ``False`` for correctness. - - - ``bucket_cap_mb`` (default: 25): \ ``DistributedDataParallel`` buckets - parameters into multiple buckets so that gradient reduction of each - bucket can potentially overlap with backward - computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes - (MB). - - - ``trace_memory_usage`` (default: False): When set to True, the library attempts - to measure memory usage per module during tracing. If this is disabled, - memory usage will be estimated through the sizes of tensors returned from - the module. - - - ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``. - This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper. - Please see: `broadcast_buffer `__. - - - ``gradient_as_bucket_view`` (default: False): To be - used with ``ddp=True``. This parameter is forwarded to the underlying - ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view `__. - - **Properties** - - - ``partitioned``: Is ``True`` if the model is partitioned, ``False`` - otherwise. Initialized to ``False`` when ``DistributedModel`` is first - created. It becomes be ``True`` during the first call - to ``smp.step``-decorated function. Once the model is partitioned, the - local parameters or local ``state_dict`` can be fetched using the - following methods. - - **Methods** - - .. function:: backward(tensors, grad_tensors) - :noindex: - - Triggers a distributed backward - pass across model partitions. Example usage provided in the previous - section. The API is very similar - to https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward. - ``retain_grad`` and ``create_graph``  flags are not supported. - - .. function:: local_buffers( ) - :noindex: - - Returns an iterator over buffers for the modules in - the partitioned model that have been assigned to the current process. - - .. 
function:: local_named_buffers( ) - :noindex: - - Returns an iterator over buffers for the - modules in the partitioned model that have been assigned to the current - process. This yields both the name of the buffer as well as the buffer - itself. - - .. function:: local_parameters( ) - :noindex: - - Returns an iterator over parameters for the - modules in the partitioned model that have been assigned to the current - process. - - .. function:: local_named_parameters( ) - :noindex: - - Returns an iterator over parameters for - the modules in the partitioned model that have been assigned to the - current process. This yields both the name of the parameter as well as - the parameter itself. - - .. function:: local_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. - - .. function:: local_named_modules( ) - :noindex: - - Returns an iterator over the modules in the - partitioned model that have been assigned to the current process. This - yields both the name of the module as well as the module itself. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains local - parameters that belong to the current \ ``mp_rank``. This ``state_dict`` - contains a key \ ``_smp_is_partial`` to indicate this is a - partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains parameters - for the entire model. It first collects the \ ``local_state_dict``  and - gathers and merges the \ ``local_state_dict`` from all ``mp_rank``\ s to - create a full ``state_dict``. Please note that this needs to be called on all ranks with - ``dp_rank()==0`` to ensure the gather happens properly. - If it is only called on all such ranks, it can hang. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.module.load_state_dict()`` , - except: It first gathers and merges the ``state_dict``\ s across - ``mp_rank``\ s, if they are partial. The actual loading happens after the - model partition so that each rank knows its local parameters. - - .. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. Returns a ``RemovableHandle`` object ``handle``, - which can be used to remove the hook by calling ``handle.remove()``. - - .. function:: cpu( ) - :noindex: - - Allgathers parameters and buffers across all ``mp_rank``\ s and moves them - to the CPU. - - .. function:: join( ) - :noindex: - - A context manager to be used in conjunction with an instance of - ``smp.DistributedModel`` to be able to train with uneven inputs across - participating processes. This is only supported when ``ddp=True``. This will use the join with the wrapped - ``DistributedDataParallel`` instance. For more information, see: - `join `__ - in the PyTorch documentation. - - .. 
function:: register_comm_hook( state, callable ) - :noindex: - - **Available for PyTorch 1.8.1 only** - Registers a communication hook which is an enhancement that provides - a flexible hook ``callable`` to users where they can specify how - gradients are aggregated across multiple workers. This method will be called on the wrapped ``DistributedDataParallel`` instance. - - Please note that when you register a comm hook you have full control of how the gradients are processed. - When using only data parallelism with Torch DDP you are expected to average grads across data parallel replicas within the hook. - Similarly, when using DistributedModel you have to averaging grads across data parallel replicas within the hook. - In addition to that, you also have to average grads across microbatches within the hook unless you explicitly desire to not average based on your loss function. - See ``average_grads_across_microbatches`` for more information about averaging grads across microbatches. - - This is only supported when ``ddp=True`` and ``overlapping_allreduce=True`` (default). - For more information, see: - `register_comm_hook `__ - in the PyTorch documentation. - - **Behavior of** ``smp.DistributedModel`` **with Tensor Parallelism** - - When a model is wrapped by ``smp.DistributedModel``, the library - immediately traverses the modules of the model object, and replaces the - modules that are supported for tensor parallelism with their distributed - counterparts. This replacement happens in place. If there are no other - references to the original modules in the script, they are - garbage-collected. The module attributes that previously referred to the - original submodules now refer to the distributed versions of those - submodules. - - **Example:** - - .. code:: python - - # register DistributedSubmodule as the distributed version of Submodule - # (note this is a hypothetical example, smp.nn.DistributedSubmodule does not exist) - smp.tp_register_with_module(Submodule, smp.nn.DistributedSubmodule) - - class MyModule(nn.Module): - def __init__(self): - ... - - self.submodule = Submodule() - ... - - # enabling tensor parallelism for the entire model - with smp.tensor_parallelism(): - model = MyModule() - - # here model.submodule is still a Submodule object - assert isinstance(model.submodule, Submodule) - - model = smp.DistributedModel(model) - - # now model.submodule is replaced with an equivalent instance - # of smp.nn.DistributedSubmodule - assert isinstance(model.module.submodule, smp.nn.DistributedSubmodule) - - If ``pipeline_parallel_degree`` (equivalently, ``partitions``) is 1, the - placement of model partitions into GPUs and the initial broadcast of - model parameters and buffers across data-parallel ranks take place - immediately. This is because it does not need to wait for the model - partition when ``smp.DistributedModel`` wrapper is called. For other - cases with ``pipeline_parallel_degree`` greater than 1, the broadcast - and device placement will be deferred until the first call of an - ``smp.step``-decorated function happens. This is because the first - ``smp.step``-decorated function call is when the model partitioning - happens if pipeline parallelism is enabled. - - Because of the module replacement during the ``smp.DistributedModel`` - call, any ``load_state_dict`` calls on the model, as well as any direct - access to model parameters, such as during the optimizer creation, - should be done **after** the ``smp.DistributedModel`` call. 
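For example, the following is a minimal ordering sketch; ``MyModel``, the checkpoint path, and the choice of ``torch.optim.Adam`` are illustrative placeholders rather than part of the library API. The point it shows is that the optimizer is created and the checkpoint is loaded only after ``smp.DistributedModel`` has replaced the supported submodules.

   .. code:: python

      import torch
      import smdistributed.modelparallel.torch as smp

      smp.init()

      model = MyModel()                    # hypothetical user-defined module
      model = smp.DistributedModel(model)  # module replacement happens here

      # Create the optimizer after wrapping, so it references the
      # (possibly replaced) distributed parameters.
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
      optimizer = smp.DistributedOptimizer(optimizer)

      # Any load_state_dict call also comes after the wrapper.
      checkpoint = smp.load("/checkpoint.pt", partial=True)
      model.load_state_dict(checkpoint["model_state_dict"])
      optimizer.load_state_dict(checkpoint["optimizer_state_dict"])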
- - Since the broadcast of the model parameters and buffers happens - immediately during ``smp.DistributedModel`` call when the degree of - pipeline parallelism is 1, using ``@smp.step`` decorators is not - required when tensor parallelism is used by itself (without pipeline - parallelism). - - For more information about the library's tensor parallelism APIs for PyTorch, - see :ref:`smdmp-pytorch-tensor-parallel`. - - **Additional Methods of** ``smp.DistributedModel`` **for Tensor Parallelism** - - The following are the new methods of ``smp.DistributedModel``, in - addition to the ones listed in the - `documentation `__. - - .. function:: distributed_modules() - :noindex: - - - An iterator that runs over the set of distributed - (tensor-parallelized) modules in the model - - .. function:: is_distributed_parameter(param) - :noindex: - - - Returns ``True`` if the given ``nn.Parameter`` is distributed over - tensor-parallel ranks. - - .. function:: is_distributed_buffer(buf) - :noindex: - - - Returns ``True`` if the given buffer is distributed over - tensor-parallel ranks. - - .. function:: is_scaled_batch_parameter(param) - :noindex: - - - Returns ``True`` if the given ``nn.Parameter`` is operates on the - scaled batch (batch over the entire ``TP_GROUP``, and not only the - local batch). - - .. function:: is_scaled_batch_buffer(buf) - :noindex: - - - Returns ``True`` if the parameter corresponding to the given - buffer operates on the scaled batch (batch over the entire - ``TP_GROUP``, and not only the local batch). - - .. function:: default_reducer_named_parameters() - :noindex: - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``DP_GROUP``. - - .. function:: scaled_batch_reducer_named_parameters() - :noindex: - - - Returns an iterator that runs over ``(name, param)`` tuples, for - ``param`` that is allreduced over the ``RDP_GROUP``. - - - -.. class:: smp.DistributedOptimizer - :noindex: - - **Parameters** - - ``optimizer`` - - An optimizer wrapper for saving/loading optimizer states. This wrapper - returns ``optimizer`` with the following methods overridden: - - .. function:: state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains optimizer state for the entire model. - It first collects the ``local_state_dict`` and gathers and merges - the ``local_state_dict`` from all ``mp_rank``s to create a full - ``state_dict``. - - .. function:: load_state_dict( ) - :noindex: - - Same as the ``torch.optimizer.load_state_dict()`` , except: - - - It first gathers and merges the local ``state_dict``\ s if they are - partial. - - The actual loading happens after the model partition so that each - rank knows its local parameters. - - .. function:: local_state_dict( ) - :noindex: - - Returns the ``state_dict`` that contains the - local optimizer state that belongs to the current \ ``mp_rank``. This - ``state_dict`` contains a key \ ``_smp_is_partial`` to indicate this is - a partial \ ``state_dict``, which indicates whether the - ``state_dict`` contains elements corresponding to only the current - partition, or to the entire model. - - ​ -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (int) - The index of the partition. - - A context manager which places all modules defined inside into the - partition with ID ``index``.  The ``index`` argument must be less than - the number of partitions. - - Use ``smp.partition`` to implement manual partitioning. 
- If ``"auto_partition"`` is ``True``, then the - ``smp.partition`` contexts are ignored. Any module that is not placed in - any ``smp.partition`` context is placed in the - ``default_partition`` defined through the SageMaker Python SDK. - - When ``smp.partition`` contexts are nested, the innermost context - overrides the rest (see the following example). In PyTorch, manual - partitioning should be done inside the module \ ``__init__``, and the - partition assignment applies to the modules that are *created* inside - the ``smp.partition`` context. - - Example: - - .. code:: python - - class Model(torch.nn.Module): -     def __init__(self): -         with smp.partition(1): -             self.child0 = Child0()            # child0 on partition 1 -             with smp.partition(2): -                 self.child1 = Child1()        # child1 on partition 2 -             self.child2 = Child2()            # child2 on partition 1 -         self.child3 = Child3()                # child3 on default_partition - -.. function:: smp.get_world_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of all - processes, which can be used with the ``torch.distributed`` API. - Requires ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_mp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``MP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.get_dp_process_group( ) - :noindex: - - Returns a ``torch.distributed`` ``ProcessGroup`` that consists of the - processes in the ``DP_GROUP`` which contains the current process, which - can be used with the \ ``torch.distributed`` API. Requires - ``"ddp": True`` in SageMaker Python SDK parameters. - -.. function:: smp.is_initialized( ) - :noindex: - - Returns ``True`` if ``smp.init`` has already been called for the - process, and ``False`` otherwise. - -.. function::smp.is_tracing( ) - :noindex: - :noindex: - - Returns ``True`` if the current process is running the tracing step, and - ``False`` otherwise. - -.. data:: smp.nn.FusedLayerNorm - :noindex: - - `Apex Fused Layer Norm `__ is currently not - supported by the library. ``smp.nn.FusedLayerNorm`` replaces ``apex`` - ``FusedLayerNorm`` and provides the same functionality. This requires - ``apex`` to be installed on the system. - -.. data:: smp.optimizers.FusedNovoGrad - :noindex: - - `Fused Novo Grad optimizer `__ is - currently not supported by the library. ``smp.optimizers.FusedNovoGrad`` replaces ``apex`` ``FusedNovoGrad`` - optimizer and provides the same functionality. This requires ``apex`` to - be installed on the system. - -.. data:: smp.optimizers.FusedLamb - :noindex: - - `FusedLamb optimizer `__ - currently doesn’t work with the library. ``smp.optimizers.FusedLamb`` replaces - ``apex`` ``FusedLamb`` optimizer and provides the same functionality. - This requires ``apex`` to be installed on the system. - -.. data:: smp.amp.GradScaler - :noindex: - - `Torch AMP Gradscaler `__ - currently doesn’t work with the library. ``smp.amp.GradScaler`` replaces - ``torch.amp.GradScaler`` and provides the same functionality. - -.. _pytorch_saving_loading: - :noindex: - -APIs for Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. function:: smp.save( ) - :noindex: - - Saves an object. 
This operation is similar to ``torch.save()``, except - it has an additional keyword argument, ``partial``, and accepts only - string type for the argument ``f`` (file). If ``partial=True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an ``mp_rank`` - index to your saved file. - - **Parameters** - - - ``obj`` (dict): The object to be saved. - - ``f`` (str): A string containing a file name. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` saves a separate checkpoint file and the library adds an - ``mp_rank`` index to the saved file. If you want to be able to load - and further train a model that you save with ``smp.save()``, you must - set ``partial=True``. - - ``pickle_module`` (pickle module, default = module ``"pickle"`` from ``"/opt/conda/lib/python3.6/pickle.py"``): - A module used for pickling metadata and objects. - - ``pickle_protocol`` (int, default=2): Can be specified to - override the default protocol. - -.. function:: smp.load( ) - :noindex: - - Loads an object saved with ``smp.save()`` from a file. - - Similar to `torch.load() `__, - except it has an additional keyword argument, ``partial``, and accepts - only string type for the argument ``f`` (file). If \ ``partial=True``, - then each ``mp_rank`` loads a separate checkpoint file. - - **Parameters** - - - ``f`` (string): A string containing a file name. - - ``map_location`` (function): A function, - `torch.device `__, - a string, or a dict specifying how to remap storage locations. - - ``pickle_module`` (pickle module): A module used for unpickling - metadata and objects (has to match the \ ``pickle_module``\ used to - serialize the file). - - ``pickle_load_args`` (Python 3 only): Optional keyword arguments - passed to ``pickle_module.load()`` and ``pickle_module.Unpickler()``. - - ``partial`` (bool, default= ``True``): When set to ``True``, each - ``mp_rank`` loads the checkpoint corresponding to the ``mp_rank``. - Should be used when loading a model trained with the library. - -.. _pytorch_saving_loading_instructions: - :noindex: - -General Instruction For Saving and Loading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The library can save partial or full checkpoints. - -- For partial checkpoints, each ``mp_rank`` saves its own checkpoint - file with only the parameters that belong to that rank. -- For full checkpoints, the library saves a single checkpoint that contains - the entire model parameters. - -When **saving** using ``smp.save()``, each rank only holds its own -parameters. If you want to save the full model, there will be some -communication between the ranks to create the full model. If you save -checkpoints often, you should save partial checkpoints for best -performance. - -When **loading** using ``smp.load()``, the library can load either partial -or full checkpoints, or full checkpoints saved by a non-model-parallel model. If you -want to resume training with a non-model-parallel model or do inference, you need -a full checkpoint. - -The following is an example of how you can save and load a checkpoint: - -.. code:: python - - # Original model and optimizer - model = MyModel(...) - optimizer = MyOpt(...)
- - # model parallel wrapper - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - - # To save, always save on dp_rank 0 to avoid data racing - if partial: -     # To save the partial model on each mp rank -     # the library will create `checkpoint.pt_{mprank}` for each mp rank -     if save_partial_model: -         if smp.dp_rank() == 0: -             model_dict = model.local_state_dict() # save the partial model -             opt_dict = optimizer.local_state_dict() # save the partial optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 f"/checkpoint.pt", -                 partial=True, -             ) - -     # To save the full model -     if save_full_model: -         if smp.dp_rank() == 0: -             model_dict = model.state_dict() # save the full model -             opt_dict = optimizer.state_dict() # save the full optimizer state -             smp.save( -                 {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict}, -                 "/checkpoint.pt", -                 partial=False, -             ) - - # To load, load on all ranks. - # The only difference for partial/full loading is the partial flag in smp.load - # Load partial checkpoint - if partial_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=True) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) - # Load full checkpoint - if full_checkpoint: -    checkpoint = smp.load("/checkpoint.pt", partial=False) -    model.load_state_dict(checkpoint["model_state_dict"]) -    optimizer.load_state_dict(checkpoint["optimizer_state_dict"]) diff --git a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_pytorch_tensor_parallel.rst b/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_pytorch_tensor_parallel.rst deleted file mode 100644 index c66595ddf2..0000000000 --- a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_pytorch_tensor_parallel.rst +++ /dev/null @@ -1,876 +0,0 @@ -.. _smdmp-pytorch-tensor-parallel: - :noindex: - -PyTorch API for Tensor Parallelism -================================== - -SageMaker distributed tensor parallelism works by replacing specific submodules -in the model with their distributed implementations. The distributed modules -have their parameters and optimizer states partitioned across tensor-parallel -ranks. This is to compute the same output as it would have been computed by -the original modules. Since tensor parallelism occurs across data-parallel -ranks, a rank might collect slices of the activations corresponding to the -data shards on other devices that are part of the same tensor parallelism group. - -You can enable or disable tensor parallelism for specific parts of the model. -Within the enabled parts, the replacements with distributed modules will take -place on a best-effort basis for those module supported for tensor parallelism. -Alternatively, you can directly import and use the library’s distributed -modules in the model definition. - -Some of the supported modules (such as ``smp.nn.Transformer``) are high-level -blocks that contain many operations. Because custom implementations -(as opposed to the built-in PyTorch modules) are typically used for these -high-level blocks, the library offers an API that you can use to register -specific distributed versions with such custom modules (provided that they -are functionally equivalent). 
This allows the library to automatically replace -the occurrences of such PyTorch modules with their distributed counterparts -provided by the library. -For more information, see the following topics. - -.. contents:: Topics - :depth: 3 - :local: - -.. _registering-tp-modules: - :noindex: - -Registering Tensor Parallelism Distributed Modules --------------------------------------------------- - -Although PyTorch natively provides some of the commonly used (and -tensor-parallelizable) building blocks such as Transformer, users often -use custom implementations for such higher-level modules. To distribute -such modules with tensor parallelism, you need to register the -distributed modules to the custom module implementation in your class, -so that the library knows how to distribute the custom module. When you -register the distributed modules, make sure the custom module that you -use is functionally equivalent to the distributed module. You can verify -this by taking a look at the equivalent reference implementations in the -:ref:`smdmp-tp-appendix`. -These implementations are functionally equivalent to their distributed -versions in ``smp.nn`` module. - -.. decorator:: @smp.tp_register(dist_module, init_hook=None, forward_hook=None, return_hook=None) - - - A class decorator that registers the ``dist_module`` class with - the module class that it is attached to. The hooks can be used to - adapt to different interfaces used with ``__init__`` and - ``forward`` methods. - - **Arguments:** - - - ``dist_module``: A subclass of ``smp.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smp.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. - - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. - - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer - # as the distributed version of MyTransformer - @smp.tp_register(smp.nn.DistributedTransformer, init_hook=init_hook) - class MyTransformer(nn.Module): - def __init__(self, config): - ... - - def forward(self, hidden_states, attention_mask): - ... - -.. 
function:: smp.tp_register_with_module(module_cls, dist_module, init_hook=None, forward_hook=None, return_hook=None) - :noindex: - - - When you do not have direct access to model definition code, you - can use this API to similarly register a distributed module with - an existing module class. - - - **Arguments:** - - - ``module_cls``: The existing module class that will be - distributed. - - ``dist_module``: A subclass of ``smp.nn.DistributedModule`` - that implements the distributed version of the module class the - decorator is attached to. Any distributed module class defined - in ``smp.nn`` module can be used. - - ``init_hook``: A callable that translates the arguments of the - original module ``__init__`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``__init__`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``__init__`` method (including argument order and default - values), except it must exclude ``self``. - - ``forward_hook``: A callable that translates the arguments of - the original module ``forward`` method to an ``(args, kwargs)`` - tuple compatible with the arguments of the corresponding - distributed module ``forward`` method. Must return a tuple, - whose first element is an iterable representing the positional - arguments, and second element is a ``dict`` representing the - keyword arguments. The input signature of the ``init_hook`` - must **exactly** match the signature of the original - ``forward`` method (including argument order and default - values), except it must exclude ``self``. - - ``return_hook``: A callable that translates the object returned - from the distributed module to the return object expected of - the original module. - - - **Example:** - - .. code:: python - - from somelibrary import MyTransformer - - init_hook = lambda config: ((), config.to_dict()) - - # register smp.nn.DistributedTransformer as the distributed version of MyTransformer - smp.tp_register_with_module(MyTransformer, - smp.nn.DistributedTransformer, - init_hook=init_hook) - -.. _smdmp-supported-modules-for-tp: - :noindex: - -Supported Modules for Tensor Parallelism ----------------------------------------- - -The following modules are supported for tensor -parallelism. - -- ``smp.nn.DistributedLinear`` (implements ``nn.Linear``) -- ``smp.nn.DistributedTransformerLMHead`` -- ``smp.nn.DistributedTransformer`` -- ``smp.nn.DistributedTransformerLayer`` -- ``smp.nn.DistributedAttentionLayer`` -- ``smp.nn.DistributedTransformerOutputLayer`` -- ``smp.nn.DistributedEmbedding`` - -.. contents:: Topics - :depth: 3 - :local: - -.. _tp-module-api: - :noindex: - -Tensor Parallelism Module APIs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. class:: smp.nn.DistributedLinear(in_features, out_features) - :noindex: - - - Tensor-parallel implementation of the ``nn.Linear`` class. - Functionally equivalent to an ``nn.Linear`` module with the same - ``in_features`` and ``out_features``. In other words, - ``in_features`` and ``out_features`` are the number of *global* - channels across tensor-parallel ranks. - - **Arguments:** - - - ``in_features``: The total number of input channels for the - linear layer across all tensor-parallel ranks. 
- - ``out_features``: The total number of output channels for the - linear layer across all tensor-parallel ranks. - -.. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - - Constructs a distributed transformer model, including embeddings - and a single LM head. A word embedding of size - ``(vocab_size, hidden_size)`` is created, as well as a positional - embedding of size ``(num_positions, hidden_size)``, and the - embeddings are added together. If ``num_token_types`` is larger - than 0, a separate embedding of size - ``(num_token_types, hidden_size)`` is created, and further added - on top. - - The embeddings are fed through a ``DistributedTransformer``, and - if ``add_lm_head`` is ``True``, the output passes through a single - LM head, which is a linear module without bias whose weight is - tied to the word embeddings. - - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the rest - of the arguments. - - **Methods:** - - - ``forward(self, inputs)`` - - - If ``add_cross_attention`` is ``True``, ``inputs`` must be a - tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, cross_states, cross_states, cross_mask, labels)``. - - Otherwise, ``inputs`` must be a tuple - ``(input_ids, attention_mask, token_type_ids, position_ids, labels)``. - - If ``token_type_ids`` is ``None``, token type embedding will - not be used. - - ``input_ids`` is assumed to be of shape ``[N, S]``, where - ``N`` is the batch size and ``S`` is sequence length. - - ``attention_mask`` is assumed to be a 0-1 tensor of shape - ``[N, S]``, where 1 represents a masked position. - -.. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - - A sequence of ``smp.nn.DistributedTransformerLayer``\ s, whose - number is given by ``num_layers`` argument. For the other - arguments and methods, refer to - ``smp.nn.DistributedTransformerLayer``. - - If both ``pre_layernorm`` and ``post_layernorm`` are ``True``, - layer normalization is applied to both the input and the output of - the ``DistributedTransformer``, in addition to the intermediate - attention and transformer-output layers. - -.. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True) - :noindex: - - - Tensor-parallel implementation of a single transformer layer. - Number of attention heads, hidden size, and intermediate size - refer to the global quantities across all tensor-parallel ranks. 
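As a rough usage sketch (assuming ``smp.init()`` has already been run with a suitable tensor-parallel configuration, and with purely illustrative shapes and hyperparameters), the layer consumes and returns a ``(hidden_states, attention_mask)`` tuple, as described in the argument and method descriptions that follow:

   .. code:: python

      import torch
      import smdistributed.modelparallel.torch as smp

      layer = smp.nn.DistributedTransformerLayer(
          num_attention_heads=32,
          attention_head_size=32,
          hidden_size=1024,
          intermediate_size=4096,
      )

      hidden_states = torch.rand(8, 512, 1024)    # [N, S, H]
      attention_mask = torch.zeros(8, 1, 1, 512)  # [N, 1, 1, S]

      # forward takes and returns a (hidden_states, attention_mask) tuple
      hidden_states, attention_mask = layer((hidden_states, attention_mask))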
- - **Arguments:** - - - ``num_attention_heads``: The total number of attention heads - across tensor-parallel ranks - - ``attention_head_size``: The number of channels of a single - attention head. - - ``hidden_size``: The hidden dimension of the transformer. The - input tensor ``hidden_states`` is assumed to have its last - dimension size equal to ``hidden_size``. - - ``intermediate_size``: The number of output channels in the - first linear transformation of the transformer output layer. - ``DistributedTransformerOutputLayer`` first maps - ``hidden_size`` dimensions of its input tensor into - ``intermediate_size`` dimensions, and then maps it back into - ``hidden_size`` dimensions. - - ``attention_dropout_prob``: The dropout probability applied to - the attention probabilities. - - ``hidden_dropout_prob``: The dropout probability used in - dropout layers other than the one applied to the attention - probabilities. - - ``activation``: Choice of activation function to use at the - output layer. Must be ``"gelu"`` or ``"relu"``. - - ``layernorm_epsilon``: The epsilon added to the denominator of - layer normalization for numerical stability. - - ``initializer_range``: If ``use_normal_initialization`` is - ``True``, the standard deviation of the normal random variable - to initialize the weights with. - - ``use_normal_initialization``: If ``True``, the weights are - initialized with normal distribution with standard deviation - given by ``initializer_range``. Otherwise, default PyTorch - initialization is used. - - ``causal_mask_size``: If ``None``, no causal mask is used on - attentions. Otherwise, should be set to maximum sequence length - to apply a causal mask to the attention scores. This is used, - for instance, in GPT-2. - - ``add_cross_attention``: If ``True``, a cross-attention layer - will be added after the self-attention block. The - cross-attention layer computes the attention keys and values - based on the ``cross_states`` input (instead of - ``hidden_states`` input, as in self-attention. This is used in - the decoder block of encoder-decoder architectures. For - encoder-only architectures that only use self-attention, this - should be kept ``False``. - - ``pre_layernorm``: If ``True``, inserts layer normalization at - the input. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - ``post_layernorm``: If ``True``, inserts layer normalization at - the output. At least one of ``pre_layernorm`` and - ``post_layernorm`` must be ``True``. - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the transformer - layer. - - - **Arguments:** - - - If ``add_cross_attention=False``, ``inputs`` must be a - tuple ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, where ``N`` is the batch - size, and ``S`` is the sequence length. - - If ``add_cross_attention=True``, ``inputs`` must be a - tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is assumed to be a tensor of - dimensions ``[N, S_1, H]``, where ``N`` is batch size, - ``S_1`` is sequence length, and ``H`` is ``hidden_size``. - ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. 
- ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_1]``, where ``N`` is the batch - size, and ``S_1`` is the sequence length, and - ``cross_mask`` is assumed to be a tensor of size - ``[N, 1, 1, S_2]``. Keys and values for the attention - heads in the cross-attention layer (but not the - self-attention layer) are computed using - ``cross_states``, and ``cross_mask`` is applied as the - attention mask in the cross-attention layer (but not the - self-attention layer). - - - **Returns:** - - - If ``add_cross_attention=False``, a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is the output of the transformer, and - ``attention_mask`` is the same the ``attention_mask`` - argument. - - If ``add_cross_attention=True``, a tuple - ``(hidden_states, cross_states, attention_mask, cross_mask)``, - where ``hidden_states`` is the output of the transformer, - and the next three tensors are the same as the input - arguments. - -.. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True) - :noindex: - - - A distributed implementation for the attention block. Includes the - computation of the self- or cross-attention (context layer), - followed by a linear mapping and dropout, which is optionally - followed by the residual-connection and layer normalization. - - **Arguments:** - - - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the - arguments. - - ``cross_attention``: If ``True``, it computes the attentions - with respect to the ``cross_states`` tensor of the ``forward`` - method input tuple. (Default: ``False``) - - - **Methods:** - - - ``forward(self, inputs)``: Forward pass for the attention - layer. - - - **Arguments:** - - - If ``cross_attention=False``, ``inputs`` must be a tuple - ``(hidden_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S, H]``, where ``N`` is batch size, ``S`` is - sequence length, and ``H`` is ``hidden_size``. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S]``, where ``N`` is the - batch size, and ``S`` is the sequence length. - - If ``cross_attention=True``, ``inputs`` must be a tuple - ``(hidden_states, cross_states, attention_mask)``, where - ``hidden_states`` is assumed to be a tensor of dimensions - ``[N, S_1, H]``, where ``N`` is batch size, ``S_1`` is - sequence length, and ``H`` is ``hidden_size``. - ``cross_states`` is assumed to be a tensor of size - ``[N, S_2, H]``, similarly interpreted. - ``attention_mask`` is assumed to be a tensor of - dimensions ``[N, 1, 1, S_2]``, where ``N`` is the batch - size, and ``S_2`` is the sequence length. Keys and values - for the attention heads are computed using - ``cross_states``. - - - **Returns:** - - - A single tensor that is the output of the attention - layer. - -.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False) - :noindex: - - - Distributed implementation of a single transformer output layer. 
A - single :class:`smp.nn.DistributedTransformerLayer` with - ``add_cross_attention=False`` consists of a single - ``DistributedAttentionLayer`` immediately followed by a single - ``DistributedTransformerOutputLayer``. The latter linearly maps - the last channel of the input tensor from ``hidden_size`` to - ``intermediate_size``, and then maps it back to ``hidden_size``. - - **Arguments:** - - - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the - arguments. - - ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow - (NaN loss values) for large models with more than 100 billion parameters - when using FP16. (Default: ``False``) - -.. class:: smp.nn.DistributedEmbedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False, _skip_scatter_and_merge=False) - :noindex: - - - Distributed implementation of a single embedding layer. Currently - only supports splitting across the ``embedding_dim`` dimension. - - **Arguments:** - - - See :class:`smp.nn.DistributedTransformerLayer` for a description of - ``initializer_range``; the remaining arguments mirror those of - PyTorch's ``torch.nn.Embedding``. - -.. _enabling-tp: - :noindex: - -Enabling Tensor Parallelism -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -There are two ways tensor parallelism can be enabled. - -First, you can use - the distributed module implementations in the ``smp.nn`` module directly in - your model definition. See :ref:`smdmp-supported-modules-for-tp` - for a complete list of built-in distributed modules. Here is an example - of how this can be done: - -.. code:: python - - import torch.nn as nn - import smdistributed.modelparallel.torch as smp - - class TransformerModel(nn.Module): - def __init__(self): - super().__init__() - self.embedding = nn.Embedding(vocab_size, hidden_size) - - # directly instantiate smp.nn.DistributedTransformer and use it - self.encoder = smp.nn.DistributedTransformer(num_layers, hidden_size, **kwargs) - - self.pooler = nn.Linear(hidden_size, hidden_size) - - def forward(self, hidden_states): - emb_out = self.embedding(hidden_states) - enc_out = self.encoder(emb_out) - return self.pooler(enc_out) - -Second, you can enable tensor parallelism for specific modules or blocks - of code, which will automatically enable tensor parallelism for the - supported modules within that scope. To do this, you can use the - following API: - -.. decorator:: smp.tensor_parallelism(enabled=True, **kwargs) - :noindex: - - - A context manager that enables or disables tensor parallelism for - any supported module that is created inside. If there are nested - contexts, the innermost overrides the rest. If there are - multiple supported modules created within the context, where one - is the submodule of the other, only the outermost module will be - distributed. If a supported module shares weights with another - (supported or unsupported) module, or if its hyperparameters do - not support distribution (e.g., not divisible by the tensor - parallelism degree), tensor parallelism will **not** be enabled - for this module even if this API is used. - - **Example:** - - .. code:: python - - with smp.tensor_parallelism(): - self.m0 = nn.Linear(20, 20) # will be distributed - with smp.tensor_parallelism(enabled=False): - self.m1 = nn.Linear(20, 20) # will not be distributed - - - ``kwargs`` - Keyword arguments that can be used to modify the configurations of - the distributed modules created inside the context.
- If a keyword argument provided through it matches any ``__init__`` method arguments - of a ``DistributedModule`` that substitutes a module created inside - the ``smp.tensor_parallelism`` context, this keyword will override - the value defined in the ``init_hook``. - - - (*For v1.7.0 and later*) Through the following additional keyword arguments, - the library supports `NVIDIA Megatron’s fused kernels - `_: - - - ``fused_softmax`` (bool) - Fusion of attention masking and softmax. - By default, it is set to ``True``. You can deactivate it by setting - ``fused_softmax=False`` in the ``smp.tensor_parallelism`` context manager. - - ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation. - By default, it is set to ``False``. You can activate it by setting - ``fused_bias_gelu=True`` in the ``smp.tensor_parallelism`` context manager. - - - -.. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs) - :noindex: - - - Enables or disables tensor parallelism for the supported - submodules of ``module``. If enabling, the outermost supported - modules will be distributed. If disabling, tensor parallelism will - be disabled for the entire module subtree of ``module``. Unlike - the context manager, this API can be used after the model creation - (but before wrapping with :class:`smp.DistributedModel`), so direct - access to model definition code is not required. If a supported - module shares weights with another (supported or unsupported) - module, or if its hyperparameters do not support distribution - (e.g., not divisible by the tensor parallelism degree), tensor - parallelism will **not** be enabled for this module. - - Keyword arguments ``kwargs`` can be used to modify the - configurations of the distributed modules created as a result of - this call. If a keyword argument provided here matches any - ``__init__`` method arguments of a ``DistributedModule`` that - substitutes a module created inside the ``smp.tensor_parallelism`` - context, this keyword will override the value defined in the - ``init_hook``. - - **Example:** - - .. code:: python - - model = MyModel() - smp.set_tensor_parallelism(model.encoder, True) - smp.set_tensor_parallelism(model.encoder.embedding, False) - - # outermost supported submodules in model.encoder will be distributed, except for - # model.encoder.embedding - model = smp.DistributedModel(model) - optimizer = smp.DistributedOptimizer(optimizer) - -.. _activation-checkpointing-api: - :noindex: - -Activation Checkpointing APIs ------------------------------ - -``smdistributed.modelparallel`` provides three APIs to enable - activation checkpointing: one for checkpointing modules, - one for checkpointing sequential modules, and - one for checkpointing pretrained models. - -For a conceptual guide and examples, see - `Activation Checkpointing `_ - in the *SageMaker's Distributed Model Parallel developer guide*. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint(module, *args, preserve_rng_state=True) - :noindex: - - - Checkpoints the module passed. Throws an error if, during manual - partitioning, all children of the module are not on the same rank as the - module itself, i.e., if the module tree is split across multiple - partitions. During auto-partitioning, if the module is split - across multiple partitions, then this call is ignored (with a - warning). Note that this call applies to the module instance only, - not to the module class. - - - **Arguments:** - - - ``module (Instance of nn.Module)``: The module to be - checkpointed.
Note that, unlike PyTorch's native checkpointing, activation checkpointing in - ``smdistributed.modelparallel`` is at the granularity of a - module. A generic function cannot be passed here. - - ``args``: Tuple containing inputs to the module. - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` to omit stashing and - restoring the RNG state during each checkpoint. - -.. class:: smdistributed.modelparallel.torch.patches.checkpoint.checkpoint_sequential(sequential_module, input, strategy="each", preserve_rng_state=True, pack_args_as_tuple=False) - :noindex: - - - Checkpoints the modules inside - `nn.Sequential `__. - This can be used even if different layers that are part of the - sequential container lie on different partitions. Each layer of - the sequential module that is checkpointed must lie completely - within one partition. If this is not the case during manual - partitioning, then an error will be thrown. If this is not the - case during auto partitioning, a warning will be raised and this - module will be run without checkpointing. - - - **Arguments** - - - ``sequential_module (nn.Sequential)``: The sequential module to - be checkpointed. - - ``input (torch.Tensor or a tuple of torch.Tensors)``: Input to - the module, which can be a tensor or a tuple of tensors. If a - tuple is passed, then ``pack_args_as_tuple`` should be set to ``True``. - - ``strategy (string, default="each")``: Determines how many layers - of the sequential module are grouped together for one - checkpointing call. This determines how much - memory can be reduced. It can take the following values: - - - ``each``: The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example, if a sequential consists of - ``[a, b, c, d]`` where ``a, b`` are on ``pp_rank 0`` and ``c, d`` are on - ``pp_rank 1``, then this strategy checkpoints ``a, b`` together - and then ``c, d`` together. This means that, effectively, the inputs of ``a``, - outputs of ``b``, inputs of ``c``, and outputs of ``d`` are in memory; - the remaining activations are recomputed. - - ``group_2``, ``group_3``, ``group_4``, and so on: More generally, - ``group_x`` where ``x`` is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups ``x`` layers together on a best-effort basis: - it can group ``x`` layers together if there are ``x`` layers - consecutively on the same partition. For example, given - ``[a, b, c, d, e]`` where ``a, b`` are on ``pp_rank 0`` and ``c, d, e`` are on - ``pp_rank 1``, if the strategy is ``group_3``, then ``a, b`` are - checkpointed together on ``pp_rank 0`` and ``c, d, e`` are checkpointed - together on ``pp_rank 1``. - - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the checkpointed layer takes a tuple as - input, then this needs to be set to ``True``. - -.. class:: smp.set_activation_checkpointing(module, preserve_rng_state=True, pack_args_as_tuple=False, strategy="each") - :noindex: - - - This API is recommended when importing pretrained models from - libraries such as PyTorch and Hugging Face Transformers. This is - particularly useful when you don't have access to the model - definition code and are not able to replace a module call with - ``checkpoint``.
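As a rough illustration of how these three checkpointing entry points relate to each other, the following is a minimal sketch. The model, its attribute names, and the strategy choices are hypothetical, and the import path for ``checkpoint`` and ``checkpoint_sequential`` is assumed from the fully qualified names shown above; it is not taken verbatim from the library's examples.

.. code:: python

    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp
    from smdistributed.modelparallel.torch.patches.checkpoint import (
        checkpoint,
        checkpoint_sequential,
    )

    class Net(nn.Module):
        """Hypothetical model: one standalone block plus a sequential stack."""

        def __init__(self):
            super().__init__()
            self.block = nn.Linear(1024, 1024)
            self.stack = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)])

        def forward(self, x):
            # Checkpoint a single module instance; the granularity is the
            # module, not an arbitrary function.
            x = checkpoint(self.block, x)
            # Checkpoint an nn.Sequential, grouping consecutive layers that
            # lie on the same partition into one checkpointing call.
            return checkpoint_sequential(self.stack, x, strategy="contiguous")

    model = Net()

    # Alternative that avoids editing forward(): configure checkpointing on an
    # existing (for example, pretrained) sequential module after the fact.
    smp.set_activation_checkpointing(model.stack, strategy="group_2")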
- - - **Arguments**: - - - ``module (Instance of nn.Module or nn.Sequential)``: The module - to checkpoint. - - ``preserve_rng_state (bool, default=True)``: Set to ``False`` - to omit stashing and restoring the RNG state during each - checkpoint. - - ``pack_args_as_tuple (bool, default=False)``: *Can only be - passed when module is a sequential module.* To ensure that - backward works correctly, the autograd function has to unpack - any tuples received. If the layer checkpointed takes a tuple as - input, then this needs to be set to True. - - ``strategy: (string, default=“each”)``: *Can only be passed - when module is a sequential module.* Strategy determines how - many layers part of the sequential module need to be grouped - together for one checkpointing call. - - This determines how much memory can be reduced. It can take the - following values - - - ``each`` : The default is to checkpoint each module inside - the sequential separately. - - ``contiguous``: Groups consecutive layers on the same - partition together. For example if a sequential consists of - ``[a, b, c, d]`` where ``a, b`` are on ``pp_rank0`` and ``c, d`` are on - ``pp_rank 1``, then this strategy would checkpoint a,b together - and then ``c, d`` together. This means effectively, the inputs of - ``a``, outputs of ``b``, inputs of ``c``, and outputs of ``d`` are in - memory, and the rest of the activations are recomputed. - - ``group_2, group_3, group_4, etc:`` More generally, - ``group_x`` where x is an integer. This strategy provides - more flexibility in how many layers to group together. - ``group_x`` groups x number of layers together on a best - effort basis if there are x layers consecutively in the same - partition. **Example**: Assume a module with layers ``[a, b, - c, d, e]``. The layers a and b are on pp_rank0, and ``c``, ``d``, and - ``e`` are on ``pp_rank 1``. If the strategy is ``group_3,`` then ``a``, - ``b`` are checkpointed together on ``pp_rank0``, and ``c``, ``d``, ``e`` are - checkpointed together on ``pp_rank1``. - -.. _smdmp-tp-appendix: - :noindex: - -Appendix: Reference Implementations for Modules ------------------------------------------------ - -The following are reference implementations for transformer-related -modules. Note that this is not the actual ``smdistributed`` source code, -but the distributed implementations provided in the library are the -distributed versions of these reference implementations, and can be used -to determine whether the distributed modules perform the same operations -as the custom modules in your script. - -To keep the implementations simple, we only assume keyword arguments, -and assume the existence of a method ``parse_args(kwargs)``, which -parses the arguments to ``__init__`` methods and sets the relevant -attributes of the module, such as ``hidden_size`` and -``num_attention_heads``. - -``smp.nn.DistributedTransformer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class Transformer(nn.Module): - def __init__(self, **kwargs): - super(Transformer, self).__init__() - self.parse_args(kwargs) - - self.layers = [] - for l in range(self.num_layers): - self.layers.append(TransformerLayer(**kwargs)) - - self.seq_layers = nn.Sequential(*self.layers) - - def forward(self, inp): - return self.seq_layers(inp) - -``smp.nn.DistributedTransformerLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: python - - class TransformerLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerLayer, self).__init__() - self.parse_args(kwargs) - - self.attention = AttentionLayer(**kwargs) - self.output = TransformerOutputLayer(**kwargs) - - if self.add_cross_attention: - self.cross_attention = AttentionLayer(cross_attention=True, **kwargs) - - def forward(self, inp): - if self.add_cross_attention: - hidden_states, cross_states, attention_mask, cross_mask = inp - else: - hidden_states, attention_mask = inp - - attention_output = self.attention((hidden_states, attention_mask)) - if self.add_cross_attention: - attention_output = self.cross_attention((attention_output, - cross_states, - cross_mask)) - - output = self.output(attention_output) - - if self.add_cross_attention: - return output, cross_states, attention_mask, cross_mask - else: - return output, attention_mask - -``smp.nn.DistributedAttentionLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class AttentionLayer(nn.Module): - def __init__(self, **kwargs): - super(AttentionLayer, self).__init__() - self.parse_args(kwargs) - self.attention_head_size = self.hidden_size // self.num_attention_heads - - self.query = nn.Linear(self.hidden_size, self.hidden_size) - self.key = nn.Linear(self.hidden_size, self.hidden_size) - self.value = nn.Linear(self.hidden_size, self.hidden_size) - self.dense = nn.Linear(self.hidden_size, self.hidden_size) - - self.dropout1 = nn.Dropout(self.attention_dropout_prob) - self.dropout2 = nn.Dropout(self.hidden_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def transpose(self, tensor, key=False): - shape = tensor.size()[:-1] + - (self.num_attention_heads, self.attention_head_size) - tensor = torch.reshape(tensor, shape) - if key: - return tensor.permute(0, 2, 3, 1) - else: - return tensor.permute(0, 2, 1, 3) - - def forward(self, inp): - if self.cross_attention: - hidden_states, cross_states, attention_mask = inp - else: - hidden_states, attention_mask = inp - - if self.pre_layernorm: - norm_states = self.pre_layernorm(hidden_states) - else: - norm_states = hidden_states - - query_layer = self.query(norm_states) - - if self.cross_attention: - key_layer = self.key(cross_states) - value_layer = self.value(cross_states) - else: - key_layer = self.key(norm_states) - value_layer = self.value(norm_states) - - query_layer = self.transpose(query_layer) - key_layer = self.transpose(key_layer, key=True) - value_layer = self.transpose(value_layer) - - attention_scores = torch.matmul(query_layer, key_layer) - attention_scores = attention_scores / math.sqrt(self.attention_head_size) - - if not self.cross_attention and self.causal_mask is not None: - attention_scores = self.apply_causal_mask(attention_scores) - - attention_scores = attention_scores + attention_mask - - attention_probs = F.softmax(attention_scores, dim=-1) - attention_probs = self.dropout1(attention_probs) - - context_layer = torch.matmul(attention_probs, value_layer) - context_layer = context_layer.permute(0, 2, 1, 3) - new_context_layer_shape = context_layer.size()[:-2] + \ - (self.local_attention_size,) - context_layer = torch.reshape(context_layer, new_context_layer_shape) - - self_attention = self.dense(context_layer) - self_attention = self.dropout2(self_attention) - - if self.post_layernorm: - return self.layernorm(self_attention + 
hidden_states) - else: - return self_attention - -``smp.nn.DistributedTransformerOutputLayer`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: python - - class TransformerOutputLayer(nn.Module): - def __init__(self, **kwargs): - super(TransformerOutputLayer, self).__init__() - self.parse_args(kwargs) - - self.dense1 = nn.Linear(self.hidden_size, self.intermediate_size) - self.dense2 = nn.Linear(self.intermediate_size, self.hidden_size) - - self.dropout = nn.Dropout(self.attention_dropout_prob) - - if self.pre_layernorm: - self.pre_layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - if self.post_layernorm: - self.layernorm = nn.LayerNorm(self.hidden_size, - eps=self.layernorm_epsilon) - - def forward(self, inp): - if self.pre_layernorm: - norm_inp = self.pre_layernorm(inp) - else: - norm_inp = inp - - dense1_output = self.dense1(norm_inp) - if self.activation == "gelu": - act_output = F.gelu(dense1_output) - else: - act_output = F.relu(dense1_output) - - dense2_output = self.dense2(act_output) - output = self.dropout(dense2_output) - - if self.post_layernorm: - return self.layernorm(inp + output) - else: - return output diff --git a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_tensorflow.rst deleted file mode 100644 index 2c658b487c..0000000000 --- a/doc/api/training/smp_versions/v1.9.0/smd_model_parallel_tensorflow.rst +++ /dev/null @@ -1,171 +0,0 @@ -TensorFlow API -============== - -To use the TensorFlow-specific APIs for SageMaker distributed model parallism, -you need to add the following import statement at the top of your training script. - -.. code:: python - - import smdistributed.modelparallel.tensorflow as smp - -.. tip:: - - Refer to - `Modify a TensorFlow Training Script - `_ - to learn how to use the following APIs in your TensorFlow training script. - -.. class:: smp.DistributedModel - :noindex: - - A sub-class of the Keras \ ``Model`` class, which defines the model to - be partitioned. Model definition is done by sub-classing - ``smp.DistributedModel`` class, and implementing the ``call()`` method, - in the same way as the Keras model sub-classing API. Any operation that - is part of the \ ``smp.DistributedModel.call()`` method is subject to - partitioning, meaning that every operation placed inside executes in - exactly one of the devices (the operations outside run on all devices). - - - Similar to the regular Keras API, the forward pass is done by directly - calling the model object on the input tensors. For example: - - .. code:: python - - predictions = model(inputs)   # model is a smp.DistributedModel object - - However, ``model()`` calls can only be made inside a - ``smp.step``-decorated function. - - The outputs from a ``smp.DistributedModel`` are available in all ranks, - regardless of which rank computed the last operation. - - **Methods:** - - .. function:: save_model(save_path="/opt/ml/model") - :noindex: - - **Inputs** - - ``save_path`` (``string``): A path to save an unpartitioned model with latest training weights. - - Saves the entire, - unpartitioned model with the latest trained weights to ``save_path`` in - TensorFlow ``SavedModel`` format. Defaults to ``"/opt/ml/model"``, which - SageMaker monitors to upload the model artifacts to Amazon S3. - -.. function:: smp.partition(index) - :noindex: - - **Inputs** - - - ``index`` (``int``): The index of the partition. 
- - A context manager which places all operations defined inside into the - partition whose ID is equal to ``index``. When - ``smp.partition`` contexts are nested, the innermost context overrides - the rest. The ``index`` argument must be smaller than the number of - partitions. - - ``smp.partition`` is used in the manual partitioning API; - if \ ``"auto_partition"`` parameter is set to ``True`` while launching - training, then ``smp.partition`` contexts are ignored. Any operation - that is not placed in any ``smp.partition`` context is placed in the - ``default_partition``, as shown in the following example: - - .. code:: python - - # auto_partition: False - # default_partition: 0 - smp.init() - [...] - x = tf.constant(1.2)                     # placed in partition 0 - with smp.partition(1): -     y = tf.add(x, tf.constant(2.3))      # placed in partition 1 -     with smp.partition(3): -         z = tf.reduce_sum(y)             # placed in partition 3 - - -.. function:: register_post_partition_hook(hook) - :noindex: - - Registers a callable ``hook`` to - be executed after the model is partitioned. This is useful in situations - where an operation needs to be executed after the model partition during - the first call to ``smp.step``, but before the actual execution of the - first forward pass. - - .. code:: python - - @smp.register_post_partition_hook - def test_eager(): - # All statements here will be executed right after partition but before the first forward pass - tf.print("Entered hook through eager context") - -.. class:: smp.CheckpointManager - :noindex: - - - A subclass of TensorFlow - `CheckpointManager `__, - which is used to manage checkpoints. The usage is similar to TensorFlow - ``CheckpointManager``. - - The following returns a ``CheckpointManager`` object. - - .. code:: python - - smp.CheckpointManager(checkpoint, -                       directory="/opt/ml/checkpoints", -                       max_to_keep=None, -                       checkpoint_name="ckpt") - - **Parameters** - - - ``checkpoint``: A `tf.train.Checkpoint - `__ instance - that represents a model checkpoint. - - - ``directory``: (``str``) The path to a directory in which to write - checkpoints. A file named "checkpoint" is also written to this - directory (in a human-readable text format) which contains the state - of the ``CheckpointManager``. Defaults to - ``"/opt/ml/checkpoints"``, which is the directory that SageMaker - monitors for uploading the checkpoints to Amazon S3. - - ``max_to_keep`` (``int``): The number of checkpoints to keep. If - ``None``, all checkpoints are kept. - - ``checkpoint_name`` (``str``): Custom name for the checkpoint file. - Defaults to ``"ckpt"``. - - - **Methods:** - - .. function:: save( ) - :noindex: - - Saves a new checkpoint in the specified directory. Internally uses ``tf.train.CheckpointManager.save()``. - - .. function:: restore( ) - :noindex: - - Restores the latest checkpoint in the specified directory. - Internally uses ``tf.train.CheckpointManager.restore()``. - - - **Examples:** - - .. code:: python - - checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model) - ckpt_manager = smp.CheckpointManager(checkpoint, max_to_keep=5)  # use /opt/ml/checkpoints - - for inputs in train_ds: -     loss = train_step(inputs) -     # [...] -     ckpt_manager.save()  # save a new checkpoint in /opt/ml/checkpoints - - .. 
code:: python - - for step, inputs in enumerate(train_ds): -     if step == 0: -         ckpt_manager.restore() -     loss = train_step(inputs) diff --git a/doc/api/training/smp_versions/v1_10_0.rst b/doc/api/training/smp_versions/v1_10_0.rst deleted file mode 100644 index dc2c1d18d1..0000000000 --- a/doc/api/training/smp_versions/v1_10_0.rst +++ /dev/null @@ -1,13 +0,0 @@ - -Version 1.10.0 -============== - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - v1.10.0/smd_model_parallel_common_api - v1.10.0/smd_model_parallel_pytorch - v1.10.0/smd_model_parallel_pytorch_tensor_parallel - v1.10.0/smd_model_parallel_tensorflow diff --git a/doc/api/training/smp_versions/v1_1_0.rst b/doc/api/training/smp_versions/v1_1_0.rst deleted file mode 100644 index 34b2d83b6b..0000000000 --- a/doc/api/training/smp_versions/v1_1_0.rst +++ /dev/null @@ -1,12 +0,0 @@ - -Version 1.1.0 -============= - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - v1.1.0/smd_model_parallel_common_api - v1.1.0/smd_model_parallel_pytorch - v1.1.0/smd_model_parallel_tensorflow diff --git a/doc/api/training/smp_versions/v1_2_0.rst b/doc/api/training/smp_versions/v1_2_0.rst deleted file mode 100644 index 4201de0b52..0000000000 --- a/doc/api/training/smp_versions/v1_2_0.rst +++ /dev/null @@ -1,12 +0,0 @@ - -Version 1.2.0 -============= - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - v1.2.0/smd_model_parallel_common_api - v1.2.0/smd_model_parallel_pytorch - v1.2.0/smd_model_parallel_tensorflow diff --git a/doc/api/training/smp_versions/v1_3_0.rst b/doc/api/training/smp_versions/v1_3_0.rst deleted file mode 100644 index 80d73acbd9..0000000000 --- a/doc/api/training/smp_versions/v1_3_0.rst +++ /dev/null @@ -1,12 +0,0 @@ - -Version 1.3.x -============= - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - v1.3.0/smd_model_parallel_common_api - v1.3.0/smd_model_parallel_pytorch - v1.3.0/smd_model_parallel_tensorflow diff --git a/doc/api/training/smp_versions/v1_4_0.rst b/doc/api/training/smp_versions/v1_4_0.rst deleted file mode 100644 index 4485ae6a40..0000000000 --- a/doc/api/training/smp_versions/v1_4_0.rst +++ /dev/null @@ -1,12 +0,0 @@ - -Version 1.4.x -============= - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - v1.4.0/smd_model_parallel_common_api - v1.4.0/smd_model_parallel_pytorch - v1.4.0/smd_model_parallel_tensorflow diff --git a/doc/api/training/smp_versions/v1_5_0.rst b/doc/api/training/smp_versions/v1_5_0.rst deleted file mode 100644 index c93761efa4..0000000000 --- a/doc/api/training/smp_versions/v1_5_0.rst +++ /dev/null @@ -1,12 +0,0 @@ - -Version 1.5.x -============= - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. 
toctree:: - :maxdepth: 1 - - v1.5.0/smd_model_parallel_common_api - v1.5.0/smd_model_parallel_pytorch - v1.5.0/smd_model_parallel_tensorflow diff --git a/doc/api/training/smp_versions/v1_6_0.rst b/doc/api/training/smp_versions/v1_6_0.rst deleted file mode 100644 index fe02479853..0000000000 --- a/doc/api/training/smp_versions/v1_6_0.rst +++ /dev/null @@ -1,13 +0,0 @@ - -Version 1.6.0 -============= - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - v1.6.0/smd_model_parallel_common_api - v1.6.0/smd_model_parallel_pytorch - v1.6.0/smd_model_parallel_pytorch_tensor_parallel - v1.6.0/smd_model_parallel_tensorflow diff --git a/doc/api/training/smp_versions/v1_9_0.rst b/doc/api/training/smp_versions/v1_9_0.rst deleted file mode 100644 index e2e9acd83a..0000000000 --- a/doc/api/training/smp_versions/v1_9_0.rst +++ /dev/null @@ -1,13 +0,0 @@ - -Version 1.7.0, 1.8.0, 1.8.1, 1.9.0 -================================== - -To use the library, reference the Common API documentation alongside the framework specific API documentation. - -.. toctree:: - :maxdepth: 1 - - v1.9.0/smd_model_parallel_common_api - v1.9.0/smd_model_parallel_pytorch - v1.9.0/smd_model_parallel_pytorch_tensor_parallel - v1.9.0/smd_model_parallel_tensorflow diff --git a/src/sagemaker/pytorch/estimator.py b/src/sagemaker/pytorch/estimator.py index d127a2a2d6..a4e24d1ff0 100644 --- a/src/sagemaker/pytorch/estimator.py +++ b/src/sagemaker/pytorch/estimator.py @@ -107,57 +107,65 @@ def __init__( If ``framework_version`` or ``py_version`` are ``None``, then ``image_uri`` is required. If also ``None``, then a ``ValueError`` will be raised. - distribution (dict): A dictionary with information on how to run distributed training - (default: None). Currently, the following are supported: - distributed training with parameter servers, SageMaker Distributed (SMD) Data - and Model Parallelism, and MPI. SMD Model Parallelism can only be used with MPI. + distribution (dict): A dictionary with information on how to configure and + run distributed training + (default: None). The following options are available. - **To enable the SageMaker distributed data parallelism:** + **To enable the SageMaker distributed data parallelism (SMDDP) library:** .. code:: python { "smdistributed": { "dataparallel": { "enabled": True } } } - .. seealso:: + Beside activating the SMDDP library through this parameter, + you also need to add few lines of code in your training script + for initializing PyTorch Distributed with the SMDDP setups. + To learn how to configure your training job with the SMDDP library v2, see + `Run distributed training with the SageMaker distributed data parallelism + library + `_ + in the *Amazon SageMaker User Guide*. - To learn more, see :ref:`sdp_api_docs_toc`. - - **To enable the SageMaker distributed model parallelism:** + **To enable the SageMaker distributed model parallelism (SMP) library v2:** .. code:: python { + "torch_distributed": { "enabled": True }, "smdistributed": { "modelparallel": { - "enabled":True, + "enabled": True, "parameters": { - "partitions": 2, - "microbatches": 4, - "placement_strategy": "spread", - "pipeline": "interleaved", - "optimize": "speed", - "ddp": True, - } + "tensor_parallel_degree": 8, + "hybrid_shard_degree": 1, + ... + }, + } }, - "mpi": { - "enabled" : True, - "processes_per_host" : 8, - } } - .. 
note:: + Beside activating the SMP library v2 through this parameter, + you also need to add few lines of code in your training script + for initializing PyTorch Distributed with the SMP setups. + To learn how to configure your training job with the SMP library v2, see + `Run distributed training with the SageMaker model parallelism library v2 + `_ + in the *Amazon SageMaker User Guide*. - The SageMaker distributed model parallel library internally uses MPI. - In order to use model parallelism, MPI also must be enabled. - - .. seealso:: + .. note:: - To learn more, see :ref:`smp_api_docs_toc`. + The SageMaker distributed model parallel library v2 requires with + ``torch_distributed``. - .. seealso:: + .. note:: - To find a complete list of parameters for SageMaker model parallelism, - see :ref:`sm-sdk-modelparallel-general`. + The documentation for the SMP library v1.x is archived and available at + `Run distributed training with the SageMaker model parallelism library + `_ + in the *Amazon SageMaker User Guide*, + and the SMP v1 API reference is available in the + `SageMaker Python SDK v2.199.0 documentation + `_. **To enable PyTorch DDP:** diff --git a/src/sagemaker/tensorflow/estimator.py b/src/sagemaker/tensorflow/estimator.py index eb4366f0a7..523b70ec38 100644 --- a/src/sagemaker/tensorflow/estimator.py +++ b/src/sagemaker/tensorflow/estimator.py @@ -86,56 +86,7 @@ def __init__( ``image_uri`` is required. If also ``None``, then a ``ValueError`` will be raised. distribution (dict): A dictionary with information on how to run distributed training - (default: None). Currently, the following are supported: - distributed training with parameter servers, SageMaker Distributed (SMD) Data - and Model Parallelism, and MPI. SMD Model Parallelism can only be used with MPI. - - **To enable the SageMaker distributed data parallelism:** - - .. code:: python - - { "smdistributed": { "dataparallel": { "enabled": True } } } - - .. seealso:: - - To learn more, see :ref:`sdp_api_docs_toc`. - - **To enable the SageMaker distributed model parallelism:** - - .. code:: python - - { - "smdistributed": { - "modelparallel": { - "enabled":True, - "parameters": { - "partitions": 2, - "microbatches": 4, - "placement_strategy": "spread", - "pipeline": "interleaved", - "optimize": "speed", - "ddp": True, - } - }, - "mpi": { - "enabled" : True, - "processes_per_host" : 8, - } - } - - .. note:: - - The SageMaker distributed model parallel library internally uses MPI. - In order to use model parallelism, MPI also must be enabled. - - .. seealso:: - - To learn more, see :ref:`smp_api_docs_toc`. - - .. seealso:: - - To find a complete list of parameters for SageMaker model parallelism, - see :ref:`sm-sdk-modelparallel-general`. + (default: None). **To enable Multi Worker Mirrored Strategy:** @@ -179,6 +130,31 @@ def __init__( To learn more, see `Training with parameter servers `_. + + .. note:: + + The SageMaker distributed data parallelism (SMDDP) library + discontinued support for TensorFlow. + The documentation for the SMDDP library v1.x is still available at + `Use the SMDDP library in your TensorFlow training script (deprecated) + `_ + in the *Amazon SageMaker User Guide*, + and the `SMDDP v1 API reference in the + SageMaker Python SDK v2.199.0 documentation + `_. + + .. note:: + + The SageMaker model parallelism (SMP) library v2 discontinued support + for TensorFlow. 
+ The documentation for the SMP library v1.x is archived and available at + `Run distributed training with the SageMaker model parallelism library + `_ + in the *Amazon SageMaker User Guide*, + and the `SMP v1 API reference in the + SageMaker Python SDK v2.199.0 documentation + `_. + compiler_config (:class:`~sagemaker.tensorflow.TrainingCompilerConfig`): Configures SageMaker Training Compiler to accelerate training.
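To show where the remaining ``distribution`` options are passed, here is a minimal sketch of constructing a TensorFlow estimator with Multi Worker Mirrored Strategy enabled. The entry point, IAM role, instance settings, framework and Python versions, and S3 path are placeholder assumptions, not values prescribed by this change.

.. code:: python

    from sagemaker.tensorflow import TensorFlow

    # Placeholder values; substitute your own script, role, instance settings,
    # and a supported framework/Python version combination.
    estimator = TensorFlow(
        entry_point="train.py",
        role="arn:aws:iam::111122223333:role/SageMakerRole",
        instance_count=2,
        instance_type="ml.g4dn.12xlarge",
        framework_version="2.13",
        py_version="py310",
        distribution={"multi_worker_mirrored_strategy": {"enabled": True}},
    )
    estimator.fit("s3://amzn-s3-demo-bucket/training-data")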