diff --git a/doc/Makefile b/doc/Makefile index 1cdfaa77df..af378c2e0f 100644 --- a/doc/Makefile +++ b/doc/Makefile @@ -3,7 +3,7 @@ # You can set these variables from the command line. SPHINXOPTS = -W -SPHINXBUILD = python -msphinx +SPHINXBUILD = python -msphinx SPHINXPROJ = sagemaker SOURCEDIR = . BUILDDIR = _build diff --git a/doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst b/doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst index 52de6223d7..85c9594e73 100644 --- a/doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst +++ b/doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst @@ -2,10 +2,12 @@ PyTorch Guide to SageMaker's distributed data parallel library ############################################################## -.. admonition:: Contents +Use this guide to learn about the SageMaker distributed +data parallel library API for PyTorch. - - :ref:`pytorch-sdp-modify` - - :ref:`pytorch-sdp-api` +.. contents:: Topics + :depth: 3 + :local: .. _pytorch-sdp-modify: @@ -55,7 +57,7 @@ API offered for PyTorch. - Modify the ``torch.utils.data.distributed.DistributedSampler`` to - include the cluster’s information. Set``num_replicas`` to the + include the cluster’s information. Set ``num_replicas`` to the total number of GPUs participating in training across all the nodes in the cluster. This is called ``world_size``. You can get ``world_size`` with @@ -110,7 +112,7 @@ you will have for distributed training with the distributed data parallel librar def main():     # Scale batch size by world size -     batch_size //= dist.get_world_size() // 8 +     batch_size //= dist.get_world_size()     batch_size = max(batch_size, 1)     # Prepare dataset @@ -153,9 +155,132 @@ you will have for distributed training with the distributed data parallel librar PyTorch API =========== -.. rubric:: Supported versions +.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None) + + ``smdistributed.dataparallel``'s implementation of distributed data + parallelism for PyTorch. In most cases, wrapping your PyTorch Module + with ``smdistributed.dataparallel``'s ``DistributedDataParallel`` (DDP) is + all you need to do to use ``smdistributed.dataparallel``. + + Creation of this DDP class requires ``smdistributed.dataparallel`` + already initialized + with ``smdistributed.dataparallel.torch.distributed.init_process_group()``. + + This container parallelizes the application of the given module by + splitting the input across the specified devices by chunking in the + batch dimension. The module is replicated on each machine and each + device, and each such replica handles a portion of the input. During the + backwards pass, gradients from each node are averaged. + + The batch size should be larger than the number of GPUs used locally. + ​ + Example usage + of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``: + + .. code:: python + + import torch + import smdistributed.dataparallel.torch.distributed as dist + from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP + + dist.init_process_group() + + # Pin GPU to be used to process local rank (one GPU per process) + torch.cuda.set_device(dist.get_local_rank()) + + # Build model and optimizer + model = ... 
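+ # Illustrative comments, not part of the original snippet: the DistributedSampler from the + # script-modification steps above would typically be constructed around here, for example + # sampler = torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank()) + # where dataset is a placeholder for your training dataset.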
+ optimizer = torch.optim.SGD(model.parameters(), +                             lr=1e-3 * dist.get_world_size()) + # Wrap model with smdistributed.dataparallel's DistributedDataParallel + model = DDP(model) + + **Parameters:** + + - ``module (torch.nn.Module)(required):`` PyTorch NN Module to be + parallelized + - ``device_ids (list[int])(optional):`` CUDA devices. This should only + be provided when the input module resides on a single CUDA device. + For single-device modules, + the i-th module replica is placed on ``device_ids[i]``. For + multi-device modules and CPU modules, ``device_ids`` must be ``None`` or an + empty list, and input data for the forward pass must be placed on the + correct device. Defaults to ``None``. + - ``output_device (int)(optional):`` Device location of output for + single-device CUDA modules. For multi-device modules and CPU modules, + it must be ``None``, and the module itself dictates the output location. + (default: ``device_ids[0]`` for single-device modules). Defaults + to ``None``. + - ``broadcast_buffers (bool)(optional):`` Flag that enables syncing + (broadcasting) buffers of the module at the beginning of the forward + function. ``smdistributed.dataparallel`` does not yet support broadcast + buffers. Set this to ``False``. + - ``process_group(smdistributed.dataparallel.torch.distributed.group)(optional):`` Process + group is not supported in ``smdistributed.dataparallel``. This + parameter exists for API parity with torch.distributed only. The only + supported value is + ``smdistributed.dataparallel.torch.distributed.group.WORLD``. Defaults + to ``None``. + - ``bucket_cap_mb (int)(optional):`` DistributedDataParallel will + bucket parameters into multiple buckets so that gradient reduction of + each bucket can potentially overlap with backward + computation. ``bucket_cap_mb`` controls the bucket size in + megabytes (MB) (default: 25). + + .. note:: + + This module assumes all parameters are registered in the model by the + time it is created. No parameters should be added or removed later. + + .. note:: + + This module assumes that parameters are registered in the model of + each distributed process in the same order. The module itself + will conduct gradient all-reduction following the reverse order of + the registered parameters of the model. In other words, it is the user’s + responsibility to ensure that each distributed process has the exact + same model and thus the exact same parameter registration order. + + .. note:: + + You should never change the set of your model’s parameters after + wrapping your model with DistributedDataParallel. When you wrap + your model with DistributedDataParallel, its constructor registers the + gradient reduction functions on all the parameters of the model + at the time of construction. If you change the model’s + parameters after the DistributedDataParallel construction, this is + not supported and unexpected behaviors can happen, since some + parameters’ gradient reduction functions might not get called. + + .. method:: no_sync() + + ``smdistributed.dataparallel`` supports the `PyTorch DDP no_sync() `_ + context manager. It enables gradient accumulation by skipping AllReduce + during training iterations inside the context. + + .. note:: + + The ``no_sync()`` context manager is available starting with smdistributed-dataparallel v1.2.2. + To find the release note, see :ref:`sdp_1.2.2_release_note`. -**PyTorch 1.7.1, 1.8.1** + **Example:** + + .. 
code:: python + + # Gradients are accumulated while inside no_sync context + with model.no_sync(): + ... + loss.backward() + + # First iteration upon exiting context + # Incoming gradients are added to the accumulated gradients and then synchronized via AllReduce + ... + loss.backward() + + # Update weights and reset gradients to zero after accumulation is finished + optimizer.step() + optimizer.zero_grad() .. function:: smdistributed.dataparallel.torch.distributed.is_available() @@ -409,99 +534,6 @@ PyTorch API otherwise. -.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None) - - ``smdistributed.dataparallel's`` implementation of distributed data - parallelism for PyTorch. In most cases, wrapping your PyTorch Module - with ``smdistributed.dataparallel's`` ``DistributedDataParallel (DDP)`` is - all you need to do to use ``smdistributed.dataparallel``. - - Creation of this DDP class requires ``smdistributed.dataparallel`` - already initialized - with ``smdistributed.dataparallel.torch.distributed.init_process_group()``. - - This container parallelizes the application of the given module by - splitting the input across the specified devices by chunking in the - batch dimension. The module is replicated on each machine and each - device, and each such replica handles a portion of the input. During the - backwards pass, gradients from each node are averaged. - - The batch size should be larger than the number of GPUs used locally. - ​ - Example usage - of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``: - - .. code:: python - - import torch - import smdistributed.dataparallel.torch.distributed as dist - from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP - - dist.init_process_group() - - # Pin GPU to be used to process local rank (one GPU per process) - torch.cuda.set_device(dist.get_local_rank()) - - # Build model and optimizer - model = ... - optimizer = torch.optim.SGD(model.parameters(), -                             lr=1e-3 * dist.get_world_size()) - # Wrap model with smdistributed.dataparallel's DistributedDataParallel - model = DDP(model) - - **Parameters:** - - - ``module (torch.nn.Module)(required):`` PyTorch NN Module to be - parallelized - - ``device_ids (list[int])(optional):`` CUDA devices. This should only - be provided when the input module resides on a single CUDA device. - For single-device modules, - the ``ith module replica is placed on device_ids[i]``. For - multi-device modules and CPU modules, device_ids must be None or an - empty list, and input data for the forward pass must be placed on the - correct device. Defaults to ``None``. - - ``output_device (int)(optional):`` Device location of output for - single-device CUDA modules. For multi-device modules and CPU modules, - it must be None, and the module itself dictates the output location. - (default: device_ids[0] for single-device modules).  Defaults - to ``None``. - - ``broadcast_buffers (bool)(optional):`` Flag that enables syncing - (broadcasting) buffers of the module at beginning of the forward - function. ``smdistributed.dataparallel`` does not support broadcast - buffer yet. Please set this to ``False``. - - ``process_group(smdistributed.dataparallel.torch.distributed.group)(optional):`` Process - group is not supported in ``smdistributed.dataparallel``. This - parameter exists for API parity with torch.distributed only. 
Only - supported value is - ``smdistributed.dataparallel.torch.distributed.group.WORLD.`` Defaults - to ``None.`` - - ``bucket_cap_mb (int)(optional):`` DistributedDataParallel will - bucket parameters into multiple buckets so that gradient reduction of - each bucket can potentially overlap with backward - computation. ``bucket_cap_mb`` controls the bucket size in - MegaBytes (MB) (default: 25). - - .. rubric:: Notes - - - This module assumes all parameters are registered in the model by the - time it is created. No parameters should be added nor removed later. - - This module assumes all parameters are registered in the model of - each distributed processes are in the same order. The module itself - will conduct gradient all-reduction following the reverse order of - the registered parameters of the model. In other words, it is users’ - responsibility to ensure that each distributed process has the exact - same model and thus the exact same parameter registration order. - - You should never change the set of your model’s parameters after - wrapping up your model with DistributedDataParallel. In other words, - when wrapping up your model with DistributedDataParallel, the - constructor of DistributedDataParallel will register the additional - gradient reduction functions on all the parameters of the model - itself at the time of construction. If you change the model’s - parameters after the DistributedDataParallel construction, this is - not supported and unexpected behaviors can happen, since some - parameters’ gradient reduction functions might not get called. - - .. class:: smdistributed.dataparallel.torch.distributed.ReduceOp An enum-like class for supported reduction operations diff --git a/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst b/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst index 48f729d9a2..c615ad67aa 100644 --- a/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst +++ b/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst @@ -155,10 +155,6 @@ script you will have for distributed training with the library. TensorFlow API ============== -.. rubric:: Supported versions - -**TensorFlow 2.3.1, 2.4.1, 2.5.0** - .. function:: smdistributed.dataparallel.tensorflow.init() Initialize ``smdistributed.dataparallel``. Must be called at the diff --git a/doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst b/doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst index 5357a2166c..8de575a218 100644 --- a/doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst +++ b/doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst @@ -1,6 +1,43 @@ -Sagemaker Distributed Data Parallel 1.2.1 Release Notes +.. _sdp_1.2.2_release_note: + +SageMaker Distributed Data Parallel 1.2.2 Release Notes ======================================================= +*Date: November. 24. 2021* + +**New Features** + +* Added support for PyTorch 1.10 +* PyTorch ``no_sync`` API support for DistributedDataParallel +* Timeout when training stalls due to allreduce and broadcast collective calls + +**Bug Fixes** + +* Fixed a bug that would impact correctness in the mixed dtype case +* Fixed a bug related to the timeline writer that would cause a crash when SageMaker Profiler is enabled for single node jobs. 
+ +**Improvements** + +* Performance optimizations for small models on small clusters + +**Migration to AWS Deep Learning Containers** + +This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers: + +- PyTorch 1.10 DLC release: `v1.0-pt-sagemaker-1.10.0-py38 `_ + + .. code:: + + 763104351884.dkr.ecr..amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker + +---- + +Release History +=============== + +SageMaker Distributed Data Parallel 1.2.1 Release Notes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + *Date: June. 29. 2021* **New Features:** @@ -28,12 +65,8 @@ This version passed benchmark testing and is migrated to the following AWS Deep 763104351884.dkr.ecr..amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0 ----- - -Release History -=============== -Sagemaker Distributed Data Parallel 1.2.0 Release Notes +SageMaker Distributed Data Parallel 1.2.0 Release Notes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - New features @@ -46,7 +79,7 @@ Sagemaker Distributed Data Parallel 1.2.0 Release Notes AllReduce. For best performance, it is recommended you use an instance type that supports Amazon Elastic Fabric Adapter (ml.p3dn.24xlarge and ml.p4d.24xlarge) when you train a model using - Sagemaker Distributed data parallel. + SageMaker Distributed data parallel. **Bug Fixes:** @@ -54,7 +87,7 @@ Sagemaker Distributed Data Parallel 1.2.0 Release Notes ---- -Sagemaker Distributed Data Parallel 1.1.2 Release Notes +SageMaker Distributed Data Parallel 1.1.2 Release Notes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Bug Fixes @@ -68,7 +101,7 @@ Sagemaker Distributed Data Parallel 1.1.2 Release Notes **Known Issues:** -- Sagemaker Distributed data parallel has slower throughput than NCCL +- SageMaker Distributed data parallel has slower throughput than NCCL when run using a single node. For the best performance, use multi-node distributed training with smdistributed.dataparallel. Use a single node only for experimental runs while preparing your @@ -76,7 +109,7 @@ Sagemaker Distributed Data Parallel 1.1.2 Release Notes ---- -Sagemaker Distributed Data Parallel 1.1.1 Release Notes +SageMaker Distributed Data Parallel 1.1.1 Release Notes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - New Features @@ -103,7 +136,7 @@ Sagemaker Distributed Data Parallel 1.1.1 Release Notes ---- -Sagemaker Distributed Data Parallel 1.1.0 Release Notes +SageMaker Distributed Data Parallel 1.1.0 Release Notes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - New Features @@ -139,7 +172,7 @@ SDK Guide ---- -Sagemaker Distributed Data Parallel 1.0.0 Release Notes +SageMaker Distributed Data Parallel 1.0.0 Release Notes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - First Release
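A minimal sketch of how a training job would pick up this release through the SageMaker Python SDK's ``PyTorch`` estimator, assuming a training script named ``train.py`` and the EFA-capable instance type recommended in these notes; the ``distribution`` argument is what enables ``smdistributed.dataparallel``:

.. code:: python

    # Minimal sketch: launch a training job on the PyTorch 1.10 DLC with the
    # SageMaker distributed data parallel library enabled. The entry point name,
    # role, and instance settings are illustrative assumptions.
    import sagemaker
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",               # assumed name of your training script
        role=sagemaker.get_execution_role(),  # or an IAM role ARN
        framework_version="1.10",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",      # EFA-capable instance type recommended above
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit()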