
Commit 4651c11

Centralize DDP speedups in docs (#12448)
1 parent cf35182 commit 4651c11

2 files changed: +68 -64 lines changed


docs/source/advanced/model_parallel.rst

Lines changed: 67 additions & 17 deletions
@@ -718,6 +718,73 @@ DDP Optimizations
 ^^^^^^^^^^^^^^^^^


+When Using DDP Strategies, Set find_unused_parameters=False
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+By default, Lightning sets ``find_unused_parameters=True`` to work around compatibility issues that have been observed in the past (refer to the `discussion <https://github.com/PyTorchLightning/pytorch-lightning/discussions/6219>`_ for more details).
+When enabled, it can add a noticeable performance overhead on every iteration, and in most cases it can safely be disabled. Read more about DDP's internal design `here <https://pytorch.org/docs/stable/notes/ddp.html#internal-design>`_.
+
+.. tip::
+    This applies to all DDP strategies that accept ``find_unused_parameters`` as an argument.
+
+.. code-block:: python
+
+    import pytorch_lightning as pl
+    from pytorch_lightning.strategies import DDPStrategy
+
+    trainer = pl.Trainer(
+        gpus=2,
+        strategy=DDPStrategy(find_unused_parameters=False),
+    )
+
+.. code-block:: python
+
+    import pytorch_lightning as pl
+    from pytorch_lightning.strategies import DDPSpawnStrategy
+
+    trainer = pl.Trainer(
+        gpus=2,
+        strategy=DDPSpawnStrategy(find_unused_parameters=False),
+    )
+
+
+DDP Static Graph
+""""""""""""""""
+
+`DDP static graph <https://pytorch.org/blog/pytorch-1.11-released/#stable-ddp-static-graph>`__ assumes that your model
+employs the same set of used/unused parameters in every iteration, so that it can deterministically know the flow of
+training and apply special optimizations during runtime.
+
+.. note::
+    DDP static graph support requires PyTorch >= 1.11.0.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from pytorch_lightning.strategies import DDPStrategy
+
+    trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
+
+When Using DDP on a Multi-node Cluster, Set NCCL Parameters
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library used by PyTorch to handle communication across nodes and GPUs. Tuning NCCL parameters has been reported to deliver noticeable speedups, as described in this `issue <https://github.com/PyTorchLightning/pytorch-lightning/issues/7179>`__: a 30% training speed improvement for the XLM-RoBERTa transformer and a 15% improvement for Detectron2.
+
+NCCL parameters can be adjusted via environment variables.
+
+.. note::
+
+    AWS and GCP already set default values for these on their clusters, so tuning them is typically only useful for custom cluster setups.
+
+* `NCCL_NSOCKS_PERTHREAD <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nsocks-perthread>`__
+* `NCCL_SOCKET_NTHREADS <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-nthreads>`__
+* `NCCL_MIN_NCHANNELS <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-min-nchannels>`__
+
+.. code-block:: bash
+
+    export NCCL_NSOCKS_PERTHREAD=4
+    export NCCL_SOCKET_NTHREADS=2
+
 Gradients as Bucket View
 """"""""""""""""""""""""
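The bash snippet in the added section above sets only two of the three NCCL variables it lists. If you would rather configure them from the training script itself, a minimal sketch might look like the following; the values are illustrative only (NCCL tuning is cluster- and workload-specific), ``NCCL_MIN_NCHANNELS`` is left untouched because no value is suggested above, and the variables must be set before ``trainer.fit()`` launches the distributed processes:

.. code-block:: python

    import os

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import DDPStrategy

    # Illustrative values only -- tune them for your own cluster; AWS and GCP
    # images typically configure NCCL already.
    os.environ["NCCL_NSOCKS_PERTHREAD"] = "4"
    os.environ["NCCL_SOCKET_NTHREADS"] = "2"

    # Any DDP-based strategy works here; this reuses the settings shown above.
    trainer = pl.Trainer(
        num_nodes=2,
        devices=2,
        accelerator="gpu",
        strategy=DDPStrategy(find_unused_parameters=False),
    )

Exporting the variables in your job script, as shown in the docs, remains the most portable option, since every node and rank then sees them regardless of how Python is launched.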

@@ -839,20 +906,3 @@ When using Post-localSGD, you must also pass ``model_averaging_period`` to allow
         ),
     )
     trainer.fit(model)
-
-DDP Static Graph
-""""""""""""""""
-
-`DDP static graph <https://pytorch.org/blog/pytorch-1.11-released/#stable-ddp-static-graph>`__ assumes that your model
-employs the same set of used/unused parameters in every iteration, so that it can deterministically know the flow of
-training and apply special optimizations during runtime.
-
-.. note::
-    DDP static graph support requires PyTorch>=1.11.0
-
-.. code-block:: python
-
-    from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DDPStrategy
-
-    trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
docs/source/guides/speed.rst

Lines changed: 1 addition & 47 deletions
@@ -77,53 +77,7 @@ Whereas :class:`~pytorch_lightning.strategies.ddp.DDPStrategy` only performs two

 |

-
-When Using DDP Plugins, Set find_unused_parameters=False
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-By default, we have set ``find_unused_parameters=True`` for compatibility reasons that have been observed in the past (refer to the `discussion <https://github.com/PyTorchLightning/pytorch-lightning/discussions/6219>`_ for more details).
-When enabled, it can result in a performance hit and can be disabled in most cases. Read more about it `here <https://pytorch.org/docs/stable/notes/ddp.html#internal-design>`_.
-
-.. tip::
-    It applies to all DDP strategies that support ``find_unused_parameters`` as input.
-
-.. code-block:: python
-
-    from pytorch_lightning.strategies import DDPStrategy
-
-    trainer = pl.Trainer(
-        gpus=2,
-        strategy=DDPStrategy(find_unused_parameters=False),
-    )
-
-.. code-block:: python
-
-    from pytorch_lightning.strategies import DDPSpawnStrategy
-
-    trainer = pl.Trainer(
-        gpus=2,
-        strategy=DDPSpawnStrategy(find_unused_parameters=False),
-    )
-
-When Using DDP on a Multi-node Cluster, Set NCCL Parameters
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library that is used by PyTorch to handle communication across nodes and GPUs. There are reported benefits in terms of speedups when adjusting NCCL parameters as seen in this `issue <https://github.com/PyTorchLightning/pytorch-lightning/issues/7179>`__. In the issue, we see a 30% speed improvement when training the Transformer XLM-RoBERTa and a 15% improvement in training with Detectron2.
-
-NCCL parameters can be adjusted via environment variables.
-
-.. note::
-
-    AWS and GCP already set default values for these on their clusters. This is typically useful for custom cluster setups.
-
-* `NCCL_NSOCKS_PERTHREAD <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nsocks-perthread>`__
-* `NCCL_SOCKET_NTHREADS <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-nthreads>`__
-* `NCCL_MIN_NCHANNELS <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-min-nchannels>`__
-
-.. code-block:: bash
-
-    export NCCL_NSOCKS_PERTHREAD=4
-    export NCCL_SOCKET_NTHREADS=2
+For more details on how to tune performance with DDP, please see the :ref:`DDP Optimizations <ddp-optimizations>` section.

 DataLoaders
 ^^^^^^^^^^^
