
Commit 0974d66

Sean Naren, awaelchli, and kaushikb11 authored
Add docs for IPUs (#7923)
* Added base docs for IPUs
* Fix
* Add details around poptorch profiler and model parallelism
* more description
* Add image
* Clearer messaging
* Cleanup
* Better name
* Add note
* Add some details around device iterations and model parallelism
* Apply suggestions from code review
  Co-authored-by: Adrian Wälchli <[email protected]>
* Add a small install comment
* Add clip gradients not supported
* Update docs/source/advanced/ipu.rst
  Co-authored-by: Kaushik B <[email protected]>
* Add note

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Kaushik B <[email protected]>
1 parent 024cf23 commit 0974d66

File tree

3 files changed: +235 -0 lines changed

docs/source/_static/images/accelerator/ipus/profiler.png (127 KB, binary image)
docs/source/advanced/ipu.rst

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
.. _ipu:


IPU support
===========

.. note::
    IPU support is experimental and a work in progress (see :ref:`known-limitations`). If you run into any problems, please open an issue.

Lightning supports `Graphcore Intelligence Processing Units (IPUs) <https://www.graphcore.ai/products/ipu>`_, processors built for Artificial Intelligence and Machine Learning.

IPU Terminology
---------------

IPUs consist of many individual cores that allow computation to be parallelized. Due to the high-bandwidth links between cores,
IPUs are well suited to machine learning workloads where parallelization is essential. Because computation is heavily parallelized,
IPUs operate differently from conventional accelerators such as CPUs and GPUs:
they do not require large batch sizes for maximum parallelization, they can apply optimizations across the compiled graph, and they rely on model parallelism to fully utilize the cores for larger models.

IPUs are also found within IPU Pods, collections of IPU-enabled machines for larger workloads. See the `IPU Architecture <https://www.graphcore.ai/products/ipu>`__ page for more information.

How to access IPUs
------------------

To use IPUs you must have access to a server with IPU devices attached. To get access, see `getting started <https://www.graphcore.ai/getstarted>`_.

You must also ensure that the Poplar SDK, including the ``popart`` and ``poplar`` packages, is enabled on the server with the IPUs attached. Instructions are provided by Graphcore.

Training with IPUs
------------------

Specify the number of IPUs to train with. Note that when training with IPUs, you must select 1 or a power of 2 (i.e. 2, 4, 8, ...).

.. code-block:: python

    import pytorch_lightning as pl

    trainer = pl.Trainer(ipus=8)  # Train using data parallel on 8 IPUs

IPUs only support specifying a single number of devices to allocate; device selection is handled by the underlying libraries.
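
To make this concrete, here is a minimal sketch contrasting the supported and unsupported forms; the commented-out call is only an illustration of what the flag does *not* accept.

.. code-block:: python

    import pytorch_lightning as pl

    trainer = pl.Trainer(ipus=4)  # OK: allocate 4 IPUs

    # Not supported: selecting specific device indices, as you might with `gpus`.
    # Device selection is handled by the underlying libraries.
    # trainer = pl.Trainer(ipus=[0, 2])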

Mixed Precision & 16-bit Precision
----------------------------------

Lightning also supports training in mixed precision with IPUs.
By default, IPU training uses 32-bit precision. To enable mixed precision,
set the precision flag.

.. note::
    Currently there is no dynamic scaling of the loss with mixed precision training.

.. code-block:: python

    import pytorch_lightning as pl

    model = MyLightningModule()
    trainer = pl.Trainer(ipus=8, precision=16)
    trainer.fit(model)

You can also use pure 16-bit training, where the weights themselves are kept in 16-bit precision.

.. code-block:: python

    import pytorch_lightning as pl

    model = MyLightningModule()
    model = model.half()  # cast the model weights to 16-bit precision
    trainer = pl.Trainer(ipus=8, precision=16)
    trainer.fit(model)

Advanced IPU Options
--------------------

IPUs provide further optimizations to speed up training. Using the ``IPUPlugin`` we can set the ``device_iterations``, which controls the number of iterations run directly on the IPU devices before returning to the host. Increasing the number of on-device iterations improves throughput, as less device-to-host communication is required.

.. note::

    When using model parallelism, it is a hard requirement to increase the number of device iterations to ensure we fully saturate the devices via micro-batching. See :ref:`ipu-model-parallelism` for more information.

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.plugins import IPUPlugin

    model = MyLightningModule()
    trainer = pl.Trainer(ipus=8, plugins=IPUPlugin(device_iterations=32))
    trainer.fit(model)

Note that by default we return the loss from the last device iteration. You can override this by passing in your own ``poptorch.Options`` and setting the ``anchorMode`` as described in the `poptorch documentation <https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/reference.html#poptorch.Options.anchorMode>`__.

.. code-block:: python

    import poptorch
    import pytorch_lightning as pl
    from pytorch_lightning.plugins import IPUPlugin

    model = MyLightningModule()
    inference_opts = poptorch.Options()
    inference_opts.deviceIterations(32)

    training_opts = poptorch.Options()
    training_opts.anchorMode(poptorch.AnchorMode.All)
    training_opts.deviceIterations(32)

    trainer = pl.Trainer(
        ipus=8,
        plugins=IPUPlugin(inference_opts=inference_opts, training_opts=training_opts)
    )
    trainer.fit(model)

You can also override all options by passing your own ``poptorch.Options`` to the plugin, as above. See the `poptorch options documentation <https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html>`_ for more information.

PopVision Graph Analyser
------------------------

.. figure:: ../_static/images/accelerator/ipus/profiler.png
    :alt: PopVision Graph Analyser
    :width: 500

Lightning supports integration with the `PopVision Graph Analyser Tool <https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/popvision.html>`__. This helps you inspect the utilization of IPU devices and provides helpful metrics during the lifecycle of your trainer. Once you have gained access, the PopVision Graph Analyser Tool can be downloaded via the `GraphCore download website <https://downloads.graphcore.ai/>`__.

Lightning supports dumping all reports to a directory, to be opened using the tool.

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.plugins import IPUPlugin

    model = MyLightningModule()
    trainer = pl.Trainer(ipus=8, plugins=IPUPlugin(autoreport_dir='report_dir/'))
    trainer.fit(model)

This will dump all reports to ``report_dir/``, which can then be opened using the Graph Analyser Tool; see `Opening Reports <https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html#opening-reports>`__.
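
If you prefer to configure report generation outside the plugin, the Poplar SDK also reads engine options from the ``POPLAR_ENGINE_OPTIONS`` environment variable. The option keys below follow Graphcore's profiling documentation, but treat this as a sketch to verify against your SDK version.

.. code-block:: python

    import os

    # Assumed alternative to `autoreport_dir`: ask Poplar to emit all profiling
    # reports into `report_dir/`. Must be set before the IPU session is created.
    os.environ["POPLAR_ENGINE_OPTIONS"] = '{"autoReport.all": "true", "autoReport.directory": "report_dir/"}'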

.. _ipu-model-parallelism:

Model Parallelism
-----------------

Due to the IPU architecture, larger models should be parallelized across IPUs by design. Currently poptorch provides these capabilities via annotations, as described in `Parallel Execution <https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/overview.html#id1>`__.

Below is an example using the block annotation in a LightningModule.

.. note::

    Currently, when using model parallelism we do not infer the number of IPUs required for you; this is determined by the annotations themselves. If you specify 4 different IDs when defining blocks, your model will be split onto 4 different IPUs.

    This multiplies with the Trainer flag: if your model is split onto 2 IPUs and you set ``Trainer(ipus=4)``, 8 IPUs are required in total, replicating the model 4 times in data parallel.

    When pipelining the model you must also increase ``device_iterations`` to ensure the devices stay fully saturated with data, i.e. while one device in the pipeline processes a batch, the other devices can start on the next batches. For example, if the model is split onto 4 IPUs, ``device_iterations`` must be at least 4.

.. code-block:: python

    import torch
    import poptorch
    import pytorch_lightning as pl
    from pytorch_lightning.plugins import IPUPlugin

    class MyLightningModule(pl.LightningModule):

        def __init__(self):
            super().__init__()
            # This will place layer1, layer2+layer3, layer4 and softmax on different IPUs at runtime.
            # BeginBlock starts a new block with the given id for all layers within it.
            self.layer1 = poptorch.BeginBlock(torch.nn.Linear(5, 10), ipu_id=0)

            # This layer starts a new block,
            # adding subsequent layers to the current block at runtime
            # until the next block is declared
            self.layer2 = poptorch.BeginBlock(torch.nn.Linear(10, 5), ipu_id=1)
            self.layer3 = torch.nn.Linear(5, 5)

            # Create new blocks
            self.layer4 = poptorch.BeginBlock(torch.nn.Linear(5, 5), ipu_id=2)
            self.softmax = poptorch.BeginBlock(torch.nn.Softmax(dim=1), ipu_id=3)

        ...

    model = MyLightningModule()
    trainer = pl.Trainer(ipus=8, plugins=IPUPlugin(device_iterations=20))
    trainer.fit(model)


You can also use the block context manager within the forward function, or any of the step functions.

.. code-block:: python

    import torch
    import poptorch
    import pytorch_lightning as pl
    from pytorch_lightning.plugins import IPUPlugin

    class MyLightningModule(pl.LightningModule):

        def __init__(self):
            super().__init__()
            self.layer1 = torch.nn.Linear(5, 10)
            self.layer2 = torch.nn.Linear(10, 5)
            self.layer3 = torch.nn.Linear(5, 5)
            self.layer4 = torch.nn.Linear(5, 5)

            self.act = torch.nn.ReLU()
            self.softmax = torch.nn.Softmax(dim=1)

        def forward(self, x):

            with poptorch.Block(ipu_id=0):
                x = self.act(self.layer1(x))

            with poptorch.Block(ipu_id=1):
                x = self.act(self.layer2(x))

            with poptorch.Block(ipu_id=2):
                x = self.act(self.layer3(x))
                x = self.act(self.layer4(x))

            with poptorch.Block(ipu_id=3):
                x = self.softmax(x)
            return x

        ...

    model = MyLightningModule()
    trainer = pl.Trainer(ipus=8, plugins=IPUPlugin(device_iterations=20))
    trainer.fit(model)


.. _known-limitations:

Known Limitations
-----------------

Currently there are some known limitations that are being addressed in the near future to make the experience seamless when moving between devices.

Please see the `MNIST example <https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/ipu_examples/mnist.py>`__, which demonstrates most of the limitations and how to work around them until they are resolved.

* ``self.log`` is not supported in ``training_step``, ``validation_step``, ``test_step`` or ``predict_step``. This is because the step functions are traced and sent to the IPU devices. We're actively working on fixing this; a possible workaround is sketched after this list.
* Multiple optimizers are not supported; ``training_step`` only supports returning one loss.
* Since the step functions are traced, branching logic and primitive values are traced into constants. Be mindful, as this could lead to errors in your custom code.
* Clipping gradients is not supported.
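
As a workaround for the first limitation, metrics can be returned from the traced step and logged from an epoch-level hook, which runs on the host. This is a minimal sketch rather than a confirmed pattern: the module is hypothetical, and whether a given hook runs outside the traced graph should be verified against your Lightning and poptorch versions.

.. code-block:: python

    import torch
    import pytorch_lightning as pl


    class IPUFriendlyModule(pl.LightningModule):  # hypothetical example module

        def validation_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = torch.nn.functional.cross_entropy(logits, y)
            # Do NOT call self.log here: this function is traced for the IPU.
            # Return the value so it can be aggregated on the host instead.
            return loss

        def validation_epoch_end(self, outputs):
            # Runs on the host after the traced steps, so logging is safe here.
            self.log("val_loss", torch.stack(outputs).mean())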

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -118,6 +118,7 @@ PyTorch Lightning Documentation
    advanced/training_tricks
    advanced/pruning_quantization
    advanced/transfer_learning
+   advanced/ipu
    advanced/tpu
    common/test_set
    common/production_inference
