Commit 71928c7 ("Update")
1 parent d61392e
1 file changed

recipes_source/distributed_rpc_profiling.rst

Lines changed: 43 additions & 43 deletions
@@ -3,9 +3,9 @@ Profiling PyTorch RPC-Based Workloads

 In this recipe, you will learn:

-- An overview of the Distributed RPC Framework
-- An overview of the PyTorch Profiler
-- How to use the PyTorch Profiler to profile RPC-based workloads
+- An overview of the `Distributed RPC Framework`_
+- An overview of the `PyTorch Profiler`_
+- How to use the profiler to profile RPC-based workloads

 Requirements
 ------------
@@ -18,19 +18,19 @@ available at `pytorch.org`_.

 What is the Distributed RPC Framework?
 ---------------------------------------

-The ** Distributed RPC Framework ** provides mechanisms for multi-machine model
+The **Distributed RPC Framework** provides mechanisms for multi-machine model
 training through a set of primitives to allow for remote communication, and a
 higher-level API to automatically differentiate models split across several machines.
-For this recipe, it would be helpful to be familiar with the Distributed RPC Framework
-as well as the tutorials.
+For this recipe, it would be helpful to be familiar with the `Distributed RPC Framework`_
+as well as the `RPC Tutorials`_.

 What is the PyTorch Profiler?
 ---------------------------------------
 The profiler is a context manager based API that allows for on-demand profiling of
 operators in a model's workload. The profiler can be used to analyze various aspects
 of a model including execution time, operators invoked, and memory consumption. For a
 detailed tutorial on using the profiler to profile a single-node model, please see the
-Profiler Recipe.
+`Profiler Recipe`_.

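To make the context-manager API concrete, here is a minimal single-node sketch. It is not part of this diff; it assumes the legacy ``torch.autograd.profiler`` namespace that the recipe's later snippets use:

::

    import torch
    import torch.autograd.profiler as profiler

    # Any operators executed inside the context manager are recorded.
    x, y = torch.randn(100, 100), torch.randn(100, 100)
    with profiler.profile() as prof:
        torch.add(x, y)
        torch.matmul(x, y)
    # Aggregate the recorded events by operator name and print a summary table.
    print(prof.key_averages().table())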

@@ -40,11 +40,11 @@ How to use the Profiler for RPC-based workloads

 The profiler supports profiling of calls made over RPC and allows the user to have a
 detailed view into the operations that take place on different nodes. To demonstrate an
 example of this, let's first set up the RPC framework. The below code snippet will initialize
-two RPC workers on the same host, named "worker0" and "worker1" respectively. The workers will
+two RPC workers on the same host, named ``worker0`` and ``worker1`` respectively. The workers will
 be spawned as subprocesses, and we set some environment variables required for proper
-initialization (see torch.distributed documentation for more details).
+initialization.

-.. code:: python3
+::

     import torch
     import torch.distributed.rpc as rpc
@@ -88,7 +88,7 @@ initialization (see torch.distributed documentation for more details).

 Running the above program should present you with the following output:

-..
+::

     DEBUG:root:worker0 successfully initialized RPC.
     DEBUG:root:worker1 successfully initialized RPC.

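The diff elides the body of the setup itself; the following is a minimal sketch of a two-worker initialization consistent with the output above. The port choice and logging setup are illustrative assumptions, not taken from the recipe:

::

    import logging
    import os

    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp

    logging.basicConfig(level=logging.DEBUG)

    def worker(rank, world_size):
        # Rendezvous endpoint the RPC agents use to find each other.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"  # illustrative port
        worker_name = f"worker{rank}"
        rpc.init_rpc(worker_name, rank=rank, world_size=world_size)
        logging.debug(f"{worker_name} successfully initialized RPC.")
        # Block until all outstanding RPCs across workers have completed.
        rpc.shutdown()

    if __name__ == "__main__":
        world_size = 2
        mp.spawn(worker, args=(world_size,), nprocs=world_size)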
@@ -97,7 +97,7 @@ Now that we have a skeleton setup of our RPC framework, we can move on to
 sending RPCs back and forth and using the profiler to obtain a view of what's
 happening under the hood. Let's add to the above "worker" function:

-..code:: python3
+::

     def worker(rank, world_size):
         # Above code omitted...

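The added body is truncated by the diff view; based on the description that follows, it plausibly resembles this sketch (the tensor shapes and the ``dst_worker_name`` variable are assumptions):

::

    import torch.autograd.profiler as profiler

    # Inside worker(), after RPC initialization; only rank 0 issues the RPCs.
    if rank == 0:
        dst_worker_name = "worker1"
        t1, t2 = torch.randn(5, 5), torch.randn(5, 5)  # illustrative shapes
        with profiler.profile() as prof:
            fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
            fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
            # The waits must happen inside the profiling scope so the RPCs
            # are measured end to end.
            fut1.wait()
            fut2.wait()
        print(prof.key_averages().table())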
@@ -115,14 +115,15 @@ happening under the hood. Let's add to the above "worker" function:

         print(prof.key_averages().table())

-The aformented code creates 2 RPCs, specifying torch.add and torch.mul, respectively,
-to be run with two random input tensors on worker 1. Since we use the rpc_async API,
-we are returned a torch.futures.Future object, which must be awaited for the result
+The aforementioned code creates 2 RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
+to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
+we are returned a ``torch.futures.Future`` object, which must be awaited for the result
 of the computation. Note that this wait must take place within the scope created by
 the profiling context manager in order for the RPC to be accurately profiled. Running
 the code with this new worker function should result in the following output:

-..
+::
+
     # Some columns are omitted for brevity, exact output subject to randomness
     ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
     Name                                                             Self CPU total % Self CPU total  CPU total %     CPU total       CPU time avg    Number of Calls Node ID
@@ -138,24 +139,24 @@ the code with this new worker function should result in the following output:
     ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
     Self CPU time total: 11.237ms

-Here we can see that the profiler has profiled our rpc_async calls made to worker 1
-from worker 0. In particular, the first 2 entries in the table show details (such as
+Here we can see that the profiler has profiled our ``rpc_async`` calls made to ``worker1``
+from ``worker0``. In particular, the first 2 entries in the table show details (such as
 the operator name, originating worker, and destination worker) about each RPC call made
-and the "CPU total" column indicates the end-to-end latency of the RPC call.
+and the ``CPU total`` column indicates the end-to-end latency of the RPC call.

 We also have visibility into the actual operators invoked remotely on worker 1 due to RPC.
-We can see operations that took place on worker 1 by checking the "Node ID" column. For
-example, we can interpret the row with name ::'rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul'
-as a `mul` operation taking place on the remote node, as a result of the RPC sent to worker 1
-from worker 0, specifying worker 1 to run the builtin `mul` operator on the input tensors.
+We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
+example, we can interpret the row with name ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul``
+as a ``mul`` operation taking place on the remote node, as a result of the RPC sent to ``worker1``
+from ``worker0``, specifying ``worker1`` to run the builtin ``mul`` operator on the input tensors.
 Note that names of remote operations are prefixed with the name of the RPC event that resulted
-in them. For example, remote operations corresponding to the ::rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
-call are prefixed with ::rpc_async#aten::mul(worker0 -> worker1).
+in them. For example, remote operations corresponding to the ``rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))``
+call are prefixed with ``rpc_async#aten::add(worker0 -> worker1)``.

 We can also use the profiler to gain insight into user-defined functions that are executed over RPC.
-For example, let's add the following to the above "worker" function:
+For example, let's add the following to the above ``worker`` function:

-..code:: python3
+::

     # Define somewhere outside of worker() func.
     def udf_with_ops():
@@ -165,7 +166,6 @@ For example, let's add the following to the above "worker" function:
         torch.add(t1, t2)
         torch.mul(t1, t2)

-..code::python3
     def worker(rank, world_size):
         # Above code omitted
         with profiler.profile() as p:

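The remainder of this block is elided by the diff; presumably it issues the RPC for the user-defined function, along the lines of this sketch (variable names carried over from the earlier sketch):

::

    # Continuation sketch of the elided lines inside worker(), rank 0 only.
    with profiler.profile() as p:
        fut = rpc.rpc_async(dst_worker_name, udf_with_ops)
        # As before, the wait must happen inside the profiling scope.
        fut.wait()
    print(p.key_averages().table())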
@@ -177,7 +177,8 @@ The above code creates a user-defined function that sleeps for 1 second, and the
 operators. Similar to what we've done above, we send an RPC to the remote worker, specifying it to
 run our user-defined function. Running this code should result in the following output:

-..
+::
+
     # Exact output subject to randomness
     -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
     Name                                                                 Self CPU total % Self CPU total  CPU total %     CPU total       CPU time avg    Number of Calls Node ID
@@ -194,14 +195,14 @@ run our user-defined function. Running this code should result in the following
     -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------

 Here we can see that the user-defined function has successfully been profiled with its name
-(rpc_async#udf_with_ops(worker0 -> worker1)), and has the CPU total time we would roughly expect
-(slightly greater than 1s given the sleep). Similar to the above profiling output, we can see the
+(``rpc_async#udf_with_ops(worker0 -> worker1)``), and has the CPU total time we would roughly expect
+(slightly greater than 1s given the ``sleep``). Similar to the above profiling output, we can see the
 remote operators that have been executed on worker 1 as part of executing this RPC request.

 Lastly, we can visualize remote execution using the tracing functionality provided by the profiler.
-Let's add the following code to the above "worker" function:
+Let's add the following code to the above ``worker`` function:

-..code:: python3
+::

     def worker(rank, world_size):
         # Above code omitted

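The tracing code elided here presumably ends by exporting a Chrome trace; a sketch (the output path is an illustrative assumption):

::

    # Continuation sketch inside worker(), rank 0 only.
    with profiler.profile() as p:
        fut = rpc.rpc_async(dst_worker_name, udf_with_ops)
        fut.wait()
    # Writes a trace file that can be loaded at chrome://tracing.
    p.export_chrome_trace("/tmp/rpc_trace.json")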
@@ -217,11 +218,11 @@ the following:
     :scale: 25 %

 As we can see, we have traced our RPC requests and can also visualize traces of the remote operations,
-in this case, given in the trace column for "node_id: 1".
+in this case, given in the trace column for ``node_id: 1``.

 Putting it all together, we have the following code for this recipe:

-..code:: python3
+::

     import torch
     import torch.distributed.rpc as rpc
@@ -298,13 +299,12 @@ Learn More

 - `pytorch.org`_ for installation instructions, and more documentation
   and tutorials.
-- `Introduction to TorchScript tutorial`_ for a deeper initial
-  exposition of TorchScript
-- `Full TorchScript documentation`_ for complete TorchScript language
-  and API reference
+- `Distributed RPC Framework`_ for RPC framework and API reference.
+- `Full profiler documentation`_ for the complete profiler API reference.

 .. _pytorch.org: https://pytorch.org/
-.. _Introduction to TorchScript tutorial: https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html
-.. _Full TorchScript documentation: https://pytorch.org/docs/stable/jit.html
-.. _Loading A TorchScript Model in C++ tutorial: https://pytorch.org/tutorials/advanced/cpp_export.html
-.. _full TorchScript documentation: https://pytorch.org/docs/stable/jit.html
+.. _Full profiler documentation: https://pytorch.org/docs/stable/autograd.html#profiler
+.. _PyTorch Profiler: https://pytorch.org/docs/stable/autograd.html#profiler
+.. _Distributed RPC Framework: https://pytorch.org/docs/stable/rpc.html
+.. _RPC Tutorials: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html
+.. _Profiler Recipe: https://pytorch.org/tutorials/recipes/recipes/profiler.html
