@@ -3,9 +3,9 @@ Profiling PyTorch RPC-Based Workloads

In this recipe, you will learn:

- - An overview of the Distributed RPC Framework
- - An overview of the PyTorch Profiler
- - How to use the PyTorch Profiler to profile RPC-based workloads
+ - An overview of the `Distributed RPC Framework`_
+ - An overview of the `PyTorch Profiler`_
+ - How to use the profiler to profile RPC-based workloads

Requirements
------------
@@ -18,19 +18,19 @@ available at `pytorch.org`_.
What is the Distributed RPC Framework?
---------------------------------------

- The ** Distributed RPC Framework ** provides mechanisms for multi-machine model
+ The **Distributed RPC Framework** provides mechanisms for multi-machine model
training through a set of primitives to allow for remote communication, and a
higher-level API to automatically differentiate models split across several machines.
- For this recipe, it would be helpful to be familiar with the Distributed RPC Framework
- as well as the tutorials.
+ For this recipe, it would be helpful to be familiar with the `Distributed RPC Framework`_
+ as well as the `RPC Tutorials`_.

What is the PyTorch Profiler?
---------------------------------------
The profiler is a context manager based API that allows for on-demand profiling of
operators in a model's workload. The profiler can be used to analyze various aspects
of a model including execution time, operators invoked, and memory consumption. For a
detailed tutorial on using the profiler to profile a single-node model, please see the
- Profiler Recipe.
+ `Profiler Recipe`_.

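
As a quick refresher before moving to the distributed case, here is a minimal single-node example of the context-manager API described above; the tensor shape and the ``sort_by`` column are arbitrary illustrative choices, not taken from the recipe:

::

    import torch
    from torch.autograd import profiler

    x = torch.randn(64, 64)
    # Profile a single local operation; no RPC involved yet.
    with profiler.profile() as prof:
        torch.matmul(x, x)

    # Aggregate events by name and print a summary table.
    print(prof.key_averages().table(sort_by="cpu_time_total"))
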
@@ -40,11 +40,11 @@ How to use the Profiler for RPC-based workloads
The profiler supports profiling of calls made over RPC and allows the user to have a
detailed view into the operations that take place on different nodes. To demonstrate an
example of this, let's first set up the RPC framework. The code snippet below will initialize
- two RPC workers on the same host, named "worker0" and "worker1" respectively. The workers will
+ two RPC workers on the same host, named ``worker0`` and ``worker1`` respectively. The workers will
be spawned as subprocesses, and we set some environment variables required for proper
- initialization (see torch.distributed documentation for more details).
+ initialization.

- .. code:: python3
+ ::

    import torch
    import torch.distributed.rpc as rpc
@@ -88,7 +88,7 @@ initialization (see torch.distributed documentation for more details).

Running the above program should present you with the following output:

- ..
+ ::

    DEBUG:root:worker0 successfully initialized RPC.
    DEBUG:root:worker1 successfully initialized RPC.
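
For readers following along outside of the diff, below is a minimal, self-contained sketch of the kind of two-worker setup described above. The master address, port number, and logging calls are illustrative assumptions rather than the recipe's exact code:

::

    import os
    import logging

    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp

    logging.basicConfig(level=logging.DEBUG)

    def worker(rank, world_size):
        # Rendezvous settings consumed during RPC initialization;
        # "localhost" and the port number are arbitrary choices here.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        worker_name = f"worker{rank}"
        # Join the RPC group under a unique name; blocks until all ranks arrive.
        rpc.init_rpc(worker_name, rank=rank, world_size=world_size)
        logging.debug(f"{worker_name} successfully initialized RPC.")
        # Block until every worker is done issuing RPCs, then tear down.
        rpc.shutdown()

    if __name__ == "__main__":
        # Spawn the two workers as subprocesses on the same host.
        mp.spawn(worker, args=(2,), nprocs=2, join=True)
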
@@ -97,7 +97,7 @@ Now that we have a skeleton setup of our RPC framework, we can move on to
sending RPCs back and forth and using the profiler to obtain a view of what's
happening under the hood. Let's add to the above "worker" function:

- ..code:: python3
+ ::

    def worker(rank, world_size):
        # Above code omitted...
@@ -115,14 +115,15 @@ happening under the hood. Let's add to the above "worker" function:

        print(prof.key_averages().table())

- The aformented code creates 2 RPCs, specifying torch.add and torch.mul, respectively,
- to be run with two random input tensors on worker 1. Since we use the rpc_async API,
- we are returned a torch.futures.Future object, which must be awaited for the result
+ The aforementioned code creates two RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
+ to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
+ we are returned a ``torch.futures.Future`` object, which must be awaited for the result
of the computation. Note that this wait must take place within the scope created by
the profiling context manager in order for the RPC to be accurately profiled. Running
the code with this new worker function should result in the following output:

- ..
+ ::
+
    # Some columns are omitted for brevity, exact output subject to randomness
    ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
    Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls Node ID
@@ -138,24 +139,24 @@ the code with this new worker function should result in the following output:
    ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
    Self CPU time total: 11.237ms

- Here we can see that the profiler has profiled our rpc_async calls made to worker 1
- from worker 0. In particular, the first 2 entries in the table show details (such as
+ Here we can see that the profiler has profiled our ``rpc_async`` calls made to ``worker1``
+ from ``worker0``. In particular, the first 2 entries in the table show details (such as
the operator name, originating worker, and destination worker) about each RPC call made
- and the "CPU total" column indicates the end-to-end latency of the RPC call.
+ and the ``CPU total`` column indicates the end-to-end latency of the RPC call.

We also have visibility into the actual operators invoked remotely on worker 1 due to RPC.
- We can see operations that took place on worker 1 by checking the "Node ID" column. For
- example, we can interpret the row with name ::'rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul'
- as a `mul` operation taking place on the remote node, as a result of the RPC sent to worker 1
- from worker 0, specifying worker 1 to run the builtin `mul` operator on the input tensors.
+ We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
+ example, we can interpret the row with name ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul``
+ as a ``mul`` operation taking place on the remote node, as a result of the RPC sent to ``worker1``
+ from ``worker0``, specifying ``worker1`` to run the builtin ``mul`` operator on the input tensors.
Note that names of remote operations are prefixed with the name of the RPC event that resulted
- in them. For example, remote operations corresponding to the :: rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
- call are prefixed with :: rpc_async#aten::mul(worker0 -> worker1).
+ in them. For example, remote operations corresponding to the ``rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))``
+ call are prefixed with ``rpc_async#aten::mul(worker0 -> worker1)``.

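
If you want to work with these results programmatically rather than reading the printed table, the averaged events can be iterated over directly. The following is an illustrative sketch, not part of the original recipe; it assumes the event objects expose ``key``, ``node_id``, and ``cpu_time_total`` attributes, which is how the profiler reports the per-node information shown in the table above:

::

    # Hypothetical helper: group averaged events by the node they ran on,
    # so remote operations (e.g. node_id 1) can be separated from local ones.
    def ops_by_node(prof):
        summary = {}
        for evt in prof.key_averages():
            # evt.cpu_time_total is reported in microseconds.
            summary.setdefault(evt.node_id, []).append((evt.key, evt.cpu_time_total))
        return summary

    # Example usage inside worker(), after the profiling block:
    # for node, ops in ops_by_node(prof).items():
    #     print(f"node {node}: {len(ops)} ops")
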
We can also use the profiler to gain insight into user-defined functions that are executed over RPC.
- For example, let's add the following to the above "worker" function:
+ For example, let's add the following to the above ``worker`` function:

- ..code:: python3
+ ::

    # Define somewhere outside of worker() func.
    def udf_with_ops():
@@ -165,7 +166,6 @@ For example, let's add the following to the above "worker" function:
        torch.add(t1, t2)
        torch.mul(t1, t2)

- ..code::python3
    def worker(rank, world_size):
        # Above code omitted
        with profiler.profile() as p:
@@ -177,7 +177,8 @@ The above code creates a user-defined function that sleeps for 1 second, and the
operators. Similar to what we've done above, we send an RPC to the remote worker, specifying it to
run our user-defined function. Running this code should result in the following output:

- ..
+ ::
+
    # Exact output subject to randomness
    -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
    Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls Node ID
@@ -194,14 +195,14 @@ run our user-defined function. Running this code should result in the following
    -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------

Here we can see that the user-defined function has successfully been profiled with its name
- (rpc_async#udf_with_ops(worker0 -> worker1)), and has the CPU total time we would roughly expect
- (slightly greater than 1s given the sleep). Similar to the above profiling output, we can see the
+ (``rpc_async#udf_with_ops(worker0 -> worker1)``), and has the CPU total time we would roughly expect
+ (slightly greater than 1s given the ``sleep``). Similar to the above profiling output, we can see the
remote operators that have been executed on worker 1 as part of executing this RPC request.

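
To check that number without scanning the table by eye, one can filter the averaged events for the user-defined function's name. This is an illustrative sketch rather than part of the recipe; it assumes it runs inside ``worker()`` after the profiling block, where ``p`` is the profiler object, and relies on the same ``key``/``cpu_time_total`` event attributes used above:

::

    # Pull out only the events related to udf_with_ops and report their
    # end-to-end cost in seconds (cpu_time_total is in microseconds).
    udf_events = [e for e in p.key_averages() if "udf_with_ops" in e.key]
    for e in udf_events:
        print(f"{e.key}: {e.cpu_time_total / 1e6:.3f}s")
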
Lastly, we can visualize remote execution using the tracing functionality provided by the profiler.
- Let's add the following code to the above "worker" function:
+ Let's add the following code to the above ``worker`` function:

- ..code:: python3
+ ::

    def worker(rank, world_size):
        # Above code omitted
@@ -217,11 +218,11 @@ the following:
   :scale: 25 %

As we can see, we have traced our RPC requests and can also visualize traces of the remote operations,
- in this case, given in the trace column for "node_id: 1".
+ in this case, given in the trace column for ``node_id: 1``.

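
For reference, the tracing step usually boils down to a single call on the profiler object once profiling has finished. The sketch below is illustrative; the output filename is an arbitrary choice and not taken from the recipe:

::

    # Inside worker(), after the `with profiler.profile() as p:` block:
    # write all collected events (local and remote) to a Chrome trace file,
    # which can be opened in chrome://tracing or Perfetto for visualization.
    p.export_chrome_trace(f"/tmp/trace_worker{rank}.json")
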
Putting it all together, we have the following code for this recipe:

- ..code:: python3
+ ::

    import torch
    import torch.distributed.rpc as rpc
@@ -298,13 +299,12 @@ Learn More

- `pytorch.org`_ for installation instructions, and more documentation
  and tutorials.
- - `Introduction to TorchScript tutorial`_ for a deeper initial
-   exposition of TorchScript
- - `Full TorchScript documentation`_ for complete TorchScript language
-   and API reference
+ - `Distributed RPC Framework`_ for RPC framework and API reference.
+ - `Full profiler documentation`_ for the complete profiler API reference.

.. _pytorch.org: https://pytorch.org/
- .. _Introduction to TorchScript tutorial: https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html
- .. _Full TorchScript documentation: https://pytorch.org/docs/stable/jit.html
- .. _Loading A TorchScript Model in C++ tutorial: https://pytorch.org/tutorials/advanced/cpp_export.html
- .. _full TorchScript documentation: https://pytorch.org/docs/stable/jit.html
+ .. _Full profiler documentation: https://pytorch.org/docs/stable/autograd.html#profiler
+ .. _PyTorch Profiler: https://pytorch.org/docs/stable/autograd.html#profiler
+ .. _Distributed RPC Framework: https://pytorch.org/docs/stable/rpc.html
+ .. _RPC Tutorials: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html
+ .. _Profiler Recipe: https://pytorch.org/tutorials/recipes/recipes/profiler.html