Commit 110920f

fix typos
1 parent 755bd2f commit 110920f


intermediate_source/rpc_tutorial.rst

Lines changed: 38 additions & 31 deletions
@@ -5,7 +5,8 @@ Getting Started with Distributed RPC Framework

This tutorial uses two simple examples to demonstrate how to build distributed
training with the `torch.distributed.rpc <https://pytorch.org/docs/master/rpc.html>`__
-package. Source code of the two examples can be found in
+package, which was first introduced as an experimental feature in PyTorch v1.4.
+Source code of the two examples can be found in
`PyTorch examples <https://github.com/pytorch/examples>`__

`Previous <https://deploy-preview-807--pytorch-tutorials-preview.netlify.com/intermediate/ddp_tutorial.html>`__
@@ -189,7 +190,7 @@ there is any live user of that ``RRef``. Please refer to the ``RRef``


Next, the agent exposes two APIs to observers for selecting actions and
-reporting rewards. Those functions are only run locally on the agent, but will
+reporting rewards. Those functions only run locally on the agent, but will
be triggered by observers through RPC.


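The two agent-side APIs mentioned in the hunk above are ``select_action`` and
``report_reward``. As a rough, hypothetical sketch of how an observer can
trigger them over RPC (the tutorial's own helper and signatures may differ;
``Agent`` and ``agent_rref`` are assumed to come from the surrounding code):

    import torch.distributed.rpc as rpc

    def _call_method(method, rref, *args, **kwargs):
        # Runs `method` on the object held by `rref`; executes on its owner.
        return method(rref.local_value(), *args, **kwargs)

    def observer_step(agent_rref, ob_id, state, reward):
        # Called on an observer process, but both RPCs execute on the agent.
        action = rpc.rpc_sync(
            agent_rref.owner(),
            _call_method,
            args=(Agent.select_action, agent_rref, ob_id, state),
        )
        rpc.rpc_sync(
            agent_rref.owner(),
            _call_method,
            args=(Agent.report_reward, agent_rref, ob_id, reward),
        )
        return action
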
@@ -240,9 +241,10 @@ contain the recorded action probs and rewards.


Finally, after one episode, the agent needs to train the model, which
-is implemented in the ``finish_episode`` function below. It is also a local
-function and mostly borrowed from the single-thread
+is implemented in the ``finish_episode`` function below. There are no RPCs in
+this function, and it is mostly borrowed from the single-thread
`example <https://github.com/pytorch/examples/blob/master/reinforcement_learning>`__.
+Hence, we skip describing its contents.



@@ -285,14 +287,14 @@ With ``Policy``, ``Observer``, and ``Agent`` classes, we are ready to launch
multiple processes to perform the distributed training. In this example, all
processes run the same ``run_worker`` function, and they use the rank to
distinguish their role. Rank 0 is always the agent, and all other ranks are
-observers. As agent as server as master, repeatedly call ``run_episode`` and
+observers. The agent serves as master by repeatedly calling ``run_episode`` and
``finish_episode`` until the running reward surpasses the reward threshold
-specified by the environment. All observers just passively waiting for commands
+specified by the environment. All observers passively wait for commands
from the agent. The code is wrapped by
`rpc.init_rpc <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.init_rpc>`__ and
`rpc.shutdown <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.shutdown>`__,
which initializes and terminates RPC instances respectively. More details are
-available in the API page.
+available in the `API page <https://pytorch.org/docs/master/rpc.html>`__.


.. code:: python
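
As a hedged sketch of the launch pattern this hunk describes (worker names,
the port, and the loop body are placeholders, not the tutorial's exact code):

    import os
    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp

    def run_worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        if rank == 0:
            # Rank 0 is the agent, which drives training.
            rpc.init_rpc("agent", rank=rank, world_size=world_size)
            # ... build the Agent, then repeatedly call run_episode /
            # finish_episode until the running reward passes the threshold.
        else:
            # Other ranks are observers that passively wait for RPCs.
            rpc.init_rpc(f"observer{rank}", rank=rank, world_size=world_size)
        # Block until all RPC activity on every worker has finished.
        rpc.shutdown()

    if __name__ == "__main__":
        world_size = 2
        mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
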
@@ -375,33 +377,35 @@ Below are some sample outputs when training with `world_size=2`.


In this example, we show how to use RPC as the communication vehicle to pass
-date across workers, and how to use RRef to reference remote objects. It is true
+data across workers, and how to use RRef to reference remote objects. It is true
that you could build the entire structure directly on top of ``ProcessGroup``
``send`` and ``recv`` APIs or use other communication/RPC libraries. However,
-by using `torch.distributed.rpc`, you can get the native support plus
+by using `torch.distributed.rpc`, you can get the native support and
continuously optimized performance under the hood.

Next, we will show how to combine RPC and RRef with distributed autograd and
distributed optimizer to perform distributed model parallel training.



-
Distributed RNN using Distributed Autograd and Distributed Optimizer
--------------------------------------------------------------------

In this section, we use an RNN model to show how to build distributed model
-parallel training using the RPC API. The example RNN model is very small and
-easily fit into a single GPU, but developer can apply the similar techniques to
-much larger models that need to span multiple devices. The RNN model design is
-borrowed from the word language model in PyTorch
+parallel training with the RPC API. The example RNN model is very small and
+can easily fit into a single GPU, but we still divide its layers onto two
+different workers to demonstrate the idea. Developers can apply similar
+techniques to distribute much larger models across multiple devices and
+machines.
+
+The RNN model design is borrowed from the word language model in PyTorch
`example <https://github.com/pytorch/examples/tree/master/word_language_model>`__
repository, which contains three main components, an embedding table, an
``LSTM`` layer, and a decoder. The code below wraps the embedding table and the
decoder into sub-modules, so that their constructors can be passed to the RPC
API. In the `EmbeddingTable` sub-module, we intentionally put the `Embedding`
-layer on GPU to demonstrate the use case. In v1.4, RPC always creates CPU tensor
-arguments or return values on the destination server. If the function takes a
+layer on GPU to cover the use case. In v1.4, RPC always creates CPU tensor
+arguments or return values on the destination worker. If the function takes a
GPU tensor, you need to move it to the proper device explicitly.


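To illustrate the CPU-tensor point above, a minimal sketch of an embedding
sub-module that explicitly moves the RPC-delivered input onto its GPU
(constructor arguments are illustrative, not the tutorial's exact signature):

    import torch.nn as nn

    class EmbeddingTable(nn.Module):
        def __init__(self, ntoken, ninp, dropout=0.5):
            super().__init__()
            self.drop = nn.Dropout(dropout)
            # The embedding weights live on the GPU of the owning worker.
            self.encoder = nn.Embedding(ntoken, ninp).cuda()

        def forward(self, inp):
            # `inp` arrives over RPC as a CPU tensor: move it to the GPU,
            # and return a CPU tensor so it can be sent back over RPC.
            return self.drop(self.encoder(inp.cuda())).cpu()
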
@@ -437,7 +441,7 @@ With the above sub-modules, we can now piece them together using RPC to
create an RNN model. In the code below ``ps`` represents a parameter server,
which hosts parameters of the embedding table and the decoder. The constructor
uses the `remote <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.remote>`__
-API to create an `EmbeddingTable` and a `Decoder` object on the parameter
+API to create an `EmbeddingTable` object and a `Decoder` object on the parameter
server, and locally creates the ``LSTM`` sub-module. During the forward pass,
the trainer uses the ``EmbeddingTable`` ``RRef`` to find the remote sub-module
and passes the input data to the ``EmbeddingTable`` using RPC and fetches the
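
A rough sketch of the constructor/forward split this hunk describes, reusing
the ``EmbeddingTable`` sketched earlier; ``Decoder`` and the exact signatures
are assumptions, not the tutorial's actual code:

    import torch.nn as nn
    import torch.distributed.rpc as rpc

    def _remote_forward(module_rref, *args):
        # Executes on the owner of `module_rref` (the parameter server).
        return module_rref.local_value().forward(*args)

    class RNNModel(nn.Module):
        def __init__(self, ps, ntoken, ninp, nhid, nlayers):
            super().__init__()
            # Create the embedding table and decoder remotely on the
            # parameter server `ps`, keeping only RRefs on the trainer.
            self.emb_table_rref = rpc.remote(ps, EmbeddingTable, args=(ntoken, ninp))
            self.decoder_rref = rpc.remote(ps, Decoder, args=(ntoken, nhid))
            # The LSTM layer is created locally on the trainer.
            self.rnn = nn.LSTM(ninp, nhid, nlayers)

        def forward(self, inp, hidden):
            # Embed on the parameter server, run the LSTM locally, then
            # decode on the parameter server again.
            emb = rpc.rpc_sync(self.emb_table_rref.owner(), _remote_forward,
                               args=(self.emb_table_rref, inp))
            output, hidden = self.rnn(emb, hidden)
            decoded = rpc.rpc_sync(self.decoder_rref.owner(), _remote_forward,
                                   args=(self.decoder_rref, output))
            return decoded, hidden
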
@@ -475,12 +479,13 @@ Before introducing the distributed optimizer, let's add a helper function to
generate a list of RRefs of model parameters, which will be consumed by the
distributed optimizer. In local training, applications could call
``Module.parameters()`` to grab references to all parameter tensors, and pass it
-to the local optimizer to update. However, the same API does not work in
-the distributed training scenarios as some parameters live on remote machines.
-Therefore, instead of taking a list of parameter ``Tensors``, the distributed
-optimizer takes a list of ``RRefs``, one ``RRef`` per model parameter for both
-local and remote parameters. The helper function is pretty simple, just call
-``Module.parameters()`` and creates a local ``RRef`` on each of the parameters.
+to the local optimizer for subsequent updates. However, the same API does not
+work in distributed training scenarios as some parameters live on remote
+machines. Therefore, instead of taking a list of parameter ``Tensors``, the
+distributed optimizer takes a list of ``RRefs``, one ``RRef`` per model
+parameter for both local and remote model parameters. The helper function is
+pretty simple: just call ``Module.parameters()`` and create a local ``RRef`` on
+each of the parameters.


.. code:: python
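
The helper described above reduces to a few lines; a minimal sketch, assuming
``RRef`` is imported from ``torch.distributed.rpc``:

    from torch.distributed.rpc import RRef

    def _parameter_rrefs(module):
        # One RRef per parameter, so local and remote parameters can be
        # handed to the distributed optimizer uniformly.
        return [RRef(p) for p in module.parameters()]
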
@@ -511,7 +516,7 @@ Then, as the ``RNNModel`` contains three sub-modules, we need to call
return remote_params


-Now, we are ready to implement the training loop. After initializing the model
+Now, we are ready to implement the training loop. After initializing model
arguments, we create the ``RNNModel`` and the ``DistributedOptimizer``. The
distributed optimizer will take a list of parameter ``RRefs``, find all distinct
owner workers, and create the given local optimizer (i.e., ``SGD`` in this case,
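
A sketch of constructing the distributed optimizer from those parameter
``RRefs``; ``model.parameter_rrefs()`` is an assumed aggregating method and
model creation is elided:

    import torch.optim as optim
    from torch.distributed.optim import DistributedOptimizer

    opt = DistributedOptimizer(
        optim.SGD,                # local optimizer class, created on each owner
        model.parameter_rrefs(),  # one RRef per local or remote model parameter
        lr=0.05,                  # forwarded to every local SGD instance
    )
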
@@ -520,17 +525,19 @@ the given arguments (i.e., ``lr=0.05``).

In the training loop, it first creates a distributed autograd context, which
will help the distributed autograd engine to find gradients and involved RPC
-send/recv functions. Then, it kicks off the forward pass as if it is a local
+send/recv functions. The design details of the distributed autograd engine can
+be found in its `design note <https://pytorch.org/docs/master/notes/distributed_autograd.html>`__.
+Then, it kicks off the forward pass as if it is a local
model, and run the distributed backward pass. For the distributed backward, you
only need to specify a list of roots, in this case, it is the loss ``Tensor``.
The distributed autograd engine will traverse the distributed graph
automatically and write gradients properly. Next, it runs the ``step``
-API on the distributed optimizer, which will reach out to all involved local
-optimizers to update model parameters. Compared to local training, one minor
-difference is that you don't need to run ``zero_grad()`` because each autograd
-context has dedicated space to store gradients, and as we create a context
-per iteration, those gradients from different iterations will not accumulate to
-the same set of ``Tensors``.
+function on the distributed optimizer, which will reach out to all involved
+local optimizers to update model parameters. Compared to local training, one
+minor difference is that you don't need to run ``zero_grad()`` because each
+autograd context has dedicated space to store gradients, and as we create a
+context per iteration, those gradients from different iterations will not
+accumulate to the same set of ``Tensors``.


.. code:: python
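
A hedged sketch of one iteration of the loop described above, written against
the current distributed autograd API where ``backward`` and ``step`` take the
context id (the v1.4 experimental API omitted it); ``model``, ``opt``,
``criterion``, ``data``, ``target``, and the initial ``hidden`` state are
assumed from the tutorial:

    import torch.distributed.autograd as dist_autograd

    with dist_autograd.context() as context_id:
        # The forward pass looks exactly like local training.
        output, hidden = model(data, hidden)
        loss = criterion(output, target)
        # The distributed backward pass only needs the list of root tensors.
        dist_autograd.backward(context_id, [loss])
        # No zero_grad(): gradients are stored in this per-iteration context.
        opt.step(context_id)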
