
Commit 7d71285

Merge branch 'master' into jingxu10
2 parents a1233ce + 4e1d8d2 commit 7d71285

File tree

6 files changed: +132 -133 lines changed


advanced_source/rpc_ddp_tutorial.rst

Lines changed: 70 additions & 69 deletions
Combining Distributed DataParallel with Distributed RPC Framework
=================================================================
**Authors**: `Pritam Damania <https://github.com/pritamdamania87>`_ and `Yi Wang <https://github.com/SciPioneer>`_


This tutorial uses a simple example to demonstrate how you can combine
`DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__ (DDP)
with the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
to combine distributed data parallelism with distributed model parallelism when
training a simple model. Source code of the example can be found `here <https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc>`__.

Previous tutorials,
`Getting Started With Distributed Data Parallel <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`__
and `Getting Started with Distributed RPC Framework <https://pytorch.org/tutorials/intermediate/rpc_tutorial.html>`__,
described how to perform distributed data parallel and distributed model
parallel training respectively. However, there are several training paradigms
where you might want to combine these two techniques. For example:

1) If we have a model with a sparse part (large embedding table) and a dense
   part (FC layers), we might want to put the embedding table on a parameter
   server and replicate the FC layer across multiple trainers using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
   The `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
   can be used to perform embedding lookups on the parameter server.
2) Enable hybrid parallelism as described in the `PipeDream <https://arxiv.org/abs/1806.03377>`__ paper.
   We can use the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
   to pipeline stages of the model across multiple workers and replicate each
   stage (if needed) using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.

|
In this tutorial we will cover case 1 mentioned above. We have a total of 4
workers in our setup as follows:


1) 1 Master, which is responsible for creating an embedding table
   (nn.EmbeddingBag) on the parameter server. The master also drives the
   training loop on the two trainers.
2) 1 Parameter Server, which holds the embedding table in memory and
   responds to RPCs from the Master and Trainers.
3) 2 Trainers, which store an FC layer (nn.Linear) that is replicated amongst
   themselves using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
   The trainers are also responsible for executing the forward pass, backward
   pass and optimizer step.

|
The entire training process is executed as follows:

1) The master creates a `RemoteModule <https://pytorch.org/docs/master/rpc.html#remotemodule>`__
   that holds an embedding table on the Parameter Server.
2) The master then kicks off the training loop on the trainers and passes the
   remote module to the trainers.
3) The trainers create a ``HybridModel`` which first performs an embedding lookup
   using the remote module provided by the master and then executes the
   FC layer which is wrapped inside DDP.
4) The trainer executes the forward pass of the model and uses the loss to
   execute the backward pass using `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__.
5) As part of the backward pass, the gradients for the FC layer are computed
   first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients to the parameter server,
   where the gradients for the embedding table are updated.
7) Finally, the `Distributed Optimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__ is used to update all the parameters.


.. attention::

   You should always use `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__
   for the backward pass if you're combining DDP and RPC.

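To make the attention note concrete, the backward pass on a trainer is wrapped in a distributed autograd context roughly as in the sketch below. This is only an illustration: ``model``, ``loss_fn``, ``opt``, ``indices``, ``offsets``, and ``target`` are placeholder names, not identifiers from the example's source.

.. code:: python

   import torch.distributed.autograd as dist_autograd

   # Every forward/backward pair on a trainer runs inside a distributed
   # autograd context so that gradients can flow back over RPC as well.
   with dist_autograd.context() as context_id:
       output = model(indices, offsets)
       loss = loss_fn(output, target)
       # Use dist_autograd.backward instead of loss.backward(): DDP syncs the
       # FC gradients via allreduce, while RPC carries the embedding gradients
       # back to the parameter server.
       dist_autograd.backward(context_id, [loss])
       # The DistributedOptimizer applies updates within the same context.
       opt.step(context_id)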

Now, let's go through each part in detail. First, we need to set up all of our
workers before we can perform any training. We create 4 processes such that
ranks 0 and 1 are our trainers, rank 2 is the master and rank 3 is the
parameter server.

We initialize the RPC framework on all 4 workers using the TCP init_method.
Once RPC initialization is done, the master creates a remote module that holds an `EmbeddingBag <https://pytorch.org/docs/master/generated/torch.nn.EmbeddingBag.html>`__
layer on the Parameter Server using `RemoteModule <https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule>`__.
The master then loops through each trainer and kicks off the training loop by
calling ``_run_trainer`` on each trainer using `rpc_async <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.rpc_async>`__.
Finally, the master waits for all training to finish before exiting.

The trainers first initialize a ``ProcessGroup`` for DDP with world_size=2
(for two trainers) using `init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
Next, they initialize the RPC framework using the TCP init_method. Note that
the ports are different in RPC initialization and ProcessGroup initialization.
This is to avoid port conflicts between initialization of both frameworks.
Once the initialization is done, the trainers just wait for the ``_run_trainer``
RPC from the master.

The parameter server just initializes the RPC framework and waits for RPCs from
the trainers and master.


.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
   :language: py
   :start-after: BEGIN run_worker
   :end-before: END run_worker

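If the linked source is not at hand, the overall structure of ``run_worker`` looks roughly like the condensed sketch below. It is not the example's exact code: the worker names, port numbers, and embedding sizes are illustrative, and ``_run_trainer`` is the trainer entry point introduced later in this tutorial.

.. code:: python

   import torch
   import torch.distributed as dist
   import torch.distributed.rpc as rpc
   import torch.multiprocessing as mp
   from torch.distributed.nn.api.remote_module import RemoteModule

   NUM_EMBEDDINGS = 100  # illustrative sizes, not the example's actual values
   EMBEDDING_DIM = 16

   def run_worker(rank, world_size):
       # RPC and the DDP ProcessGroup use different ports to avoid conflicts.
       rpc_backend_options = rpc.TensorPipeRpcBackendOptions(
           init_method="tcp://localhost:29500"
       )

       if rank == 2:
           # Master: create the embedding table remotely on the parameter
           # server, then drive the training loop on both trainers.
           rpc.init_rpc("master", rank=rank, world_size=world_size,
                        rpc_backend_options=rpc_backend_options)
           remote_emb_module = RemoteModule(
               "ps",
               torch.nn.EmbeddingBag,
               args=(NUM_EMBEDDINGS, EMBEDDING_DIM),
               kwargs={"mode": "sum"},
           )
           futs = [
               rpc.rpc_async(f"trainer{trainer_rank}", _run_trainer,
                             args=(remote_emb_module, trainer_rank))
               for trainer_rank in [0, 1]
           ]
           for fut in futs:
               fut.wait()
       elif rank <= 1:
           # Trainers: DDP process group first (world_size=2), then RPC.
           dist.init_process_group(
               backend="gloo", rank=rank, world_size=2,
               init_method="tcp://localhost:29501",
           )
           rpc.init_rpc(f"trainer{rank}", rank=rank, world_size=world_size,
                        rpc_backend_options=rpc_backend_options)
           # Nothing else to do: the trainer waits for the _run_trainer RPC.
       else:
           # Parameter server: just participate in RPC and serve requests.
           rpc.init_rpc("ps", rank=rank, world_size=world_size,
                        rpc_backend_options=rpc_backend_options)

       rpc.shutdown()  # blocks until all RPC activity has finished


   if __name__ == "__main__":
       mp.spawn(run_worker, args=(4,), nprocs=4, join=True)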

Before we discuss details of the Trainer, let's introduce the ``HybridModel`` that
the trainer uses. As described below, the ``HybridModel`` is initialized using a
remote module that holds an embedding table (``remote_emb_module``) on the parameter server and the ``device``
to use for DDP. The initialization of the model wraps an
`nn.Linear <https://pytorch.org/docs/master/generated/torch.nn.Linear.html>`__
layer inside DDP to replicate and synchronize this layer across all trainers.

The forward method of the model is pretty straightforward. It performs an
embedding lookup on the parameter server using RemoteModule's ``forward``
and passes its output onto the FC layer.


.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
   :language: py
   :start-after: BEGIN hybrid_model
   :end-before: END hybrid_model

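The actual class lives in the linked ``main.py``; a self-contained sketch of its shape might look like the following, where the 16-by-8 layer size is illustrative and each trainer is assumed to have a GPU identified by ``device``.

.. code:: python

   import torch
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP


   class HybridModel(nn.Module):
       """
       Remote embedding lookup on the parameter server, followed by a local
       FC layer that is replicated and synchronized across trainers via DDP.
       """

       def __init__(self, remote_emb_module, device):
           super().__init__()
           self.remote_emb_module = remote_emb_module
           # Only the dense part lives locally and is wrapped in DDP.
           self.fc = DDP(nn.Linear(16, 8).cuda(device), device_ids=[device])
           self.device = device

       def forward(self, indices, offsets):
           # RemoteModule.forward runs the EmbeddingBag on the parameter server.
           emb_lookup = self.remote_emb_module.forward(indices, offsets)
           return self.fc(emb_lookup.cuda(self.device))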

Next, let's look at the setup on the Trainer. The trainer first creates the
``HybridModel`` described above using a remote module that holds the embedding table on the
parameter server and its own rank.

Now, we need to retrieve a list of RRefs to all the parameters that we would
like to optimize with `DistributedOptimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__.
To retrieve the parameters for the embedding table from the parameter server,
we can call RemoteModule's `remote_parameters <https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule.remote_parameters>`__,
which walks through all the parameters for the embedding table and returns
a list of RRefs. The trainer calls this method on the parameter server via RPC
to receive a list of RRefs to the desired parameters. Since the
DistributedOptimizer always takes a list of RRefs to parameters that need to
be optimized, we need to create RRefs even for the local parameters of our
FC layers. This is done by walking ``model.fc.parameters()``, creating an RRef for
each parameter and appending it to the list returned from ``remote_parameters()``.
Note that we cannot use ``model.parameters()``,
because it will recursively call ``model.remote_emb_module.parameters()``,
which is not supported by ``RemoteModule``.

Finally, we create our DistributedOptimizer using all the RRefs and define a
CrossEntropyLoss function.

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
   :language: py
   :start-after: BEGIN setup_trainer
   :end-before: END setup_trainer

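Putting the pieces above together, the setup portion of ``_run_trainer`` can be sketched roughly as below; the choice of SGD and the learning rate are illustrative, and ``HybridModel`` refers to the class sketched earlier.

.. code:: python

   import torch
   import torch.distributed.rpc as rpc
   from torch.distributed.optim import DistributedOptimizer


   def _run_trainer(remote_emb_module, rank):
       # Hybrid model: remote embedding lookup + local FC layer wrapped in DDP.
       model = HybridModel(remote_emb_module, rank)

       # RRefs to the embedding table parameters held on the parameter server.
       model_parameter_rrefs = model.remote_emb_module.remote_parameters()

       # DistributedOptimizer needs RRefs for the local FC parameters too.
       # Walk model.fc.parameters() only: model.parameters() would recurse
       # into the RemoteModule, which is not supported.
       for param in model.fc.parameters():
           model_parameter_rrefs.append(rpc.RRef(param))

       # One optimizer over both the remote and the local parameters.
       opt = DistributedOptimizer(
           torch.optim.SGD,
           model_parameter_rrefs,
           lr=0.05,
       )
       criterion = torch.nn.CrossEntropyLoss()
       # ... training loop follows (see the next section) ...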

Now we're ready to introduce the main training loop that is run on each trainer.
``get_next_batch`` is just a helper function to generate random inputs and
targets for training. We run the training loop for multiple epochs and for each
batch:

1) Set up a `Distributed Autograd Context <https://pytorch.org/docs/master/rpc.html#torch.distributed.autograd.context>`__
   for Distributed Autograd.
2) Run the forward pass of the model and retrieve its output.
3) Compute the loss based on our outputs and targets using the loss function.
