We have 4 workers in our setup as follows:

1) 1 Master, which is responsible for creating the embedding table on the
   parameter server. The master also drives the training loop on the two trainers.
2) 1 Parameter Server, which holds the embedding table in memory and
   responds to RPCs from the Master and Trainers.
3) 2 Trainers, which store an FC layer (nn.Linear) that is replicated amongst
   themselves using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
   The trainers are also responsible for executing the forward pass, backward
   pass and optimizer step.

|
The entire training process is executed as follows:

1) The master creates an embedding table on the Parameter Server and holds an
   `RRef <https://pytorch.org/docs/master/rpc.html#rref>`__ to it.
2) The master then kicks off the training loop on the trainers and passes the
   embedding table RRef to the trainers.
3) The trainers create a ``HybridModel`` which first performs an embedding lookup
   using the embedding table RRef provided by the master and then executes the
   FC layer wrapped inside DDP (see the sketch after this list).
4) The trainers execute the forward pass of the model and use the loss to run
   the backward pass with Distributed Autograd.
5) As part of the backward pass, the gradients for the FC layer are computed
   first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients for the embedding table
   to the parameter server.
7) Finally, the `Distributed Optimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__ is used to update all the parameters.
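
As a rough illustration of step 3, a ``HybridModel`` along these lines performs a
remote embedding lookup followed by a DDP-wrapped FC layer. The layer sizes, the
``rpc_sync()`` proxy call, and the CPU-only setup below are illustrative
assumptions, not necessarily the exact model used in this tutorial.

.. code:: python

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    class HybridModel(nn.Module):
        """
        Sketch: a remote embedding lookup (via an RRef owned by the parameter
        server) followed by a local FC layer kept in sync across trainers by DDP.
        """

        def __init__(self, emb_rref, in_features=16, out_features=8):
            super().__init__()
            # RRef to an embedding table (e.g. nn.EmbeddingBag) on the parameter server.
            self.emb_rref = emb_rref
            # The dense part lives on the trainer and is replicated via DDP.
            self.fc = DDP(nn.Linear(in_features, out_features))

        def forward(self, indices, offsets):
            # 1) The embedding lookup runs remotely on the parameter server over RPC.
            emb_lookup = self.emb_rref.rpc_sync().forward(indices, offsets)
            # 2) The DDP-wrapped FC layer runs locally on the trainer.
            return self.fc(emb_lookup)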

.. attention::

    You should always use `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__
    for the backward pass if you're combining DDP and RPC.
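
To make steps 4-7 and the note above concrete, the trainers' inner loop (the body
behind the ``_run_trainer`` call mentioned later) might look roughly like the
sketch below. The loss function, learning rate, data format, and the way the
parameter RRefs are gathered are placeholder assumptions.

.. code:: python

    import torch
    import torch.distributed.autograd as dist_autograd
    import torch.optim as optim
    from torch.distributed.optim import DistributedOptimizer


    def train_loop_sketch(model, param_rrefs, data_iter):
        # `param_rrefs` is assumed to hold RRefs to every parameter: the remote
        # embedding table's parameters plus the local DDP-wrapped FC parameters
        # (e.g. rpc.RRef(p) for p in model.fc.parameters()).
        opt = DistributedOptimizer(optim.SGD, param_rrefs, lr=0.05)
        loss_fn = torch.nn.CrossEntropyLoss()

        for indices, offsets, target in data_iter:
            # Each iteration runs inside a distributed autograd context.
            with dist_autograd.context() as context_id:
                output = model(indices, offsets)
                loss = loss_fn(output, target)
                # Distributed Autograd (not loss.backward()) drives the backward
                # pass: DDP allreduces the FC gradients while RPC carries the
                # embedding gradients back to the parameter server.
                dist_autograd.backward(context_id, [loss])
                # The Distributed Optimizer updates local and remote parameters.
                opt.step(context_id)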

Now, let's go through each part in detail. First, we need to set up all of our
workers before we can perform any training. We create 4 processes such that
ranks 0 and 1 are our trainers, rank 2 is the master and rank 3 is the
parameter server.
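
Assuming the ``run_worker`` wrapper shown later in this section, the four
processes could be launched with ``torch.multiprocessing.spawn``, for example:

.. code:: python

    import torch.multiprocessing as mp

    if __name__ == "__main__":
        # Ranks 0 and 1 are trainers, rank 2 is the master, rank 3 is the parameter server.
        world_size = 4
        mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)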

The master creates the embedding table on the parameter server and then kicks
off the training loop on the trainers, as outlined above. Finally, the master
waits for all training to finish before exiting.
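
The master's role (steps 1 and 2 above) might be implemented along the following
lines; the worker names, embedding table sizes, and the ``rpc.remote``/``rpc_async``
calls are assumptions for illustration, and ``_run_trainer`` is the trainer-side
function referenced below.

.. code:: python

    import torch
    import torch.distributed.rpc as rpc


    def run_master_sketch():
        # Step 1: create the embedding table on the parameter server and hold an
        # RRef to it (the sizes here are placeholders).
        emb_rref = rpc.remote(
            "ps", torch.nn.EmbeddingBag, args=(100, 16), kwargs={"mode": "sum"}
        )

        # Step 2: kick off the training loop on both trainers, passing the RRef.
        futs = [
            rpc.rpc_async(f"trainer{rank}", _run_trainer, args=(emb_rref, rank))
            for rank in [0, 1]
        ]

        # Wait for all training to finish before exiting.
        for fut in futs:
            fut.wait()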

The trainers first initialize a ``ProcessGroup`` for DDP with world_size=2
(for two trainers) using `init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
Next, they initialize the RPC framework using the TCP init_method. Note that
the ports are different in RPC initialization and ProcessGroup initialization.
This is to avoid port conflicts between initialization of both frameworks.
Once the initialization is done, the trainers just wait for the ``_run_trainer``
RPC from the master.

The parameter server just initializes the RPC framework and waits for RPCs from
the trainers and master.

.. code:: python

    def run_worker(rank, world_size):
        r"""
        A wrapper function that initializes RPC, calls the function, and shuts down
        RPC.
        """

        # We need to use different port numbers in TCP init_method for init_rpc and
        # init_process_group to avoid port conflicts.
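
Building on the wrapper above, a rough sketch of the per-rank branching described
in this section might look as follows; the worker names, backend, and port numbers
are illustrative assumptions.

.. code:: python

    import os

    import torch.distributed as dist
    import torch.distributed.rpc as rpc
    from torch.distributed.rpc import TensorPipeRpcBackendOptions


    def run_worker_sketch(rank, world_size):
        # One port for the DDP ProcessGroup rendezvous, a different one for RPC,
        # so the two initializations do not collide.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"  # used by init_process_group (env:// default)
        rpc_backend_options = TensorPipeRpcBackendOptions()
        rpc_backend_options.init_method = "tcp://localhost:29501"  # used by init_rpc

        if rank == 2:
            # Master: creates the embedding table remotely and drives training.
            rpc.init_rpc("master", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)
        elif rank <= 1:
            # Trainers: a ProcessGroup of size 2 for DDP, plus RPC for everything else.
            dist.init_process_group(backend="gloo", rank=rank, world_size=2)
            rpc.init_rpc(f"trainer{rank}", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)
            # After this, the trainers simply wait for the _run_trainer RPC.
        else:
            # Parameter server: only initializes RPC and serves requests.
            rpc.init_rpc("ps", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)

        # Block until all RPCs finish, then shut down RPC on this worker.
        rpc.shutdown()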
0 commit comments