Conversation

@rohan-varma (Contributor) commented Feb 7, 2020

torch.distributed.rpc now enables parameter-server-style training in PyTorch via RPC-based APIs. This PR adds a simple parameter-server training example that launches several trainers and a single parameter server (PS) to train a single model on MNIST.
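For context, the launch pattern behind this example is roughly the following sketch (not the PR's exact code; `run_training_loop` is a hypothetical placeholder): rank 0 hosts the parameter server and every other rank runs a trainer.

```python
import os
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    if rank == 0:
        # The parameter server does no active work of its own; it just serves
        # RPCs from the trainers until shutdown() unblocks.
        rpc.init_rpc("parameter_server", rank=rank, world_size=world_size)
    else:
        rpc.init_rpc(f"trainer_{rank}", rank=rank, world_size=world_size)
        # run_training_loop(rank) would go here (hypothetical trainer entry point).
    rpc.shutdown()  # blocks until every RPC participant has finished

if __name__ == "__main__":
    world_size = 4  # 1 parameter server + 3 trainers
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
```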


# --------- Helper Methods --------------------

# On the local node, call a method with first arg as the value held by the RRef. Other args are passed in as arguments to the function called.
Contributor:

shall we break long comments into multiple lines?

return method(rref.local_value(), *args, **kwargs)

# Syncrhnous RPC to run a method remotely and get a result. The method should be a class method corresponding to
# Given an RRef, return the result of calling the passed in method on the value held by the RRef. This call is done on the remote node that owns the RRef. args and kwargs are passed into the method.
Contributor:

ditto

def call_method(method, rref, *args, **kwargs):
    return method(rref.local_value(), *args, **kwargs)

# Syncrhnous RPC to run a method remotely and get a result. The method should be a class method corresponding to
Contributor:

Syncrhnous -> Synchronous

print(loss)
dist_autograd.backward([loss])
param_rrefs = net.get_global_param_rrefs()
opt = DistributedOptimizer(optim.SGD, param_rrefs, lr=0.03)
Contributor:

Do we have to create an opt per iteration? Can this be done before the for loop?
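A sketch of what hoisting the optimizer out of the loop might look like (illustrative, not the final code; `net`, `train_loader`, and `get_global_param_rrefs` are the PR's names, and newer PyTorch releases additionally require passing the distributed autograd context id to `backward()` and `step()`):

```python
import torch.nn.functional as F
import torch.distributed.autograd as dist_autograd
from torch import optim
from torch.distributed.optim import DistributedOptimizer

# Build the DistributedOptimizer once, before the training loop, and reuse it.
param_rrefs = net.get_global_param_rrefs()
opt = DistributedOptimizer(optim.SGD, param_rrefs, lr=0.03)

for data, target in train_loader:
    # Each iteration still needs its own distributed autograd context.
    with dist_autograd.context() as context_id:
        loss = F.nll_loss(net(data), target)
        dist_autograd.backward([loss])  # newer releases: dist_autograd.backward(context_id, [loss])
        opt.step()                      # newer releases: opt.step(context_id)
```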

@mrshenli (Contributor) left a comment:

Thanks a lot for putting together this example!

])),
batch_size=32, shuffle=True, )
processes = []
# Run num_trainers workers, plus 1 for the parameter serever.
Contributor:

serever -> server

dist_autograd.backward([loss])
param_rrefs = net.get_global_param_rrefs()
opt = DistributedOptimizer(optim.SGD, param_rrefs, lr=0.03)
opt.step()
Contributor:

This is hogwild, right?

Contributor Author:

Yes, should we use locks?

Contributor:

hogwild is good, as long as we clearly state it. :)
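To spell out what Hogwild means here: the trainers apply their SGD updates to the same parameters hosted on the parameter server without any locking, accepting occasionally overlapping writes in exchange for throughput. A toy illustration of that scheme (the threads and tensors below are stand-ins, not the example's RPC machinery):

```python
import threading
import torch

shared_w = torch.zeros(10)  # stands in for the PS-hosted parameters

def trainer(steps=100, lr=0.03):
    for _ in range(steps):
        grad = torch.randn(10)    # stands in for a gradient from dist_autograd
        shared_w.sub_(lr * grad)  # unsynchronized in-place update: Hogwild-style

threads = [threading.Thread(target=trainer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_w)
```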

@mrshenli (Contributor) left a comment:

LGTM! Just need a few more minor edits before landing I think. Thanks @rohan-varma !

"world_size",
type=int,
default=4,
help="Total number of participating processes. Should be the sum of master node and all training nodes, add 1 if creating training node on master.")
Contributor:

Shall we break long lines into shorter ones? There are a few more below.
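One way to take this suggestion for the argument above (illustrative; assumes the surrounding argparse parser is named `parser`) is to rely on implicit string concatenation:

```python
parser.add_argument(
    "world_size",
    type=int,
    default=4,
    help="Total number of participating processes. Should be the sum of the "
         "master node and all training nodes; add 1 if a training node also "
         "runs on the master.")
```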

x = torch.flatten(x, 1)
# need to put this on CUDA
next_device = next(self.fc1.parameters()).device
# print("In forward, changing device to {}".format(str(next_device)))
Contributor:

This is a vestige of past debugging code?

return method(rref.local_value(), *args, **kwargs)

# Synchronous RPC to run a method remotely and get a result.
# The method should be a class method corresponding to Given an RRef,
Contributor:

There seems to be some missing text between "corresponding to" and "Given".

Or

Given an -> a given, and then fix the following clause?

Contributor Author:

I'll remove this portion and start with "Given an RRef" since the former point is covered below.
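Roughly what the revised helpers end up looking like after dropping the duplicated fragment (a sketch following the example's pattern; the final comment wording may differ):

```python
import torch.distributed.rpc as rpc

# On the local node, call a method with the first arg being the value held by
# the RRef; the remaining args are passed through to the method.
def call_method(method, rref, *args, **kwargs):
    return method(rref.local_value(), *args, **kwargs)

# Given an RRef, return the result of calling the passed-in method on the value
# held by the RRef. The call runs on the remote node that owns the RRef; args
# and kwargs are forwarded to the method.
def remote_method(method, rref, *args, **kwargs):
    args = [method, rref] + list(args)
    return rpc.rpc_sync(rref.owner(), call_method, args=args, kwargs=kwargs)
```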

# construct it once
param_server = ParameterServer(num_gpus=num_gpus)
print(
"Returning parameter server with ID {}".format(
Contributor:

looks like we can fit this in one line with f"Returning parameter server with ID {id(param_server)}"?
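For reference, the singleton accessor this snippet comes from looks roughly like the following (a sketch; `ParameterServer` is the PR's class, the lock name is an assumption, and the f-string follows the suggestion above):

```python
import threading

param_server = None
global_lock = threading.Lock()

def get_parameter_server(num_gpus=0):
    global param_server
    # Construct the ParameterServer exactly once, even if several trainers race here.
    with global_lock:
        if param_server is None:
            param_server = ParameterServer(num_gpus=num_gpus)
        print(f"Returning parameter server with ID {id(param_server)}")
        return param_server
```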

@rohan-varma (Contributor Author) commented:

@mrshenli Updated to address all comments, and also to explicitly move tensors in and out of GPU so that they work with the latest RPC, which disallows sending CUDA tensors. Could you take another look? Thanks!
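The GPU handling mentioned here boils down to the pattern below (a sketch; `ps_forward`, `model`, and `input_device` are illustrative names, not the PR's exact code): since RPC refused to serialize CUDA tensors at the time, the server computes on its GPU but only ever receives and returns CPU tensors.

```python
import torch

def ps_forward(model, inp, input_device):
    # Inputs arrive over RPC as CPU tensors; move them to the server's device.
    out = model(inp.to(input_device))
    # Responses also travel over RPC, so move the result back to CPU first.
    return out.to("cpu")
```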

@mrshenli (Contributor) left a comment:

LGTM! @jlin27 shall we land this example?

@rohan-varma changed the title from "[WIP] Simple example to demonstrate parameter server training pattern" to "Simple example to demonstrate parameter server training pattern" on Mar 21, 2020
@jlin27 merged commit 8a5b379 into pytorch:master on Mar 23, 2020
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025
Simple example to demonstrate parameter server training pattern