@@ -294,3 +294,80 @@ either the application or the model ``forward()`` method.
        run_demo(demo_basic, world_size)
        run_demo(demo_checkpoint, world_size)
        run_demo(demo_model_parallel, world_size)

Initialize DDP with torch.distributed.run/torchrun
----------------------------------------------------

We can leverage PyTorch Elastic to simplify the DDP code and initialize the job more easily.
Let's still use the ``ToyModel`` example and create a file named ``elastic_ddp.py``.

.. code:: python

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim

    from torch.nn.parallel import DistributedDataParallel as DDP

    class ToyModel(nn.Module):
        def __init__(self):
            super(ToyModel, self).__init__()
            self.net1 = nn.Linear(10, 10)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5)

        def forward(self, x):
            return self.net2(self.relu(self.net1(x)))


    def demo_basic():
        # torchrun sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE in the
        # environment, so no explicit arguments are needed here.
        dist.init_process_group("nccl")
        rank = dist.get_rank()
        print(f"Start running basic DDP example on rank {rank}.")

        # create model and move it to GPU with id rank
        device_id = rank % torch.cuda.device_count()
        model = ToyModel().to(device_id)
        ddp_model = DDP(model, device_ids=[device_id])

        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

        optimizer.zero_grad()
        outputs = ddp_model(torch.randn(20, 10))
        labels = torch.randn(20, 5).to(device_id)
        loss_fn(outputs, labels).backward()
        optimizer.step()

        # clean up the process group after training
        dist.destroy_process_group()

    if __name__ == "__main__":
        demo_basic()

One can then run a `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command
on all nodes to initialize the DDP job created above:

.. code:: bash

    torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py

This runs the DDP script on two hosts with 8 processes per host, that is, on 16 GPUs in total.
Note that ``$MASTER_ADDR`` must be the same across all nodes.
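
This also explains why ``elastic_ddp.py`` calls ``dist.init_process_group("nccl")`` without an
explicit rank or world size: ``torchrun`` exports the rendezvous information (``MASTER_ADDR``,
``MASTER_PORT``, ``RANK``, ``WORLD_SIZE``, and ``LOCAL_RANK``) as environment variables, and the
default ``env://`` initialization reads them. As a minimal sketch that is not part of the tutorial
script above, the local device could equally be selected from ``LOCAL_RANK`` instead of computing
``rank % torch.cuda.device_count()``:

.. code:: python

    import os

    import torch
    import torch.distributed as dist

    # torchrun sets LOCAL_RANK to the process index on this node, which is a
    # convenient choice for the CUDA device id.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group("nccl")
    print(f"Global rank {dist.get_rank()} is using cuda:{local_rank}.")
    dist.destroy_process_group()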

On each node, ``torchrun`` will launch 8 processes and invoke ``elastic_ddp.py`` in each of
them, but the user also needs a cluster management tool such as SLURM to actually run this
command on 2 nodes.

For example, on a SLURM-enabled cluster, we can write a script that runs the command above
and sets ``MASTER_ADDR`` as:

.. code:: bash

    export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)

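Putting the pieces together, the ``torchrun_script.sh`` referenced below might look like the
following sketch; the node and GPU counts and the rendezvous port simply mirror the command
shown earlier:

.. code:: bash

    #!/bin/bash
    # Resolve the first node in the SLURM allocation and use it as the
    # rendezvous endpoint shared by all nodes.
    export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)

    # Launch 8 workers on this node; all nodes rendezvous at $MASTER_ADDR:29400.
    torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d \
        --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py
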
Then we can just run this script using the SLURM command: ``srun --nodes=2 ./torchrun_script.sh``.
Of course, this is just an example; you can choose your own cluster scheduling tools
to initiate the torchrun job.
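
If you prefer batch submission over an interactive ``srun``, a hypothetical SLURM batch script
could wrap the same command; the directives below are standard ``sbatch`` options, and the
resource numbers simply mirror the two-node, 8-GPU-per-node example:

.. code:: bash

    #!/bin/bash
    #SBATCH --job-name=elastic_ddp
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1       # one torchrun launcher per node
    #SBATCH --gpus-per-node=8         # or --gres=gpu:8, depending on the cluster

    # srun starts one copy of the launcher script on each allocated node.
    srun ./torchrun_script.sh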

For more information about Elastic run, please refer to the
`quick start document <https://pytorch.org/docs/stable/elastic/quickstart.html>`__.