Adding distributed pipeline parallelism example #749
Conversation
Left a few suggestions for comments. Thanks for the great example @mrshenli!
labels = torch.zeros(batch_size, num_classes) \
             .scatter_(1, one_hot_indices, 1)

with dist_autograd.context() as context_id:
Should we add a short comment about what dist_autograd/dist_optimizer is doing here?
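For reference, here is a minimal sketch of how a distributed autograd context and DistributedOptimizer usually fit together; `model`, `inputs`, `labels`, and `parameter_rrefs()` are assumptions, not necessarily the names this example uses:

```python
import torch
import torch.nn as nn
import torch.distributed.autograd as dist_autograd
from torch.distributed.optim import DistributedOptimizer

loss_fn = nn.MSELoss()

# DistributedOptimizer takes RRefs to parameters that may live on remote
# workers; here we assume the model exposes them via parameter_rrefs().
opt = DistributedOptimizer(
    torch.optim.SGD,
    model.parameter_rrefs(),
    lr=0.05,
)

with dist_autograd.context() as context_id:
    outputs = model(inputs)
    # backward() propagates gradients across RPC boundaries and records them
    # in this autograd context instead of in the parameters' .grad fields.
    dist_autograd.backward(context_id, [loss_fn(outputs, labels)])
    # step() applies the gradients recorded in this context on every worker
    # that owns part of the model.
    opt.step(context_id)
```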
return nn.Sequential(*layers)
Maybe we should include some comments about what we're doing here at a high level (defining ResNet with 2 partitions so we can place them on separate machines). Also, should we call these Partitions or Shards instead of parts?
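As a rough illustration of what one of the two partitions might look like (the class name, layer layout, and worker names below are placeholders, not the exact ones in this example):

```python
import torch.nn as nn
import torch.distributed.rpc as rpc

class ResNetShard1(nn.Module):
    # First partition: the ResNet50 stem plus the early stages, placed on one worker.
    def __init__(self, device):
        super().__init__()
        self.device = device
        self.seq = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            # ... the layer1 and layer2 blocks would be built here ...
        ).to(device)

    def forward(self, x_rref):
        # Inputs arrive as an RRef from the caller; fetch and move to this device.
        x = x_rref.to_here().to(self.device)
        return self.seq(x).cpu()

# The driver would then place each shard on its own RPC worker, e.g.:
# p1_rref = rpc.remote("worker1", ResNetShard1, args=("cuda:0",))
# p2_rref = rpc.remote("worker2", ResNetShard2, args=("cuda:0",))
```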
Distributed Pipeline Parallel Example

This example shows how to distribute a ResNet50 model on two RPC workers and
then implement distributed pipeline parallelism using RPC.
Should we include a quick description of the pipelining strategy (pipelining micro-batches within a batch and then synchronously running the optimizer step)? Since this is like GPipe, should we also link the paper here?
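For context, the GPipe-style flow boils down to splitting each mini-batch into micro-batches, feeding them through the two shards asynchronously, and then running one synchronous optimizer step per mini-batch. A hedged sketch of the forward pass (`p1_rref`, `p2_rref`, and `split_size` are assumed attributes of the distributed model, not confirmed names):

```python
import torch
import torch.distributed.rpc as rpc

def forward(self, xs):
    out_futures = []
    # Split the mini-batch into micro-batches so the two shards can work
    # on different micro-batches at the same time (GPipe-style pipelining).
    for x in xs.split(self.split_size, dim=0):
        x_rref = rpc.RRef(x)
        # Stage 1 runs remotely and returns an RRef to its activations.
        y_rref = self.p1_rref.remote().forward(x_rref)
        # Stage 2 is launched asynchronously so micro-batches overlap.
        z_fut = self.p2_rref.rpc_async().forward(y_rref)
        out_futures.append(z_fut)
    # Wait for all micro-batches and stitch them back into one output batch.
    return torch.cat(torch.futures.wait_all(out_futures))
```

The paper it resembles is "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism" (https://arxiv.org/abs/1811.06965).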
Co-authored-by: Shen Li <[email protected]>
This example shows how to use RPC to implement pipeline parallelism. It can be viewed as a distributed counterpart of single-machine, multi-GPU pipeline parallelism.
The numbers below show how the total execution time decreases as num_split increases.
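A hypothetical timing loop, just to make the num_split knob concrete (`DistResNet50` and `train()` are stand-ins for whatever the example actually defines):

```python
import time

batch_size = 120
for num_split in [1, 2, 4, 8]:
    # Larger num_split means smaller micro-batches, so the two workers spend
    # more time overlapping and less time waiting on each other.
    split_size = batch_size // num_split
    model = DistResNet50(split_size, ["worker1", "worker2"])  # assumed constructor
    tik = time.time()
    train(model)  # assumed training helper
    print(f"num_split={num_split}: {time.time() - tik:.2f}s")
```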