This repository was archived by the owner on Aug 5, 2025. It is now read-only.

Conversation

@kwen2501
Contributor

@kwen2501 kwen2501 commented Dec 26, 2023

Description

This PR adds support for the case where the user creates and traces the model on CPU, then creates the pipeline stage on GPU.
PiPPy moves only the stage module to the corresponding GPU.
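The selective move can be pictured with a small sketch. `FakeModule`, `materialize_stage`, and the stage list are illustrative stand-ins, not PiPPy API; the real code operates on traced `nn.Module` submodules:

```python
class FakeModule:
    """Stand-in for a traced nn.Module stage (the real code uses torch)."""
    def __init__(self, name):
        self.name = name
        self.device = "cpu"  # traced on CPU

    def to(self, device):
        self.device = device
        return self

def materialize_stage(stage_mods, rank, device):
    # Move only the stage owned by this rank; all other stages stay on CPU.
    return stage_mods[rank].to(device)

stages = [FakeModule(f"stage{i}") for i in range(2)]
mine = materialize_stage(stages, rank=1, device="cuda:1")
print(mine.device, stages[0].device)  # cuda:1 cpu
```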

Test

```
torchrun --nproc-per-node 2 test_cpu_init.py
```

Update:

Sometimes, the `forward` function of user code creates constant tensors based on the input device:

```
device = input_ids.device
attention_mask = torch.ones(…, device=device)
```

As of now, the PT2 tracer does not treat `input_ids.device` as a symbolic device. As a result, `device="cpu"` gets burned into the generated code:

```
ones = torch.ones(…, device = device(type='cpu'))
```

To work around this, this PR adds a call during `PipelineStage` creation:

```
def _move_ops_to_device(new_device)
```

After this call, the `device=` kwarg of `torch.ones` is rewritten to the `new_device`.
This call is hidden from the user, so when symbolic device support is added, we can silently remove it without any user code change.
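The rewrite can be sketched as follows, using plain dicts to stand in for `torch.fx` graph nodes. This is illustrative only, not PiPPy's actual implementation, which walks the traced graph and recompiles it:

```python
def move_ops_to_device(nodes, new_device):
    """Rewrite the device= kwarg of every op that has one.

    `nodes` is a list of dicts standing in for fx.Graph nodes.
    """
    for node in nodes:
        if "device" in node.get("kwargs", {}):
            # Factory ops (torch.ones, torch.zeros, ...) get the new device.
            node["kwargs"]["device"] = new_device
    return nodes

graph = [
    {"op": "torch.ones", "kwargs": {"device": "cpu"}},  # burned-in CPU device
    {"op": "torch.add", "kwargs": {}},                  # no device kwarg: untouched
]
move_ops_to_device(graph, "cuda:0")
print(graph[0]["kwargs"]["device"])  # cuda:0
```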

We also checked `native_functions.yaml`: all APIs that take a `device` kwarg are generator ops, for which it is safe (and correct) to change the device value.

Real Example

```
cd examples/cpu_init
torchrun --nproc-per-node 4 bert_cpu_init.py
```

Cc: @muellerzr @SunMarc

Contributor

@lessw2020 lessw2020 left a comment
Looks good overall!
I did find an issue on the latest nightlies that will require a change in IR.py to make it work there, but it isn't directly related to this PR.
Specifically, a recent tracing refactor triggers this error:

```
AttributeError: module 'torch._export' has no attribute '_export_to_torch_ir'
```

Modifying the import to use the new location (`torch.export` rather than `torch._export`) resolves it; a simple fix:

(screenshot: updated_torch_export)
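One way to keep both old and new nightlies working is an import fallback. `import_first` is a hypothetical helper, not part of PiPPy, and the exact symbol location across nightlies is an assumption:

```python
import importlib

def import_first(*module_names):
    """Return the first importable module among the candidates.

    Hypothetical helper mirroring the fix above: prefer the new
    torch.export location, fall back to torch._export.
    """
    last_err = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as err:
            last_err = err
    raise last_err

# Intended usage (assumed, matching the comment above):
#   export_mod = import_first("torch.export", "torch._export")

# Demonstration with stdlib modules only:
mod = import_first("no_such_module_xyz", "json")
print(mod.__name__)  # json
```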

Not sure how you want to integrate that but that would be the one item to be fixed to ensure code works on latest nightlies.
With that, both runs pass:

(screenshots: cpu_completion, cpu_test_equivalence)

@lessw2020
Copy link
Contributor

Note: to separate the import issue I hit from this PR, as it is independent, I made a new PR expressly for that:
#924

@kwen2501 kwen2501 merged commit a4cc35f into main Dec 28, 2023
kwen2501 added a commit that referenced this pull request Jan 2, 2024
