
Conversation

anton-l (Member) commented Nov 7, 2022

Fixes the implementation and tests introduced in #1085

Looks like the two test_stable_diffusion_pipeline_with_sequential_cpu_offloading tests weren't checked on a GPU originally, which resulted in a device mismatch: https://github.com/huggingface/diffusers/actions/runs/3410777950/jobs/5674151054#step:10:551

@piEsposito FYI: the GitHub Actions workflows for the GPU tests aren't launched for PRs, so for future PRs please check them locally too :)
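
(For reference, one way to run the slow GPU tests locally, assuming the usual RUN_SLOW convention in the diffusers test suite; the test path and keyword filter below are only illustrative.)

# Roughly equivalent to `RUN_SLOW=1 python -m pytest -k sequential_cpu_offloading tests/` in a shell.
import os

import pytest

os.environ["RUN_SLOW"] = "1"  # enables the tests marked as slow
pytest.main(["-k", "sequential_cpu_offloading", "tests/"])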

HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

 if isinstance(module, torch.nn.Module):
     if module.device == torch.device("meta"):
-        return torch.device("cpu")
+        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
anton-l (Member Author):

@patrickvonplaten @piEsposito this feels hacky, but it is required to make the pipelines work when self.device is not the same as, e.g., the generator's device after offloading.
See the error here: https://github.com/huggingface/diffusers/actions/runs/3410777950/jobs/5674151054#step:10:551

It seems that accelerate doesn't populate param_original_devices here, so the only way to know where the model was supposed to go is to guess?

piEsposito (Contributor):

I understand - and I actually think this is a clever solution; I've learned a few things from this PR of yours.

Also, IMO it is correct to assume that, if a user has a GPU, they will use it instead of the CPU for diffusion models.

patrickvonplaten (Contributor):

Actually, the more I think about it: wouldn't the cleanest solution be to just return torch.device("meta") and then fix the bugs in the pipelines directly?

I'm a bit worried about making such a fundamental function this hacky.

Also cc @pcuenca - curious to hear your thoughts!

Comment on lines +638 to +639
# make sure that less than 2.2 GB is allocated
assert mem_bytes < 2.2 * 10**9
anton-l (Member Author) commented Nov 7, 2022:

@piEsposito the 768x512 images require ~2.16 GB of memory, compared to ~1.5 GB for the 512x512 text2img tests.

piEsposito (Contributor):

Yeah, thank you for catching that!

piEsposito (Contributor) left a comment:

I agree with the approach, thank you for teaching me those few things.

anton-l (Member Author) commented Nov 7, 2022

@piEsposito thank you for contributing the offloading solution too! 🤗

patrickvonplaten (Contributor) left a comment:

Thanks a mille for fixing this @anton-l!
The test fixes look great - I'm just not sure we want to make the self.device function this hacky, so I left a comment. Wdyt? Also cc @pcuenca

anton-l (Member Author) commented Nov 8, 2022

If we decide to return "meta" from self.device, then we'll have to replace every .to(self.device) in the pipeline with .to(self.execution_device) (where execution_device is saved before doing the CPU offload), because once we offload the models we can no longer access the original device name inside pipeline.__call__; it's all just "meta" (the device remapping happens inside the magic module hooks that accelerate assigns).

But @piEsposito (cc @sgugger), maybe there's still a way to access the intended execution device after CPU offloading? self.unet._hf_hook.execution_device is None when I inspect it from the pipeline.
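
(A minimal sketch of what that first option would mean, using a toy pipeline; ToyPipeline and _execution_device are illustrative names here, not the actual diffusers API.)

import torch
from accelerate import cpu_offload


class ToyPipeline:
    """Toy example: cache the intended execution device before offloading."""

    def __init__(self, unet: torch.nn.Module):
        self.unet = unet
        self._execution_device = torch.device("cpu")  # hypothetical attribute

    def enable_sequential_cpu_offload(self, gpu_id: int = 0):
        device = torch.device(f"cuda:{gpu_id}")
        # Save the target device *before* offloading; afterwards the module's
        # parameters all report the "meta" device.
        self._execution_device = device
        cpu_offload(self.unet, execution_device=device)

    @torch.no_grad()
    def __call__(self, batch_size: int = 1):
        # Every former `.to(self.device)` would become `.to(self._execution_device)`.
        sample = torch.randn(batch_size, 4, 8, 8).to(self._execution_device)
        return self.unet(sample)

The drawback is exactly what the comment above describes: every pipeline would need this bookkeeping instead of the device property handling it in one place.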

sgugger (Contributor) commented Nov 8, 2022

The execution device will be attached to the bottom-level module, not the top-level one.

piEsposito (Contributor) commented Nov 8, 2022

@anton-l cc @sgugger: the execution device appears as None because it is using a SequentialHook. If you get into this hook and find the AlignDevicesHook inside it, you should get what you are looking for.

If you try pipe.unet._hf_hook.hooks[0].execution_device, it should return an integer corresponding to the GPU index, assuming you have accelerate installed from master.
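
(Roughly, the inspection described above, as a sketch assuming a CUDA machine and accelerate/diffusers versions from around this PR; the model id is just an example.)

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.enable_sequential_cpu_offload()

hook = pipe.unet._hf_hook  # a SequentialHook after offloading
print(type(hook).__name__)
# The device-alignment hook inside it is supposed to carry the execution device
# (an integer GPU index or a torch.device); as the next reply notes, this can
# still be missing/None unless the model was loaded with device_map="auto".
print(getattr(hook.hooks[0], "execution_device", None))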

anton-l (Member Author) commented Nov 8, 2022

> If you try pipe.unet._hf_hook.hooks[0].execution_device it should return an integer corresponding to the GPU index

That seems to only work when device_map="auto", so I've added a deeper check over submodules.
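
(The "deeper check over submodules" presumably looks something like the sketch below; the helper name is made up for illustration and is not necessarily the exact code added in the PR.)

import torch


def guess_execution_device(pipe) -> torch.device:
    """Walk the unet's submodules and return the first execution device
    recorded by an accelerate hook, falling back to pipe.device."""
    for module in pipe.unet.modules():
        hook = getattr(module, "_hf_hook", None)
        device = getattr(hook, "execution_device", None)
        if device is not None:
            # accelerate may store an integer GPU index or a torch.device here
            return torch.device(device)
    return pipe.device

As @sgugger notes above, with sequential offloading the execution device lives on the bottom-level modules rather than on the top-level hook, which is why the traversal is needed.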

-# make sure that less than 1.5 GB is allocated
-assert mem_bytes < 1.5 * 10**9
+# make sure that less than 2.8 GB is allocated
+assert mem_bytes < 2.8 * 10**9
anton-l (Member Author):

Not sure how this increase happened yet; if someone can check mem_bytes here on their machine, that would be great :)
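
(For anyone checking locally: mem_bytes here is presumably the peak CUDA allocation, so a rough way to reproduce the number is a sketch like the one below; the model id, dtype, prompt, and step count are illustrative and may differ from the actual test.)

import torch
from diffusers import StableDiffusionPipeline

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe.enable_sequential_cpu_offload()

_ = pipe("a photograph of an astronaut riding a horse", num_inference_steps=5)

mem_bytes = torch.cuda.max_memory_allocated()
print(f"peak allocation: {mem_bytes / 10**9:.2f} GB")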

patrickvonplaten (Contributor) commented

Thanks a lot for the help here @sgugger - the PR looks good to me. Merging this as it's blocking some other PRs.
@anton-l - I actually think we simply didn't run the text encoder on the GPU at all before, which is why the GPU memory went up. Also related to #1047

patrickvonplaten merged commit 24895a1 into main on Nov 9, 2022
patrickvonplaten deleted the fix-cpu-offload branch on November 9, 2022 at 09:28
patrickvonplaten mentioned this pull request on Nov 9, 2022
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* Fix cpu offloading

* get offloaded devices locally for SD pipelines