
Conversation

@piEsposito (Contributor) commented Sep 16, 2022

On Stable Diffusion, if we leave all models in fp32 on the CPU except the unet, which goes to the GPU, and call enable_attention_slicing with slice size 1, we can run Stable Diffusion with 2.2 GB of GPU memory, offloading the lighter processes (everything but the diffusion steps) to the CPU.

To enable this behavior, we just have to call enable_minimal_memory_usage on StableDiffusionPipeline after instantiating it on the CPU.
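For reference, usage would look roughly like the snippet below (the checkpoint id is just an example, and depending on your setup from_pretrained may need use_auth_token=True):

```python
from diffusers import StableDiffusionPipeline

# load everything on the CPU in fp32 (the default) ...
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# ... then move only the unet to the GPU in fp16, as described above
pipe.enable_minimal_memory_usage()

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```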

I think it would close #540 if I'm not wrong.

@piEsposito (Contributor, Author)

@patrickvonplaten while working on #361, I think I found a way to reduce GPU RAM usage and enable running on 2.2 GB GPUs, offloading the compute-light but memory-heavy processes to the CPU.

(We just keep everything in fp32 on the CPU, put the unet in fp16 on the GPU, and then adapt the device each tensor is on in the __call__ method.)
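Roughly, the idea looks like the sketch below; this is a simplified illustration of the recipe described above, not the exact code in this PR (the attention-slicing call and the cache flush here are assumptions on my side):

```python
import torch

def enable_minimal_memory_usage(pipe):
    """Simplified sketch: keep the text encoder, VAE and safety checker in fp32
    on the CPU, and move only the unet to the GPU in half precision."""
    # the only compute-heavy module goes to fp16 on the GPU
    pipe.unet.to(torch.float16).to(torch.device("cuda"))
    # slice the attention computation so the peak memory per step stays small
    # (slice size 1, as described above)
    pipe.enable_attention_slicing(slice_size=1)
    torch.cuda.empty_cache()
    # the pipeline's __call__ then moves latents / embeddings to the unet's
    # device and dtype for each diffusion step, and back to the CPU for the
    # VAE decode
```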

@HuggingFaceDocBuilderDev commented Sep 16, 2022

The documentation is not available anymore as the PR was closed or merged.

@piEsposito changed the title from "Minimal memory usage stable diffusion" to "stable diffusion using < 2.2GB of GPU memory" on Sep 16, 2022
@jn-jairo

I have a 2 GB GPU and can run the Stable Diffusion from this fork basujindal/stable-diffusion up to the size 512x384, but with your code I can't run it even with smaller sizes; I always get an out of memory error.

Traceback (most recent call last):
  File "tst.py", line 27, in <module>
    pipe.enable_minimal_memory_usage()
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 96, in enable_minimal_memory_usage
    self.unet.to(torch.float16).to(torch.device("cuda"))
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.38 GiB already allocated; 14.94 MiB free; 1.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Maybe you could find out in that fork what makes it so optimized and add it to your code.

I wish I could help more, but I don't understand this A.I. code.

@piEsposito (Contributor, Author) commented Sep 18, 2022

@jn-jairo, please run nvtop and check what else is running and occupying space on your GPU. You can see that it stops allocating memory at 1.4 GB, so it is not that the model takes more memory than that; rather, your GPU has only 1.4 GB of memory free.
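If it helps, here is a quick way to check from Python how much GPU memory is actually free before loading the pipeline (assuming a reasonably recent PyTorch; nvidia-smi and nvtop report the same numbers):

```python
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.2f} GiB of {total_bytes / 1024**3:.2f} GiB total")
```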

As you can see here, we have a passing test that ensures it takes < 2.2 GB to run this model.

What makes it so memory efficient, in this case, is that we offload all the lighter models to the CPU and keep only the unet, which is responsible for the heavy processing of the diffusion steps, on the GPU in half precision.

@jn-jairo

@piEsposito there is literally nothing else running on the GPU; the GPU is in on-demand mode, so only Stable Diffusion is running on it.

As I said, the other fork is better optimized than yours. Have you taken a look at their code to see if they do something you are not doing to optimize memory? Because it looks like they do.

In the other fork, if I choose a smaller image size it uses less memory; in yours it makes no difference, it never fits in a 2 GB GPU.

@piEsposito changed the title from "stable diffusion using < 2.2GB of GPU memory" to "stable diffusion using < 2.3GB of GPU memory" on Sep 18, 2022
@piEsposito (Contributor, Author)

> @piEsposito there is literally nothing else running on the GPU; the GPU is in on-demand mode, so only Stable Diffusion is running on it.
>
> As I said, the other fork is better optimized than yours. Have you taken a look at their code to see if they do something you are not doing to optimize memory? Because it looks like they do.
>
> In the other fork, if I choose a smaller image size it uses less memory; in yours it makes no difference, it never fits in a 2 GB GPU.

The log you printed shows 1.96 GiB of total GPU memory capacity, while this model takes < 2.3 GB of memory (verified by a passing unit test), so, as said, it won't fit in 1.96 GiB and you would need a larger GPU.

IMO, using the GPU for the diffusion steps only, putting the unet in fp16, and enabling attention slicing with slice size 1 is as optimized as it can get while keeping the whole unet on the GPU.

The fork you posted goes further on memory optimization by splitting the unet into a few parts and loading them one at a time during the diffusion steps, which leads to lower memory usage but also a decrease in performance due to the GPU I/O. That is a possibility, but it is out of the scope of this PR.
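For illustration only, that kind of optimization would look roughly like the hypothetical helper below (a sketch of the idea, not code from this PR or from that fork): each part of the model is moved to the GPU just for its forward pass and then moved back to the CPU.

```python
import torch

def run_on_gpu_then_offload(module, *inputs):
    """Hypothetical helper: run a submodule on the GPU and immediately offload it,
    trading speed (extra host/device transfers) for lower peak GPU memory."""
    device = torch.device("cuda")
    module.to(device)
    inputs = tuple(x.to(device) for x in inputs)
    with torch.no_grad():
        output = module(*inputs)
    module.to("cpu")          # free the GPU for the next part of the model
    torch.cuda.empty_cache()
    return output
```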

Can you share the code snippet you are running so I can try reproducing this here and help you?

@jn-jairo

> > @piEsposito there is literally nothing else running on the GPU; the GPU is in on-demand mode, so only Stable Diffusion is running on it.
> > As I said, the other fork is better optimized than yours. Have you taken a look at their code to see if they do something you are not doing to optimize memory? Because it looks like they do.
> > In the other fork, if I choose a smaller image size it uses less memory; in yours it makes no difference, it never fits in a 2 GB GPU.
>
> The log you printed shows 1.96 GiB of total GPU memory capacity, while this model takes < 2.3 GB of memory (verified by a passing unit test), so, as said, it won't fit in 1.96 GiB and you would need a larger GPU.
>
> IMO, using the GPU for the diffusion steps only, putting the unet in fp16, and enabling attention slicing with slice size 1 is as optimized as it can get while keeping the whole unet on the GPU.
>
> The fork you posted goes further on memory optimization by splitting the unet into a few parts and loading them one at a time during the diffusion steps, which leads to lower memory usage but also a decrease in performance due to the GPU I/O. That is a possibility, but it is out of the scope of this PR.
>
> Can you share the code snippet you are running so I can try reproducing this here and help you?

OK, so it is possible to make that optimization, great. At least in this repo people pay attention; the original stable diffusion repo is pretty much abandoned. If that optimization could be made available as an option in some future update, that would be great.

About the code I am trying: it is just the one from the README with the astronaut prompt. I just set the width and height to smaller values trying to make it fit in memory, but as you said it won't fit without the other optimization.
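For context, the kind of snippet meant here is the README astronaut example plus the method from this PR (the exact sizes and checkpoint id below are illustrative):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.enable_minimal_memory_usage()

# smaller than the 512x512 default, trying to fit in a 2 GB GPU
image = pipe(
    "a photo of an astronaut riding a horse on mars",
    height=384,
    width=384,
).images[0]
image.save("astronaut_rides_horse.png")
```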

I will keep using the other fork for now, thank you for your patience and help.

@patil-suraj (Contributor)

Thanks a lot for the PR! As stated in our philosophy

> Readability and clarity is preferred over highly optimized code.

The goal with pipelines is to provide simple, readable, and easy-to-modify implementations that users can tweak for their own use cases. Rather than supporting everything in the default pipeline, we encourage users to take the code and tweak it the way they want. As there are various strategies to optimize memory usage, it's best to let users choose what they want and tweak it themselves, so we are not in favor of handling this here.

Hope you don't mind if we close the PR.

@CrazyBoyM

On my 1660 Ti (6 GB), it also ran out of memory.


Development

Successfully merging this pull request may close these issues.

Reduce Stable Diffusion memory usage by keeping unet only on GPU.
