
Conversation

@piEsposito (Contributor) commented Sep 16, 2022

On Stable Diffusion, if we leave all models in fp32 on the CPU except the unet, which goes to the GPU, and call enable_attention_slicing with slice size 1, we can run Stable Diffusion with 2.2 GB of GPU memory, offloading the lighter processes (everything but the diffusion steps) to the CPU.

To enable this behavior, we just have to call enable_minimal_memory_usage on StableDiffusionPipeline after instantiating it on the CPU.
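For reference, usage would look roughly like the snippet below (the checkpoint id is just an example, and depending on your setup from_pretrained may need use_auth_token=True):

```python
from diffusers import StableDiffusionPipeline

# load everything on the CPU in fp32 (the default) ...
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# ... then move only the unet to the GPU in fp16, as described above
pipe.enable_minimal_memory_usage()

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```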

I think it would close #540 if I'm not wrong.

@piEsposito (Contributor, Author)

@patrickvonplaten while working on #361, I think I found a way to reduce GPU RAM usage and enable running on 2.2 GB GPUs, offloading the compute-light but memory-heavy processes to the CPU.

(We just keep everything in fp32 on the CPU, put the unet in fp16 on the GPU, and then adapt the device each tensor is on in the __call__ method.)
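Roughly, the idea looks like the sketch below; this is a simplified illustration of the recipe described above, not the exact code in this PR (the attention-slicing call and the cache flush here are assumptions on my side):

```python
import torch

def enable_minimal_memory_usage(pipe):
    """Simplified sketch: keep the text encoder, VAE and safety checker in fp32
    on the CPU, and move only the unet to the GPU in half precision."""
    # the only compute-heavy module goes to fp16 on the GPU
    pipe.unet.to(torch.float16).to(torch.device("cuda"))
    # slice the attention computation so the peak memory per step stays small
    # (slice size 1, as described above)
    pipe.enable_attention_slicing(slice_size=1)
    torch.cuda.empty_cache()
    # the pipeline's __call__ then moves latents / embeddings to the unet's
    # device and dtype for each diffusion step, and back to the CPU for the
    # VAE decode
```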

@HuggingFaceDocBuilderDev commented Sep 16, 2022

The documentation is not available anymore as the PR was closed or merged.

@piEsposito changed the title from "Minimal memory usage stable diffusion" to "stable diffusion using < 2.2GB of GPU memory" on Sep 16, 2022
@jn-jairo

I have a 2 GB GPU and can run the Stable Diffusion from this fork basujindal/stable-diffusion up to the size 512x384, but with your code I can't run it even with smaller sizes; I always get an out of memory error.

Traceback (most recent call last):
  File "tst.py", line 27, in <module>
    pipe.enable_minimal_memory_usage()
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 96, in enable_minimal_memory_usage
    self.unet.to(torch.float16).to(torch.device("cuda"))
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/mnt/hd/opt/stable-diffusion/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.38 GiB already allocated; 14.94 MiB free; 1.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Maybe you could find out in that fork what makes it so optimized and add it to your code.

I wish I could help more, but I don't understand this A.I. code.

@piEsposito (Contributor, Author) commented Sep 18, 2022

@jn-jairo, please run nvtop and check what else is running and occupying space on your GPU. You can see that it stops allocating memory at 1.4 GB, so it is not that the model takes more memory than that; rather, your GPU has only 1.4 GB of memory free.
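If it helps, here is a quick way to check from Python how much GPU memory is actually free before loading the pipeline (assuming a reasonably recent PyTorch; nvidia-smi and nvtop report the same numbers):

```python
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.2f} GiB of {total_bytes / 1024**3:.2f} GiB total")
```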

As you can see here, we have a passing test that ensures it takes < 2.2 GB to run this model.

What makes it so memory efficient, in this case, is that we offload all the lighter models to the CPU and keep only the unet, which is responsible for the heavy processing of the diffusion steps, on the GPU in half precision.

@jn-jairo

@piEsposito there is literally nothing else running on the GPU; the GPU is in on-demand mode, so only Stable Diffusion is running on it.

As I said, the other fork is better optimized than yours. Have you taken a look at their code to see if they do something you are not doing to optimize memory? Because it looks like they do.

In the other fork, if I choose a smaller image size it uses less memory; in yours it makes no difference, it never fits in a 2 GB GPU.

@piEsposito changed the title from "stable diffusion using < 2.2GB of GPU memory" to "stable diffusion using < 2.3GB of GPU memory" on Sep 18, 2022
@piEsposito (Contributor, Author)

> @piEsposito there is literally nothing else running on the GPU; the GPU is in on-demand mode, so only Stable Diffusion is running on it.
>
> As I said, the other fork is better optimized than yours. Have you taken a look at their code to see if they do something you are not doing to optimize memory? Because it looks like they do.
>
> In the other fork, if I choose a smaller image size it uses less memory; in yours it makes no difference, it never fits in a 2 GB GPU.

The log you printed shows 1.96 GiB of total GPU memory capacity, while this model takes < 2.3 GB of memory (verified by a passing unit test), so, as said, it won't fit in 1.96 GiB and you would need a larger GPU.

IMO, using the GPU for the diffusion steps only, putting the unet in fp16, and enabling attention slicing with slice size 1 is as optimized as it can get while keeping the whole unet on the GPU.

The fork you posted goes further on memory optimization by splitting the unet into a few parts and loading them one at a time during the diffusion steps, which leads to lower memory usage but also a decrease in performance due to the GPU I/O. That is a possibility, but it is out of the scope of this PR.
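For illustration only, that kind of optimization would look roughly like the hypothetical helper below (a sketch of the idea, not code from this PR or from that fork): each part of the model is moved to the GPU just for its forward pass and then moved back to the CPU.

```python
import torch

def run_on_gpu_then_offload(module, *inputs):
    """Hypothetical helper: run a submodule on the GPU and immediately offload it,
    trading speed (extra host/device transfers) for lower peak GPU memory."""
    device = torch.device("cuda")
    module.to(device)
    inputs = tuple(x.to(device) for x in inputs)
    with torch.no_grad():
        output = module(*inputs)
    module.to("cpu")          # free the GPU for the next part of the model
    torch.cuda.empty_cache()
    return output
```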

Can you share the code snippet you are running so I can try reproducing this here and help you?

@jn-jairo

> > @piEsposito there is literally nothing else running on the GPU; the GPU is in on-demand mode, so only Stable Diffusion is running on it.
> > As I said, the other fork is better optimized than yours. Have you taken a look at their code to see if they do something you are not doing to optimize memory? Because it looks like they do.
> > In the other fork, if I choose a smaller image size it uses less memory; in yours it makes no difference, it never fits in a 2 GB GPU.
>
> The log you printed shows 1.96 GiB of total GPU memory capacity, while this model takes < 2.3 GB of memory (verified by a passing unit test), so, as said, it won't fit in 1.96 GiB and you would need a larger GPU.
>
> IMO, using the GPU for the diffusion steps only, putting the unet in fp16, and enabling attention slicing with slice size 1 is as optimized as it can get while keeping the whole unet on the GPU.
>
> The fork you posted goes further on memory optimization by splitting the unet into a few parts and loading them one at a time during the diffusion steps, which leads to lower memory usage but also a decrease in performance due to the GPU I/O. That is a possibility, but it is out of the scope of this PR.
>
> Can you share the code snippet you are running so I can try reproducing this here and help you?

OK, so it is possible to make that optimization, great. At least in this repo people pay attention; the original stable diffusion repo is pretty much abandoned. If that optimization could be made available as an option in some future update, that would be great.

About the code I am trying: it is just the one from the README with the astronaut prompt. I just set the width and height to smaller values trying to make it fit in memory, but as you said it won't fit without the other optimization.
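For context, the kind of snippet meant here is the README astronaut example plus the method from this PR (the exact sizes and checkpoint id below are illustrative):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.enable_minimal_memory_usage()

# smaller than the 512x512 default, trying to fit in a 2 GB GPU
image = pipe(
    "a photo of an astronaut riding a horse on mars",
    height=384,
    width=384,
).images[0]
image.save("astronaut_rides_horse.png")
```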

I will keep using the other fork for now, thank you for your patience and help.

@patil-suraj (Contributor)

Thanks a lot for the PR! As stated in our philosophy

> Readability and clarity is preferred over highly optimized code.

The goal with pipelines is to provide simple, readable, and easy-to-modify implementations that users can tweak for their own use cases. Rather than supporting everything in the default pipeline, we encourage users to take the code and tweak it the way they want. As there are various strategies to optimize memory usage, it's best to let users choose what they want and tweak it themselves, so we are not in favor of handling this here.

Hope you don't mind if we close the PR.

@CrazyBoyM

On my 1660 Ti (6 GB), it also ran out of memory.


Development

Successfully merging this pull request may close these issues.

Reduce Stable Diffusion memory usage by keeping unet only on GPU.
