stable diffusion using < 2.3GB of GPU memory #537
Conversation
@patrickvonplaten while working on #361, I think I found a way to reduce GPU RAM usage and enable running on GPUs with as little as 2.2 GB of memory by offloading the compute-light but memory-heavy processes to the CPU. (We keep everything in fp32 on the CPU and put only the unet, in fp16, on the GPU, then move each tensor to the right device inside the call method.)
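The device-juggling idea described above can be sketched with toy modules (this is not the actual PR code; `TinyOffloadPipeline` and its submodules are illustrative stand-ins):

```python
# Minimal sketch of the offload pattern: light modules stay on the CPU in fp32,
# the heavy "unet" lives on the GPU in fp16, and tensors are moved to the right
# device/dtype inside __call__. Toy nn.Linear layers stand in for real models.
import torch
import torch.nn as nn

class TinyOffloadPipeline:
    def __init__(self):
        self.gpu = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        # stand-ins for the compute-light, memory-heavy parts: CPU, fp32
        self.encoder = nn.Linear(8, 16)
        self.decoder = nn.Linear(16, 8)
        # stand-in for the compute-heavy part; fp16 only makes sense on CUDA
        dtype = torch.float16 if self.gpu.type == "cuda" else torch.float32
        self.unet = nn.Linear(16, 16).to(self.gpu, dtype)

    @torch.no_grad()
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)  # runs on CPU in fp32
        h = h.to(self.unet.weight.device, self.unet.weight.dtype)
        h = self.unet(h)     # runs on GPU in fp16 (when available)
        h = h.to(torch.float32).cpu()  # back to CPU fp32 for the light tail
        return self.decoder(h)

pipe = TinyOffloadPipeline()
out = pipe(torch.randn(2, 8))
```

Only the `unet`'s weights and activations occupy GPU memory; everything else never leaves the CPU.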
The documentation is not available anymore as the PR was closed or merged.
I have a 2 GB GPU and can run Stable Diffusion from this fork, basujindal/stable-diffusion, up to a certain image size. Maybe you could find out what makes that fork so optimized and add it to your code. I wish I could help more, but I don't understand this AI code.
@jn-jairo, please run nvtop and check what else is running and occupying space on your GPU. The log shows allocation stopping at 1.4 GB, so it is not that the model takes more memory than that; rather, your GPU has only 1.4 GB of memory free. As you can see here, we have a passing test that ensures this model runs in < 2.2 GB. What makes it so memory efficient is that we offload all the lighter models to the CPU and keep only the unet, which is responsible for the heavy processing of the diffusion steps, on the GPU in half precision.
@piEsposito there is literally nothing else running on the GPU; the GPU is in on-demand mode, so only Stable Diffusion is using it. As I said, the other fork is more optimized than yours. Have you taken a look at their code to see if they do something else to optimize that you are not doing? It looks like they do: in the other fork, choosing a smaller image size uses less memory, while in yours it makes no difference and it never fits in a 2 GB GPU.
The log you printed shows 1.96 GB of GPU memory capacity, while this model takes < 2.3 GB of memory (verified by a passing unit test), so it won't fit on 1.96 GB; you would need a larger GPU. IMO, using the GPU for the diffusion steps only, putting the unet in fp16, and enabling attention slicing with slice size 1 is the most optimized it can get while keeping the whole unet on the GPU. The fork you posted goes further on memory optimization by splitting the unet into a few parts and loading them during the diffusion steps, which leads to better memory usage but also a decrease in performance due to the GPU I/O; that is a possibility, but out of the scope of this PR. Can you share the code snippet you are running so I can try to reproduce this and help you?
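To illustrate the attention-slicing idea mentioned above: instead of materializing the full (heads, seq, seq) attention-score tensor at once, the scores are computed one slice of heads at a time, trading a little speed for a much smaller peak memory footprint. A toy sketch in plain PyTorch (not the diffusers internals) shows the two paths produce the same result:

```python
# Attention computed in full vs. one head-slice at a time (slice size 1).
# With slicing, only a (slice, seq, seq) score matrix is live at any moment.
import torch

def attention_full(q, k, v):
    scale = q.shape[-1] ** 0.5
    scores = torch.softmax(q @ k.transpose(-1, -2) / scale, dim=-1)
    return scores @ v

def attention_sliced(q, k, v, slice_size=1):
    scale = q.shape[-1] ** 0.5
    out = torch.empty_like(v)
    for start in range(0, q.shape[0], slice_size):
        end = min(start + slice_size, q.shape[0])
        # only this slice's attention map exists in memory at once
        s = torch.softmax(q[start:end] @ k[start:end].transpose(-1, -2) / scale, dim=-1)
        out[start:end] = s @ v[start:end]
    return out

heads, seq, dim = 8, 64, 32
q, k, v = (torch.randn(heads, seq, dim) for _ in range(3))
assert torch.allclose(attention_full(q, k, v), attention_sliced(q, k, v), atol=1e-5)
```

With slice size 1 the peak score-tensor memory drops by a factor of `heads`, which is why it helps so much on small GPUs.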
Ok, so that optimization is possible, great. At least in this repo people pay attention; the original stable diffusion repo is pretty much abandoned. It would be great if that optimization could be made available as an option in some future update. About the code: I am just running the example from the readme with the astronaut prompt; I set the width and height to smaller values trying to make it fit in memory, but as you said it won't fit without the other optimization. I will keep using the other fork for now. Thank you for your patience and help.
Thanks a lot for the PR! As stated in our philosophy:
The goal with pipelines is to provide simple, readable, and easy-to-modify implementations that users can tweak for their own use cases. Rather than supporting everything in the default pipeline, we encourage users to take the code and tweak it the way they want. As there are various strategies to optimize memory usage, it's best to let users choose what they want and tweak it themselves, so we are not in favor of handling this here. Hope you don't mind if we close the PR.
On my 1660 Ti (6 GB), it also ran out of memory.
On Stable Diffusion, if we leave all models in fp32 on the CPU but keep the `unet` on the GPU, and call `enable_attention_slicing` with slice size 1, we can run Stable Diffusion with 2.2 GB on the GPU while offloading the lighter processes (everything but the diffusion steps) to the CPU. To enable this behavior, we just have to call `enable_minimal_memory_usage` on `StableDiffusionPipeline` after instantiating it on the CPU. I think it would close #540 if I'm not wrong.
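The behavior such a helper would need can be sketched with toy modules; note this is a hypothetical illustration of the idea, not the diffusers implementation, and `ToyPipeline` and its attribute names are made up for the example:

```python
# Hypothetical sketch of what a helper like enable_minimal_memory_usage could do:
# keep only the unet on the GPU in half precision, everything else on the CPU
# in fp32. Toy nn.Linear layers stand in for the real pipeline components.
import torch
import torch.nn as nn

class ToyPipeline:
    def __init__(self):
        self.vae = nn.Linear(4, 4)
        self.text_encoder = nn.Linear(4, 4)
        self.unet = nn.Linear(4, 4)

def enable_minimal_memory_usage(pipe):
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    dtype = torch.float16 if device.type == "cuda" else torch.float32
    pipe.unet.to(device, dtype)             # heavy diffusion model: GPU, fp16
    pipe.vae.to("cpu", torch.float32)       # light models: CPU, fp32
    pipe.text_encoder.to("cpu", torch.float32)
    return pipe

pipe = enable_minimal_memory_usage(ToyPipeline())
```

The calling code would then move latents and embeddings to `pipe.unet`'s device and dtype around the diffusion loop, as described earlier in the thread.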