Strategize slicing based on free [V]RAM #2572
Conversation
I know we're nowhere near there, but do you have any ideas about how we'd make this work in a batched/parallel-generation environment?
We're nowhere near there. :) For a batched system, each batch gets to make the call as to whether there's available [V]RAM to run fast or sliced. If we're on a parallel-generation system, we'd have to have some awareness of the other concurrent generations on the system and add all of their memory requirements first before deciding how to run ours, or (the simple solution) every job on that type of system has to run with slicing enabled.
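A hedged sketch of how that per-job call might look in a parallel setting (illustrative only, not code from this PR; `reserved_by_other_jobs` is an assumed input):

```python
# Hypothetical sketch - this PR does not implement parallel scheduling.
def fits_unsliced(job_bytes: int, free_vram_bytes: int,
                  reserved_by_other_jobs: list[int]) -> bool:
    """Add every concurrent generation's reservation before deciding."""
    headroom = free_vram_bytes - sum(reserved_by_other_jobs)
    return job_bytes <= headroom
```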
I also want to get some Windows + CUDA testers to run in a debugger and see if you're getting an accurate amount of free VRAM back from torch. If not, everything will still run - just in slices in all cases.
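For Windows + CUDA testers, a minimal check (assuming a recent torch) of what the free-VRAM query reports, in bytes:

```python
import torch

if torch.cuda.is_available():
    # Compare these numbers against your GPU monitoring tool while a model is loaded.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"free: {free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB")
else:
    print("CUDA not available")
```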
Thanks for doing this. I am looking forward to testing this feature after the 2.3.0 dust settles.
@JPPhoto I'm happy to test, but I'm not sure what I'm looking for... should I just see if the magic smoke escapes? |
That is definitely part of it. But I want to make sure you can render at sizes that are appropriate for your free [V]RAM without crashing, with as much speed as possible. What's your setup? On my 12GB NVIDIA card, tensors for a 512x512 image fit entirely in VRAM alongside the model. When I scale up, memory requirements grow - the rough formula is that you need 16 * ((x * y / 64) ^ 2) * 6 (or 8 if using fp32) bytes of [V]RAM. So a 1280x1280 image is possible on my card via slicing implemented by diffusers and shouldn't cause an OOM.
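As a sanity check on that rough formula, a small sketch (the helper name is mine, not the PR's) that prints the unsliced estimate for a few sizes:

```python
def estimated_attention_bytes(width: int, height: int, fp32: bool = False) -> int:
    # 16 * ((x * y / 64) ^ 2) * 6 (or 8 for fp32), per the rough formula above
    latent_pixels = width * height / 64
    return int(16 * latent_pixels ** 2 * (8 if fp32 else 6))

for size in (512, 768, 1280):
    gib = estimated_attention_bytes(size, size) / 2**30
    print(f"{size}x{size}: ~{gib:.1f} GiB unsliced")
```

On a 12GB card that matches the comment above: 512x512 comes out around 1.5 GiB unsliced, while 1280x1280 comes out near 59 GiB, which is why it only works with slicing.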
I'm on an M1 MacBook with 32GB memory (shared RAM/VRAM). I don't think invoke can access my VRAM usage, so anything that makes decisions based on that probably will not work.
I want to see if it works on that platform as well (and more importantly that I didn't break it), so please go ahead and give it a shot.
lstein
left a comment
Confirmed working on Ubuntu.
damian0815
left a comment
Looks good other than the naming of the _enable_memory_efficient_attention function.
…t_attention as this happens every generation.
* new OffloadingDevice loads one model at a time, on demand
* fixup! new OffloadingDevice loads one model at a time, on demand
* fix(prompt_to_embeddings): call the text encoder directly instead of its forward method, allowing any associated hooks to run with it.
* more attempts to get things on the right device from the offloader
* more attempts to get things on the right device from the offloader
* make offloading methods an explicit part of the pipeline interface
* inlining some calls where device is only used once
* ensure model group is ready after pipeline.to is called
* fixup! Strategize slicing based on free [V]RAM (#2572)
* doc(offloading): docstrings for offloading.ModelGroup
* doc(offloading): docstrings for offloading-related pipeline methods
* refactor(offloading): s/SimpleModelGroup/FullyLoadedModelGroup
* refactor(offloading): s/HotSeatModelGroup/LazilyLoadedModelGroup to frame it in the same terms as "FullyLoadedModelGroup"

---------

Co-authored-by: Damian Stewart <[email protected]>
I had some spare time so I strategized slicing - none or max - at runtime based on the size of the generation requested and free [V]RAM at that time. This needs testers on multiple platforms.
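A minimal sketch of the decision this describes, assuming the CUDA path and reusing the rough formula from the conversation above (the function and its fallback behavior are illustrative, not the PR's actual code):

```python
import torch

def choose_attention_slicing(width: int, height: int, fp32: bool = False) -> str:
    """Pick "none" when the unsliced tensors should fit in free VRAM, else "max"."""
    needed = int(16 * (width * height / 64) ** 2 * (8 if fp32 else 6))
    if not torch.cuda.is_available():
        # No reliable free-VRAM figure (e.g. MPS): slice so everything still runs
        return "max"
    free_bytes, _total = torch.cuda.mem_get_info()
    return "none" if needed <= free_bytes else "max"
```

The result can then be mapped onto diffusers calls such as `pipeline.enable_attention_slicing("max")` or `pipeline.disable_attention_slicing()`.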