
Conversation

@JPPhoto
Contributor

@JPPhoto JPPhoto commented Feb 7, 2023

I had some spare time, so I made the slicing strategy - none or max - a runtime decision based on the size of the requested generation and the free [V]RAM available at that time. This needs testers on multiple platforms.
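For context, here is a minimal sketch of the kind of runtime decision described above, assuming a CUDA device and a diffusers pipeline. The helper name, the threshold logic, and the memory estimate (which restates the rough formula given later in this thread) are illustrative only, not the code in this PR:

```python
import torch
from diffusers import StableDiffusionPipeline


def choose_slicing(pipe: StableDiffusionPipeline, width: int, height: int) -> None:
    """Hypothetical helper: pick attention slicing at generation time from free VRAM."""
    # Rough per-image attention memory estimate (see the formula later in this thread):
    # 16 * ((w * h / 64) ** 2) * bytes_per_element
    bytes_per_element = 6 if pipe.unet.dtype == torch.float16 else 8
    needed = 16 * ((width * height / 64) ** 2) * bytes_per_element

    free_vram, _total = torch.cuda.mem_get_info()  # driver-reported (free, total) in bytes

    if needed > free_vram:
        pipe.enable_attention_slicing("max")  # smallest peak memory, slowest
    else:
        pipe.disable_attention_slicing()      # fastest, needs the most VRAM
```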

@Kyle0654
Contributor

Kyle0654 commented Feb 7, 2023

I know we're nowhere near there, but do you have any ideas about how we'd make this work in a batched/parallel-generation environment?

@JPPhoto
Contributor Author

JPPhoto commented Feb 7, 2023

> I know we're nowhere near there, but do you have any ideas about how we'd make this work in a batched/parallel-generation environment?

We're nowhere near there. :)

For a batched system, each batch gets to decide for itself, based on the [V]RAM available at that moment, whether to run fast (unsliced) or max sliced. I don't think there are any code changes required for that case.

If we're on a parallel-generation system, we'd need some awareness of the other concurrent generations on the system and would have to add up all of their memory requirements before deciding how to run ours, or (the simple solution) every job on that kind of system runs with max slicing to avoid OOM errors. I also imagine that sub-quadratic slicing will take care of a lot of these issues when it's implemented in Invoke or (better) in diffusers.
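A sketch of that first option, with hypothetical names and under the assumption that each in-flight job's memory needs can be estimated up front; again, this is not code from the PR:

```python
from typing import Iterable


def choose_slice_mode(my_estimate: int, other_estimates: Iterable[int], free_vram: int) -> str:
    """Decide slicing for one job while other generations may be in flight."""
    if free_vram - sum(other_estimates) >= my_estimate:
        return "none"  # enough headroom even after counting everyone else
    return "max"       # the safe default on a shared system
```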

@JPPhoto
Contributor Author

JPPhoto commented Feb 7, 2023

I'd also like some Windows + CUDA testers to run this in a debugger and check whether torch is reporting an accurate amount of free VRAM. If it isn't, everything will still run - just sliced in all cases.
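One way to check this by hand (not part of the PR) is to compare what torch reports against nvidia-smi or Task Manager:

```python
import torch

free_vram, total_vram = torch.cuda.mem_get_info()  # driver-reported free/total bytes
allocated = torch.cuda.memory_allocated()          # bytes currently allocated by torch tensors
reserved = torch.cuda.memory_reserved()            # bytes held by torch's caching allocator

print(f"free={free_vram / 2**30:.2f} GiB, total={total_vram / 2**30:.2f} GiB, "
      f"allocated={allocated / 2**30:.2f} GiB, reserved={reserved / 2**30:.2f} GiB")
```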

@lstein
Collaborator

lstein commented Feb 8, 2023

Thanks for doing this. I am looking forward to testing this feature after the 2.3.0 dust settles.

@psychedelicious
Contributor

@JPPhoto I'm happy to test, but I'm not sure what I'm looking for... should I just see if the magic smoke escapes?

@JPPhoto
Contributor Author

JPPhoto commented Feb 8, 2023

> @JPPhoto I'm happy to test, but I'm not sure what I'm looking for... should I just see if the magic smoke escapes?

That is definitely part of it. But I also want to make sure you can render, as fast as possible and without crashing, at sizes that are appropriate for your free [V]RAM. What's your setup?

On my 12GB NVIDIA card, the tensors for a 512x512 image fit entirely in VRAM alongside the model. When I scale up, memory requirements grow quickly - the rough formula is that you need 16 * ((x * y / 64) ^ 2) * 6 bytes of [V]RAM (or * 8 instead of * 6 if using fp32). So a 1280x1280 image is possible on my card via the slicing implemented by diffusers and shouldn't cause an OOM.
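Plugging numbers into that rough formula (a quick check, nothing more):

```python
def rough_vram_bytes(width: int, height: int, fp32: bool = False) -> int:
    # 16 * ((x * y / 64) ^ 2) * 6, or * 8 for fp32, as stated above
    bytes_per_element = 8 if fp32 else 6
    return int(16 * ((width * height / 64) ** 2) * bytes_per_element)


print(rough_vram_bytes(512, 512) / 2**30)    # ~1.5 GiB: fits next to the model on a 12GB card
print(rough_vram_bytes(1280, 1280) / 2**30)  # ~58.6 GiB: far too big unsliced, hence "max" slicing
```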

@psychedelicious
Contributor

I'm on an M1 MacBook with 32GB memory (shared RAM/VRAM). I don't think invoke can access my VRAM usage, so anything that makes decisions based on that probably will not work.
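A hypothetical fallback for platforms like this, where there is no separate VRAM pool to query; psutil and the function name are assumptions here, not anything this PR ships:

```python
from typing import Optional

import psutil  # third-party; an assumption, not a dependency of this PR
import torch


def free_memory_bytes() -> Optional[int]:
    """Best-effort free-memory query; None means the caller should default to max slicing."""
    if torch.cuda.is_available():
        free, _total = torch.cuda.mem_get_info()
        return free
    if torch.backends.mps.is_available():
        # Apple Silicon uses unified memory, so use overall free system RAM as a rough proxy.
        return psutil.virtual_memory().available
    return None
```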

@JPPhoto
Contributor Author

JPPhoto commented Feb 8, 2023

I want to see if it works on that platform as well (and more importantly that I didn't break it), so please go ahead and give it a shot.

@JPPhoto JPPhoto marked this pull request as draft February 8, 2023 14:37
@JPPhoto JPPhoto marked this pull request as ready for review February 10, 2023 16:00
Collaborator

@lstein lstein left a comment


Confirmed working on Ubuntu.

Contributor

@damian0815 damian0815 left a comment


Looks good other than the naming of the _enable_memory_efficient_attention function.

…t_attention as this happens every generation.
@JPPhoto JPPhoto enabled auto-merge (squash) February 12, 2023 18:10
@JPPhoto JPPhoto merged commit 9eed191 into main Feb 12, 2023
@JPPhoto JPPhoto deleted the JPPhoto-choose-slicing-strategy branch February 12, 2023 18:24
damian0815 added a commit that referenced this pull request Feb 16, 2023
* new OffloadingDevice loads one model at a time, on demand

* fixup! new OffloadingDevice loads one model at a time, on demand

* fix(prompt_to_embeddings): call the text encoder directly instead of its forward method

allowing any associated hooks to run with it.

* more attempts to get things on the right device from the offloader

* more attempts to get things on the right device from the offloader

* make offloading methods an explicit part of the pipeline interface

* inlining some calls where device is only used once

* ensure model group is ready after pipeline.to is called

* fixup! Strategize slicing based on free [V]RAM (#2572)

* doc(offloading): docstrings for offloading.ModelGroup

* doc(offloading): docstrings for offloading-related pipeline methods

* refactor(offloading): s/SimpleModelGroup/FullyLoadedModelGroup

* refactor(offloading): s/HotSeatModelGroup/LazilyLoadedModelGroup

to frame it in the same terms as "FullyLoadedModelGroup"

---------

Co-authored-by: Damian Stewart <[email protected]>