Optimize VRAM use in textual inversion training #687
Conversation
```diff
     text_encoder.train()
     for step, batch in enumerate(train_dataloader):
-        with accelerator.accumulate(text_encoder):
+        with accelerator.autocast(), accelerator.accumulate(text_encoder):
```
What would be necessary in order for this to work without autocast?
There's some concern about its use: #511
Speed is about 10% better without autocast and VRAM use decreases to 5500 MB. It does need some additional casts, and I don't think it's quite identical functionally, since some operations that would be autocast to fp32 are computed in fp16. Results still look fine though, so it might be okay. The smaller VRAM use also enables increasing the batch size or disabling gradient checkpointing for an even bigger speed-up.
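(For illustration, a generic PyTorch sketch of the kind of extra cast this needs, not the training script itself: a layer whose weights were cast to fp16 rejects fp32 inputs, so activations coming from the fp32 text encoder have to be cast by hand.)

```python
import torch

# Generic sketch (requires CUDA): once autocast is removed, an fp16 module no
# longer accepts fp32 activations, so the caller inserts the cast explicitly.
frozen_fp16_layer = torch.nn.Linear(16, 16).to("cuda", dtype=torch.float16)
fp32_activations = torch.randn(2, 16, device="cuda")  # e.g. output of an fp32 text encoder

# frozen_fp16_layer(fp32_activations) would raise a dtype-mismatch error;
# the explicit cast replaces what autocast used to do implicitly:
out = frozen_fp16_layer(fp32_activations.to(torch.float16))
```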
I checked a bit. The main problem is `accelerator.backward(loss)`. If that can be done in fp16, it should work without autocast.
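For context, an fp16 backward pass is usually paired with loss scaling so small gradients don't underflow to zero; in plain PyTorch that is the `GradScaler` pattern below. This is a generic sketch, not accelerate's internals, which also keep the trainable parameters in fp32.

```python
import torch

# Standard mixed-precision pattern (requires CUDA): fp32 parameters, a forward
# pass under autocast, and a GradScaler around backward/step.
model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    loss = model(x).pow(2).mean()   # forward runs in fp16 where safe

scaler.scale(loss).backward()        # backward on the scaled loss
scaler.step(optimizer)               # unscales grads, skips the step on inf/nan
scaler.update()
```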
@isamu-isozaki is this the approach you've been using?
@keturn sorry, on second thought this is way different from my approach, but it's way better too!
If you add
@patil-suraj can you take a look here?
I'm using locally saved weights and adding
I tried revision fp16 and got an OOM for some reason. Will double-check later today.
Thanks a lot for the PR! As explained in the comments, we shouldn't cast all weights to half precision unless the user asks for it, and autocast should not be used by default, as it won't allow full-precision training.
Also, please note that the example scripts are just examples: they show how to do a certain task in a simple and easy way. For more customization, it is recommended that users modify the script on their own as they need. This helps keep the script simple, so anyone can understand and adapt it if needed.
If you just pass `mixed_precision="fp16"`, accelerate should enable mixed precision without any code changes.
I'm not in favor of this. Hope you understand, thanks a lot!
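(For reference, a minimal sketch of that path, assuming a CUDA device and assuming the script forwards its `--mixed_precision` argument to `Accelerator` as usual.)

```python
from accelerate import Accelerator

# Minimal sketch (needs a CUDA device for fp16): with mixed_precision="fp16",
# accelerate wraps the forward pass of prepared models in autocast and applies
# loss scaling inside accelerator.backward(), so the training loop itself
# needs no explicit autocast.
accelerator = Accelerator(mixed_precision="fp16")
```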
```python
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
```
The unet is not trained in textual inversion, so gradient checkpointing here is not necessary, as no grads are computed for it.
```python
    weight_dtype = torch.float32
    if args.mixed_precision == "fp16":
        weight_dtype = torch.float16
    elif args.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16
```
This should be enabled by a flag; we can't always assume the user wants to cast weights to half precision. Also, in mixed-precision training the weights are usually not cast to half precision; only the forward pass runs in half precision.
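A sketch of the flag-gated version being suggested; the `--cast_frozen_weights` name is hypothetical, purely to illustrate the idea:

```python
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--mixed_precision", choices=["no", "fp16", "bf16"], default="no")
# Hypothetical flag, off by default, so plain fp32 training stays the default.
parser.add_argument("--cast_frozen_weights", action="store_true")
args = parser.parse_args()

weight_dtype = torch.float32
if args.cast_frozen_weights:
    if args.mixed_precision == "fp16":
        weight_dtype = torch.float16
    elif args.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16
```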
```diff
     text_encoder.train()
     for step, batch in enumerate(train_dataloader):
-        with accelerator.accumulate(text_encoder):
+        with accelerator.autocast(), accelerator.accumulate(text_encoder):
```
This will always do the training in half precision; what if the user wants to do fp32 training? We should not put autocast directly here.
```python
        output_states = ()

        for resnet, attn in zip(self.resnets, self.attentions):
            if self.training and self.gradient_checkpointing:
```
This should not be removed; gradient checkpointing is only required during training.
The idea with casting the weights of the non-trained nets is that, without it, fp32 weights are transferred to VRAM even when training in fp16. Since they are not trained, we don't need to keep an fp32 copy of them in VRAM.
I can maintain a copy of the script that casts non-trained weights to fp16 locally, but it would be nice if the gradient checkpointing changes were merged. Would you be fine with that?
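A minimal sketch of that idea (the checkpoint name and loading boilerplate are only for illustration; the script builds these models from its own arguments):

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel

# Example checkpoint; the script takes this from --pretrained_model_name_or_path.
model_id = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

device = "cuda"
weight_dtype = torch.float16  # e.g. from --mixed_precision=fp16

# The frozen, untrained modules get no optimizer state and no grads, so there
# is no need to keep an fp32 copy of their weights in VRAM.
vae.requires_grad_(False)
unet.requires_grad_(False)
vae.to(device, dtype=weight_dtype)
unet.to(device, dtype=weight_dtype)

# The text encoder holds the trainable token embedding, so it stays in fp32.
text_encoder.to(device)
```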
That's a really good observation! Sorry, I rushed the review a bit. In this case keeping the gradient checkpointing changes makes sense; let me try it quickly and get back to you. Thanks a lot!
Also pinging @patrickvonplaten and @anton-l. Are the activations stored even when grads are disabled for the model?
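(A quick generic check of that point, independent of the training script: autograd keeps the graph, and the saved activations, as soon as the input requires grad, even if every parameter of the module is frozen. That appears to be the situation here, since the unet consumes embeddings coming from the trainable token embedding.)

```python
import torch

frozen = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 8))
frozen.requires_grad_(False)

x_plain = torch.randn(1, 8)
x_needs_grad = torch.randn(1, 8, requires_grad=True)

print(frozen(x_plain).grad_fn)       # None: no graph, no activations kept
print(frozen(x_needs_grad).grad_fn)  # <AddmmBackward0>: graph (and activations) are kept
```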
Cast frozen modules to fp16/bf16 when using mixed precision. Add gradient checkpoint command line option.
I added a commit that removes the autocast. It should work with fp32 and bf16 too, but I can't test it on my GPU. This PR does have a side effect: it saves fp16-quantized weights for the unet and vae, since the fp32 weights for those are discarded when training in fp16. If you prefer, I can remove the training changes and only keep the gradient checkpointing change.
I think this PR is currently blocked by:
cc @patil-suraj
Any progress on the blockers?
@patil-suraj could you have another look here?
patil-suraj left a comment
Sorry to only come back to this now!
- The output of a `gradient_checkpointing`-enabled model in `train` mode is deterministic when dropout is zero (which is the default); I ran a few tests to confirm this.
- We have dropout set to 0 by default and need to allow passing it to the model. But this could be added later.
So there is no blocker for this PR now. @Ttl, could you please adapt the PR to put the unet in train mode and not modify the gradient checkpointing parts? Then I'll open a follow-up PR to allow passing dropout to the model, which we can keep at 0 to always make sure we have deterministic output from the unet.
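A short sketch of that suggestion (checkpoint name just for illustration): the unet stays frozen, but putting it in train mode keeps the existing `if self.training and self.gradient_checkpointing:` branch active.

```python
from diffusers import UNet2DConditionModel

# Illustrative checkpoint; the script takes this from --pretrained_model_name_or_path.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# No grads are computed for the unet's own weights, but with the module in
# train mode its blocks take the gradient-checkpointing path, so activations
# are recomputed during backward instead of being stored.
unet.requires_grad_(False)
unet.enable_gradient_checkpointing()
unet.train()
```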
cc @patil-suraj here - could you maybe pick up the PR if the author doesn't reply anymore, so it doesn't get forgotten?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
cc @patil-suraj - could you maybe post some instructions here on how to proceed? Then someone else could pick it up.
Sorry for being late again; I've posted instructions in this comment: #687 (review). @Ttl, let me know if you are busy, then I'll make the necessary changes and merge :)
It's been quite a while since I last looked at this code, and I haven't used textual inversion much anymore. Feel free to make the necessary changes to get it merged if you want to.
Thanks, will open a PR then :)
Cast frozen modules to fp16/bf16 when using mixed precision. Add gradient checkpoint command line option.
This OOMed before on my 8 GB VRAM GPU. With these changes and `--mixed_precision=fp16 --gradient_checkpointing`, VRAM use is 6341 MB and the results look good.