[train_text_to_image] allow using non-ema weights for training #1834
Conversation
The documentation is not available anymore as the PR was closed or merged.
pcuenca
left a comment
Looks great, thanks a lot!
However, I may not be fully understanding it yet. I know that `state_dict` and `load_state_dict` are used by accelerate during the checkpointing process, but I don't understand how `store` and `restore` are used. In addition, the line to resume from a checkpoint appears to have been removed; is resuming performed differently now?
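For reference, here is a minimal sketch of the store / copy_to / restore pattern the question refers to. It is not the PR's actual `EMAModel`; the class name, decay value, and toy model below are made up for illustration. The idea: stash the live training weights, swap in the EMA weights for evaluation or saving, then put the training weights back.

```python
import torch


class TinyEMA:
    """Toy EMA helper used only to illustrate store / copy_to / restore."""

    def __init__(self, parameters, decay=0.999):
        self.decay = decay
        self.shadow = [p.detach().clone() for p in parameters]
        self.backup = None

    def step(self, parameters):
        # standard EMA update: shadow = decay * shadow + (1 - decay) * param
        for s, p in zip(self.shadow, parameters):
            s.mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)

    def store(self, parameters):
        # remember the current (non-EMA) training weights
        self.backup = [p.detach().clone() for p in parameters]

    def copy_to(self, parameters):
        # overwrite the model with the EMA weights
        for s, p in zip(self.shadow, parameters):
            p.data.copy_(s)

    def restore(self, parameters):
        # bring the stored training weights back after evaluation/saving
        for b, p in zip(self.backup, parameters):
            p.data.copy_(b)


model = torch.nn.Linear(4, 4)
ema = TinyEMA(model.parameters())
ema.step(model.parameters())

ema.store(model.parameters())    # stash training weights
ema.copy_to(model.parameters())  # evaluate or save with EMA weights here
ema.restore(model.parameters())  # resume training with the original weights
```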
```
    temporarily stored. If `None`, the parameters with which this
    `ExponentialMovingAverage` was initialized will be used.
    """
    parameters = list(parameters)
```
Very minor question: why do we need the conversion to a list here?
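A likely reason (my assumption, not stated in the thread) is that `parameters` may be a generator such as the one returned by `Module.parameters()`, which can only be iterated once; materializing it as a list lets the EMA code loop over it more than once.

```python
import torch

model = torch.nn.Linear(2, 2)

params = model.parameters()        # generator
print(len(list(params)))           # 2 (weight, bias)
print(len(list(params)))           # 0 -- the generator is already exhausted

params = list(model.parameters())  # materialize once...
print(len(params), len(params))    # ...and it can be iterated any number of times
```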
My bad, removed it by mistake.
```python
if args.allow_tf32:
    torch.backends.cuda.matmul.allow_tf32 = True
```
This gives ~1.3x speed-up on A100.
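For context, a hedged sketch of how a flag like `--allow_tf32` can be wired up; the argument parser and help text below are assumptions, only the `torch.backends` line mirrors the diff above.

```python
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--allow_tf32",
    action="store_true",
    help="Allow TF32 matmuls on Ampere GPUs to speed up full-precision (fp32) training.",
)
args = parser.parse_args()

if args.allow_tf32:
    # TF32 trades a small amount of matmul precision for throughput on Ampere (e.g. A100).
    torch.backends.cuda.matmul.allow_tf32 = True
```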
```python
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)
accelerator.register_for_checkpointing(lr_scheduler)
```
It's not required to register `lr_scheduler` here; it's automatically checkpointed by accelerate. We only need to register custom objects.
Interesting, thanks. This documentation led me to believe we needed to register it, but in those examples the learning rate scheduler is not being passed to prepare.
Hmm should we maybe ask on the accelerate repo?
I've verified this: all standard objects that we pass to `prepare` (like `nn.Module`, `DataLoader`, `Optimizer`, `Scheduler`) are automatically checkpointed by accelerate. We only need to register custom objects or models that we don't pass to `prepare`.
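A minimal sketch of that behaviour, using a toy model/optimizer/dataloader/scheduler (none of this is the PR's code): everything passed to `prepare` is saved and restored by `save_state`/`load_state` without an explicit `register_for_checkpointing` call.

```python
import torch
from accelerate import Accelerator

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(8, 4), batch_size=2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

accelerator = Accelerator()
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)
# No register_for_checkpointing() needed for the objects above; it would only be
# required for extra stateful objects (e.g. an EMA wrapper exposing state_dict /
# load_state_dict) that are not passed to prepare().

accelerator.save_state("checkpoint-0")  # saves model/optimizer/scheduler/dataloader state
accelerator.load_state("checkpoint-0")  # restores the same state to resume training
```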
```python
inputs = tokenizer(
    captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
)
```
Always padding to `max_length` now to completely match the original implementation.
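To illustrate the difference (this snippet assumes the CLIP tokenizer can be downloaded; it is not the training script itself): `padding=True` pads only to the longest caption in the batch, while `padding="max_length"` always pads to `tokenizer.model_max_length` (77 for CLIP), matching the original implementation.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
captions = ["a photo of a cat", "a painting"]

longest = tokenizer(captions, padding=True, return_tensors="pt")
fixed = tokenizer(
    captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
)
print(longest.input_ids.shape)  # padded only to the longest caption in the batch
print(fixed.input_ids.shape)    # always (batch_size, 77)
```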
pcuenca
left a comment
Looks great!
patrickvonplaten
left a comment
Looks clean! I trust @pcuenca and @patil-suraj here :-)
This PR allows using `non-ema` weights for training and `ema` weights for EMA updates, to mimic the original training process. For now, the workflow is as follows:
- The non-EMA weights live in a branch (revision) called `non-ema` (see the loading sketch after this list).
- Adds a `--non_ema_revision` argument. If it's `None`, it will default to using EMA weights for both training and EMA updates, as is the case now.
- If `--non_ema_revision` is specified, it will be used to load the `unet` for training, and the EMA (main) weights will be used for EMA updates.
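A hedged sketch of the loading logic described above (the model id is an assumption used only for illustration): the trainable `unet` comes from the `non-ema` revision, while the main-revision weights seed the EMA copy.

```python
from diffusers import UNet2DConditionModel

model_id = "CompVis/stable-diffusion-v1-4"  # assumed model id, for illustration only
non_ema_revision = "non-ema"                # value that would be passed via --non_ema_revision

# weights used for gradient updates
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet", revision=non_ema_revision)

# EMA (main-revision) weights used to initialize the EMA model
ema_unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
```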
This approach of using branches is not the best solution, but it will be used until we have the `variations` feature in `diffusers`.

This PR also:
- Makes sure the EMA `unet` is checkpointed.
- Adds `--allow_tf32` to enable TF32 on Ampere GPUs (A100) for faster full-precision training. Gives a ~1.33x speed-up.

Example command:
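(The command below is illustrative only; apart from `--non_ema_revision` and `--allow_tf32`, which this PR adds, the model, dataset, and remaining flags are assumptions and may differ from the PR's original example.)

```bash
accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --use_ema \
  --non_ema_revision="non-ema" \
  --allow_tf32 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --output_dir="sd-model-finetuned"
```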
Fixes #1153