
Mixed precision is not working on dreambooth example #817

@tcapelle


Describe the bug

Whenever I run the script with the `mixed_precision="fp16"` flag, it crashes.

  • It appears that something with the grad scaler in accelerate is not working.
  • It also seems odd to me to call `optimizer.zero_grad()` after the backward pass, but maybe accelerate is doing some magic I am not aware of.
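For reference, gradients only accumulate during `backward()`, so clearing them right after `optimizer.step()` is equivalent to clearing them just before the next `backward()`. A minimal plain-PyTorch sketch of that ordering (no accelerate, hypothetical tiny model):

```python
import torch
import torch.nn.functional as F

# Standard training-step ordering: backward -> step -> zero_grad.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = F.mse_loss(model(x), y)
loss.backward()   # gradients are populated here
opt.step()        # parameters are updated using those gradients
opt.zero_grad()   # grads cleared here, ready for the next iteration

# Every parameter's grad is now cleared (None or all-zero, depending on
# the zero_grad(set_to_none=...) default in the installed torch version).
assert all(p.grad is None or bool((p.grad == 0).all()) for p in model.parameters())
```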
```
    main()
  File "/home/tcapelle/Apps/diffusers/examples/dreambooth/train_dreambooth.py", line 557, in main
    accelerator.backward(loss)
  File "/home/tcapelle/Apps/accelerate/src/accelerate/accelerator.py", line 1005, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Half but expected Float
```
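The traceback suggests the loss tensor reaching the grad scaler's `backward()` is fp16 rather than fp32. A common workaround for this class of dtype error is to upcast both the model prediction and the target to float32 before computing the loss, so that the tensor handed to `backward()` is float32. A minimal sketch (hypothetical `pred` tensor standing in for the UNet output, which under fp16 autocast can come out as `torch.float16`):

```python
import torch
import torch.nn.functional as F

pred = torch.randn(4, 8, requires_grad=True)  # float32 leaf parameter-like tensor
pred_half = pred.half()                       # simulates an fp16 model output
target = torch.randn(4, 8)

# Workaround: upcast both sides to float32 before the loss, so the loss
# tensor passed to backward() (and thus to the grad scaler) is float32.
loss = F.mse_loss(pred_half.float(), target.float())
loss.backward()

assert loss.dtype == torch.float32
assert pred.grad is not None and pred.grad.dtype == torch.float32
```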

Reproduction

I am running the following:

```shell
python train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of heccap16 boy" \
  --class_prompt="boy" \
  --resolution=512 \
  --train_batch_size=2 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=4000 \
  --mixed_precision="fp16"
```

Running on a GCP machine with an A100, with diffusers installed from master.

Logs

No response

System Info

- `diffusers` version: 0.5.0.dev0
- Platform: Linux-5.13.0-1027-gcp-x86_64-with-glibc2.31
- Python version: 3.10.6
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1
- Using GPU in script?: 1 x A100 40GB
- Using distributed or parallel set-up in script?: No

Labels

bug (Something isn't working)
