Closed
Labels: bug (Something isn't working)
Description
Describe the bug
Whenever I run the code with the mixed_precision="fp16" flag it crashes.
- It appears that something with the GradScaler in accelerate is not working.
- It seems strange to me that optimizer.zero_grad() is called after the backward pass, but maybe accelerate is doing some magic I am not aware of.
main()
File "/home/tcapelle/Apps/diffusers/examples/dreambooth/train_dreambooth.py", line 557, in main
accelerator.backward(loss)
File "/home/tcapelle/Apps/accelerate/src/accelerate/accelerator.py", line 1005, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Half but expected Float
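This error typically means the loss tensor reaching backward() is fp16 (Half) where autograd expects fp32. A hedged illustration of the mismatch and the usual workaround — casting both sides of the loss to float32 before computing it, which is the pattern the diffusers training scripts later adopted (this is a standalone sketch, not the actual train_dreambooth.py code):

```python
import torch
import torch.nn.functional as F

# Simulated fp16 model output and target, as produced when the model
# weights and inputs are cast to half precision.
pred = torch.randn(4, 3, dtype=torch.float16, requires_grad=True)
target = torch.randn(4, 3, dtype=torch.float16)

# Computing the loss in float32 keeps autograd happy; the backward pass
# still delivers fp16 gradients to the fp16 leaf tensor.
loss = F.mse_loss(pred.float(), target.float(), reduction="mean")
loss.backward()
```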
Reproduction
I am running the following:
(dream) .../examples/dreambooth > python train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of heccap16 boy" \
--class_prompt="boy" \
--resolution=512 \
--train_batch_size=2 \
--gradient_accumulation_steps=1 \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=4000 \
--mixed_precision="fp16"
Running on a GCP machine with an A100, with diffusers installed from master.
Logs
No response
System Info
- `diffusers` version: 0.5.0.dev0
- Platform: Linux-5.13.0-1027-gcp-x86_64-with-glibc2.31
- Python version: 3.10.6
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1
- Using GPU in script?: 1 x A100 40GB
- Using distributed or parallel set-up in script?: No