diff --git a/README.md b/README.md
index ac518fb8ec76..2302137c1e72 100644
--- a/README.md
+++ b/README.md
@@ -182,9 +182,9 @@ image.save("astronaut_rides_horse.png")
 
 ### JAX/Flax
 
-To use StableDiffusion on TPUs and GPUs for faster inference you can leverage JAX/Flax.
+Diffusers offers a JAX / Flax implementation of Stable Diffusion for very fast inference. JAX shines especially on TPU hardware because each TPU server has 8 accelerators working in parallel, but it runs great on GPUs too.
 
-Running the pipeline with default PNDMScheduler
+Running the pipeline with the default PNDMScheduler:
 
 ```python
 import jax
@@ -331,8 +331,25 @@ You can generate your own latents to reproduce results, or tweak your prompt on
 
 For more details, check out [the Stable Diffusion notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb)
 and have a look into the [release notes](https://github.com/huggingface/diffusers/releases/tag/v0.2.0).
 
-
-## Examples
+
+## Fine-Tuning Stable Diffusion
+
+Fine-tuning techniques make it possible to adapt Stable Diffusion to your own dataset, or add new subjects to it. These are some of the techniques supported in `diffusers`:
+
+- Textual Inversion. Capture novel concepts from a small set of sample images and associate them with new "words" in the embedding space of the text encoder. These special words can then be used within text prompts to achieve very fine-grained control of the resulting images. Please refer to [our training examples](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) or [documentation](https://huggingface.co/docs/diffusers/training/text_inversion) to try for yourself.
+
+- Dreambooth. Another technique to capture new concepts in Stable Diffusion. This method fine-tunes the UNet (and, optionally, also the text encoder) of the pipeline to achieve impressive results. Please refer to [our training examples](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) and [training report](https://wandb.ai/psuraj/dreambooth/reports/Dreambooth-Training-Analysis--VmlldzoyNzk0NDc3) for additional details and training recommendations.
+
+- Full Stable Diffusion fine-tuning. If you have a more sizable dataset with a specific look or style, you can fine-tune Stable Diffusion so that it outputs images following those examples. This was the approach taken to create [a Pokémon Stable Diffusion model](https://huggingface.co/justinpinkney/pokemon-stable-diffusion) (by Justin Pinkney / Lambda Labs) and [a Japanese-specific version of Stable Diffusion](https://huggingface.co/spaces/rinna/japanese-stable-diffusion) (by [Rinna Co.](https://github.com/rinnakk/japanese-stable-diffusion/) and others). You can start at [our text-to-image fine-tuning example](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image) and go from there.
+
+
+## Stable Diffusion Community Pipelines
+
+The release of Stable Diffusion as an open source model has fostered a lot of interesting ideas and experimentation.
Our [Community Examples folder](https://github.com/huggingface/diffusers/tree/main/examples/community) contains many ideas worth exploring, like interpolating to create animated videos, using CLIP Guidance for additional prompt fidelity, term weighting, and much more! Take a look and [contribute your own](https://huggingface.co/docs/diffusers/using-diffusers/custom_pipelines). + +## Other Examples There are many ways to try running Diffusers! Here we outline code-focused tools (primarily using `DiffusionPipeline`s and Google Colab) and interactive web-tools. diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 295e24674854..7e46d95a46f1 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -46,9 +46,11 @@ - local: training/unconditional_training title: "Unconditional Image Generation" - local: training/text_inversion - title: "Text Inversion" + title: "Textual Inversion" + - local: training/dreambooth + title: "Dreambooth" - local: training/text2image - title: "Text-to-image" + title: "Text-to-image fine-tuning" title: "Training" - sections: - local: conceptual/stable_diffusion diff --git a/docs/source/api/schedulers.mdx b/docs/source/api/schedulers.mdx index 80fbfdcb2321..3f88e563de19 100644 --- a/docs/source/api/schedulers.mdx +++ b/docs/source/api/schedulers.mdx @@ -89,7 +89,7 @@ Original implementation can be found [here](https://github.com/crowsonkb/k-diffu [[autodoc]] PNDMScheduler -#### variance exploding stochastic differential equation (SDE) scheduler +#### variance exploding stochastic differential equation (VE-SDE) scheduler Original paper can be found [here](https://arxiv.org/abs/2011.13456). @@ -99,7 +99,9 @@ Original paper can be found [here](https://arxiv.org/abs/2011.13456). Original implementation can be found [here](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296). -#### variance preserving stochastic differential equation (SDE) scheduler +[[autodoc]] IPNDMScheduler + +#### variance preserving stochastic differential equation (VP-SDE) scheduler Original paper can be found [here](https://arxiv.org/abs/2011.13456). diff --git a/docs/source/conceptual/stable_diffusion.mdx b/docs/source/conceptual/stable_diffusion.mdx index c00359d2ac1a..3e87674aa83a 100644 --- a/docs/source/conceptual/stable_diffusion.mdx +++ b/docs/source/conceptual/stable_diffusion.mdx @@ -12,6 +12,4 @@ specific language governing permissions and limitations under the License. # Stable Diffusion -Under construction 🚧 - -For now please visit this [very in-detail blog post](https://huggingface.co/blog/stable_diffusion) +Please visit this [very in-detail blog post](https://huggingface.co/blog/stable_diffusion) on Stable Diffusion! diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx index fa9466b6d1aa..1c9460eca8bd 100644 --- a/docs/source/installation.mdx +++ b/docs/source/installation.mdx @@ -53,7 +53,7 @@ The `main` version is useful for staying up-to-date with the latest developments For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet. However, this means the `main` version may not always be stable. We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day. -If you run into a problem, please open an [Issue](https://github.com/huggingface/transformers/issues) so we can fix it even sooner! 
+If you run into a problem, please open an [Issue](https://github.com/huggingface/transformers/issues), so we can fix it even sooner! ## Editable install diff --git a/docs/source/optimization/fp16.mdx b/docs/source/optimization/fp16.mdx index b4ffed31c76b..f12c067ba5ee 100644 --- a/docs/source/optimization/fp16.mdx +++ b/docs/source/optimization/fp16.mdx @@ -18,9 +18,9 @@ We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for | ---------------- | ------- | ------- | | original | 9.50s | x1 | | cuDNN auto-tuner | 9.37s | x1.01 | -| autocast (fp16) | 5.47s | x1.91 | -| fp16 | 3.61s | x2.91 | -| channels last | 3.30s | x2.87 | +| autocast (fp16) | 5.47s | x1.74 | +| fp16 | 3.61s | x2.63 | +| channels last | 3.30s | x2.88 | | traced UNet | 3.21s | x2.96 | diff --git a/docs/source/training/dreambooth.mdx b/docs/source/training/dreambooth.mdx new file mode 100644 index 000000000000..cf7e5dbcec41 --- /dev/null +++ b/docs/source/training/dreambooth.mdx @@ -0,0 +1,240 @@ + + +# DreamBooth fine-tuning example + +[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like stable diffusion given just a few (3~5) images of a subject. + +![Dreambooth examples from the project's blog](https://dreambooth.github.io/DreamBooth_files/teaser_static.jpg) +_Dreambooth examples from the [project's blog](https://dreambooth.github.io)._ + +The [Dreambooth training script](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) shows how to implement this training procedure on a pre-trained Stable Diffusion model. + + + + + +Dreambooth fine-tuning is very sensitive to hyperparameters and easy to overfit. We recommend you take a look at our [in-depth analysis](https://wandb.ai/psuraj/dreambooth/reports/Dreambooth-Training-Analysis--VmlldzoyNzk0NDc3) with recommended settings for different subjects, and go from there. + + + +## Training locally + +### Installing the dependencies + +Before running the scripts, make sure to install the library's training dependencies. We also recommend to install `diffusers` from the `main` github branch. + +```bash +pip install git+https://github.com/huggingface/diffusers +pip install -U -r diffusers/examples/dreambooth/requirements.txt +``` + +Then initialize and configure a [🤗 Accelerate](https://github.com/huggingface/accelerate/) environment with: + +```bash +accelerate config +``` + +You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree. + +You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens). + +Run the following command to authenticate your token + +```bash +huggingface-cli login +``` + +If you have already cloned the repo, then you won't need to go through these steps. Instead, you can pass the path to your local checkout to the training script and it will be loaded from there. + +### Dog toy example + +In this example we'll use [these images](https://drive.google.com/drive/folders/1BO_dyz-p65qhBRRMRA4TbZ8qW4rB99JZ) to add a new concept to Stable Diffusion using the Dreambooth process. They will be our training data. Please, download them and place them somewhere in your system. 
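+
+Optionally, you can sanity-check that the images load correctly before training. This is a minimal sketch (not part of the training script); the folder name is just a placeholder for the directory you will later pass as `--instance_data_dir`:
+
+```python
+from pathlib import Path
+
+from PIL import Image
+
+# the folder containing the downloaded dog images (illustrative path)
+instance_dir = Path("path_to_training_images")
+images = [
+    Image.open(p)
+    for p in sorted(instance_dir.iterdir())
+    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
+]
+print(f"Found {len(images)} instance images with sizes {[img.size for img in images]}")
+```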
+ +Then you can launch the training script using: + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="path_to_training_images" +export OUTPUT_DIR="path_to_saved_model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a photo of sks dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=1 \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --max_train_steps=400 +``` + +### Training with a prior-preserving loss + +Prior preservation is used to avoid overfitting and language-drift. Please, refer to the paper to learn more about it if you are interested. For prior preservation, we use other images of the same class as part of the training process. The nice thing is that we can generate those images using the Stable Diffusion model itself! The training script will save the generated images to a local path we specify. + +According to the paper, it's recommended to generate `num_epochs * num_samples` images for prior preservation. 200-300 works well for most cases. + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="path_to_training_images" +export CLASS_DIR="path_to_class_images" +export OUTPUT_DIR="path_to_saved_model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=1 \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --num_class_images=200 \ + --max_train_steps=800 +``` + +### Training on a 16GB GPU + +With the help of gradient checkpointing and the 8-bit optimizer from [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), it's possible to train dreambooth on a 16GB GPU. + +```bash +pip install bitsandbytes +``` + +Then pass the `--use_8bit_adam` option to the training script. + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="path_to_training_images" +export CLASS_DIR="path_to_class_images" +export OUTPUT_DIR="path_to_saved_model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=2 --gradient_checkpointing \ + --use_8bit_adam \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --num_class_images=200 \ + --max_train_steps=800 +``` + +### Fine-tune the text encoder in addition to the UNet + +The script also allows to fine-tune the `text_encoder` along with the `unet`. It has been observed experimentally that this gives much better results, especially on faces. Please, refer to [our report](https://wandb.ai/psuraj/dreambooth/reports/Dreambooth-Training-Analysis--VmlldzoyNzk0NDc3) for more details. + +To enable this option, pass the `--train_text_encoder` argument to the training script. 
+
+
+Training the text encoder requires additional memory, so training won't fit on a 16GB GPU. You'll need at least 24GB VRAM to use this option.
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export INSTANCE_DIR="path_to_training_images"
+export CLASS_DIR="path_to_class_images"
+export OUTPUT_DIR="path_to_saved_model"
+
+accelerate launch train_dreambooth.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --train_text_encoder \
+  --instance_data_dir=$INSTANCE_DIR \
+  --class_data_dir=$CLASS_DIR \
+  --output_dir=$OUTPUT_DIR \
+  --with_prior_preservation --prior_loss_weight=1.0 \
+  --instance_prompt="a photo of sks dog" \
+  --class_prompt="a photo of dog" \
+  --resolution=512 \
+  --train_batch_size=1 \
+  --use_8bit_adam \
+  --gradient_checkpointing \
+  --learning_rate=2e-6 \
+  --lr_scheduler="constant" \
+  --lr_warmup_steps=0 \
+  --num_class_images=200 \
+  --max_train_steps=800
+```
+
+### Training on an 8 GB GPU
+
+Using [DeepSpeed](https://www.deepspeed.ai/), it's even possible to offload some tensors from VRAM to either CPU or NVMe, allowing training to proceed with less GPU memory.
+
+DeepSpeed needs to be enabled with `accelerate config`. During configuration, answer yes to "Do you want to use DeepSpeed?". Combining DeepSpeed stage 2, fp16 mixed precision, and offloading both the model parameters and the optimizer state to CPU, it's possible to train on under 8 GB of VRAM. The drawback is that this requires more system RAM (about 25 GB). See [the DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options.
+
+Changing the default Adam optimizer to DeepSpeed's special version of Adam, `deepspeed.ops.adam.DeepSpeedCPUAdam`, gives a substantial speedup, but enabling it requires the system's CUDA toolchain version to be the same as the one installed with PyTorch. 8-bit optimizers don't seem to be compatible with DeepSpeed at the moment.
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export INSTANCE_DIR="path_to_training_images"
+export CLASS_DIR="path_to_class_images"
+export OUTPUT_DIR="path_to_saved_model"
+
+accelerate launch train_dreambooth.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --instance_data_dir=$INSTANCE_DIR \
+  --class_data_dir=$CLASS_DIR \
+  --output_dir=$OUTPUT_DIR \
+  --with_prior_preservation --prior_loss_weight=1.0 \
+  --instance_prompt="a photo of sks dog" \
+  --class_prompt="a photo of dog" \
+  --resolution=512 \
+  --train_batch_size=1 \
+  --sample_batch_size=1 \
+  --gradient_accumulation_steps=1 --gradient_checkpointing \
+  --learning_rate=5e-6 \
+  --lr_scheduler="constant" \
+  --lr_warmup_steps=0 \
+  --num_class_images=200 \
+  --max_train_steps=800 \
+  --mixed_precision=fp16
+```
+
+## Inference
+
+Once you have trained a model, you can run inference with the `StableDiffusionPipeline` by simply pointing it to the path where the model was saved. Make sure that your prompts include the special `identifier` used during training (`sks` in the previous examples).
+ +```python +from diffusers import StableDiffusionPipeline +import torch + +model_id = "path_to_saved_model" +pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") + +prompt = "A photo of sks dog in a bucket" +image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0] + +image.save("dog-bucket.png") +``` diff --git a/docs/source/training/overview.mdx b/docs/source/training/overview.mdx index c403298493e5..9b36117ca43a 100644 --- a/docs/source/training/overview.mdx +++ b/docs/source/training/overview.mdx @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. # 🧨 Diffusers Training Examples -Diffusers examples are a collection of scripts to demonstrate how to effectively use the `diffusers` library +Diffusers training examples are a collection of scripts to demonstrate how to effectively use the `diffusers` library for a variety of use cases. **Note**: If you are looking for **official** examples on how to use `diffusers` for inference, @@ -36,13 +36,15 @@ Training examples show how to pretrain or fine-tune diffusion models for a varie - [Unconditional Training](./unconditional_training) - [Text-to-Image Training](./text2image) - [Text Inversion](./text_inversion) +- [Dreambooth](./dreambooth) | Task | 🤗 Accelerate | 🤗 Datasets | Colab |---|---|:---:|:---:| | [**Unconditional Image Generation**](./unconditional_training) | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) -| [**Text-to-Image**](./text2image) | - | - | -| [**Text-Inversion**](./text_inversion) | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) +| [**Text-to-Image fine-tuning**](./text2image) | ✅ | ✅ | +| [**Textual Inversion**](./text_inversion) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) +| [**Dreambooth**](./dreambooth) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) ## Community diff --git a/docs/source/training/text2image.mdx b/docs/source/training/text2image.mdx index dcbdece429a9..1b04462f77f1 100644 --- a/docs/source/training/text2image.mdx +++ b/docs/source/training/text2image.mdx @@ -11,6 +11,128 @@ specific language governing permissions and limitations under the License. --> -# Text-to-Image Training +# Stable Diffusion text-to-image fine-tuning -Under construction 🚧 +The [`train_text_to_image.py`](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) script shows how to fine-tune the stable diffusion model on your own dataset. + + + +The text-to-image fine-tuning script is experimental. It's easy to overfit and run into issues like catastrophic forgetting. We recommend to explore different hyperparameters to get the best results on your dataset. 
+ + + + +## Running locally + +### Installing the dependencies + +Before running the scripts, make sure to install the library's training dependencies: + +```bash +pip install git+https://github.com/huggingface/diffusers.git +pip install -U -r requirements.txt +``` + +And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with: + +```bash +accelerate config +``` + +You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree. + +You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens). + +Run the following command to authenticate your token + +```bash +huggingface-cli login +``` + +If you have already cloned the repo, then you won't need to go through these steps. Instead, you can pass the path to your local checkout to the training script and it will be loaded from there. + +### Hardware Requirements for Fine-tuning + +Using `gradient_checkpointing` and `mixed_precision` it should be possible to fine tune the model on a single 24GB GPU. For higher `batch_size` and faster training it's better to use GPUs with more than 30GB of GPU memory. You can also use JAX / Flax for fine-tuning on TPUs or GPUs, see [below](#flax-jax-finetuning) for details. + +### Fine-tuning Example + +The following script will launch a fine-tuning run using [Justin Pinkneys' captioned Pokemon dataset](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions), available in Hugging Face Hub. + + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export dataset_name="lambdalabs/pokemon-blip-captions" + +accelerate launch train_text_to_image.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --dataset_name=$dataset_name \ + --use_ema \ + --resolution=512 --center_crop --random_flip \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --gradient_checkpointing \ + --mixed_precision="fp16" \ + --max_train_steps=15000 \ + --learning_rate=1e-05 \ + --max_grad_norm=1 \ + --lr_scheduler="constant" --lr_warmup_steps=0 \ + --output_dir="sd-pokemon-model" +``` + +To run on your own training files you need to prepare the dataset according to the format required by `datasets`. You can upload your dataset to the Hub, or you can prepare a local folder with your files. [This documentation](https://huggingface.co/docs/datasets/v2.4.0/en/image_load#imagefolder-with-metadata) explains how to do it. + +You should modify the script if you wish to use custom loading logic. 
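+
+As a rough sketch of the standard (non-custom) path, a local folder in the `imagefolder` format with a `metadata.jsonl` file can be loaded and inspected with 🤗 Datasets before launching training (the `data_dir` path and the caption column name used here are illustrative assumptions):
+
+```python
+from datasets import load_dataset
+
+# expects path_to_your_dataset/ to contain the images plus a metadata.jsonl file
+# with a "file_name" field and a caption column (e.g. "text") for each image
+dataset = load_dataset("imagefolder", data_dir="path_to_your_dataset", split="train")
+print(dataset)
+print(dataset[0]["text"])  # assumes the caption column is called "text"
+```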
+
+For custom loading logic, we have left pointers in the appropriate places in the training script :)
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export TRAIN_DIR="path_to_your_dataset"
+export OUTPUT_DIR="path_to_save_model"
+
+accelerate launch train_text_to_image.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --train_data_dir=$TRAIN_DIR \
+  --use_ema \
+  --resolution=512 --center_crop --random_flip \
+  --train_batch_size=1 \
+  --gradient_accumulation_steps=4 \
+  --gradient_checkpointing \
+  --mixed_precision="fp16" \
+  --max_train_steps=15000 \
+  --learning_rate=1e-05 \
+  --max_grad_norm=1 \
+  --lr_scheduler="constant" --lr_warmup_steps=0 \
+  --output_dir=${OUTPUT_DIR}
+```
+
+Once training is finished, the model will be saved to the `OUTPUT_DIR` specified in the command. To load the fine-tuned model for inference, just pass that path to `StableDiffusionPipeline`:
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+
+model_path = "path_to_saved_model"
+pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = pipe(prompt="yoda").images[0]
+image.save("yoda-pokemon.png")
+```
+
+### Flax / JAX fine-tuning
+
+Thanks to [@duongna21](https://github.com/duongna21) it's possible to fine-tune Stable Diffusion using Flax! This is very efficient on TPU hardware but works great on GPUs too. You can use the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_flax.py) like this:
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
+export dataset_name="lambdalabs/pokemon-blip-captions"
+
+python train_text_to_image_flax.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --dataset_name=$dataset_name \
+  --resolution=512 --center_crop --random_flip \
+  --train_batch_size=1 \
+  --max_train_steps=15000 \
+  --learning_rate=1e-05 \
+  --max_grad_norm=1 \
+  --output_dir="sd-pokemon-model"
+```
diff --git a/docs/source/training/text_inversion.mdx b/docs/source/training/text_inversion.mdx
index d44cef301ed7..7bc145299eac 100644
--- a/docs/source/training/text_inversion.mdx
+++ b/docs/source/training/text_inversion.mdx
@@ -49,7 +49,7 @@ The `textual_inversion.py` script [here](https://github.com/huggingface/diffuser
 
 ### Installing the dependencies
 
-Before running the scripts, make sure to install the library's training dependencies:
+Before running the scripts, make sure to install the library's training dependencies.
 
 ```bash
 pip install diffusers[training] accelerate transformers
diff --git a/examples/README.md b/examples/README.md
index 2680b638d585..56b1b81b9060 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -16,7 +16,7 @@ limitations under the License.
 
 # 🧨 Diffusers Examples
 
 Diffusers examples are a collection of scripts to demonstrate how to effectively use the `diffusers` library
-for a variety of use cases.
+for a variety of use cases involving training or fine-tuning.
**Note**: If you are looking for **official** examples on how to use `diffusers` for inference, please have a look at [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines) @@ -38,7 +38,11 @@ Training examples show how to pretrain or fine-tune diffusion models for a varie | Task | 🤗 Accelerate | 🤗 Datasets | Colab |---|---|:---:|:---:| -| [**Unconditional Image Generation**](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) +| [**Unconditional Image Generation**](./unconditional_training) | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) +| [**Text-to-Image fine-tuning**](./text2image) | ✅ | ✅ | +| [**Textual Inversion**](./text_inversion) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) +| [**Dreambooth**](./dreambooth) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) + ## Community diff --git a/examples/community/README.md b/examples/community/README.md index 3f92f202d8c4..e27e1d3fe429 100644 --- a/examples/community/README.md +++ b/examples/community/README.md @@ -16,6 +16,9 @@ If a community doesn't work as expected, please open an issue and ping the autho | Speech to Image | Using automatic-speech-recognition to transcribe text and Stable Diffusion to generate images | [Speech to Image](#speech-to-image) | - | [Mikail Duzenli](https://github.com/MikailINTech) | Wild Card Stable Diffusion | Stable Diffusion Pipeline that supports prompts that contain wildcard terms (indicated by surrounding double underscores), with values instantiated randomly from a corresponding txt file or a dictionary of possible values | [Wildcard Stable Diffusion](#wildcard-stable-diffusion) | - | [Shyam Sudhakaran](https://github.com/shyamsn97) | | Composable Stable Diffusion| Stable Diffusion Pipeline that supports prompts that contain "|" in prompts (as an AND condition) and weights (separated by "|" as well) to positively / negatively weight prompts. | [Composable Stable Diffusion](#composable-stable-diffusion) | - | [Mark Rich](https://github.com/MarkRich) | +| Seed Resizing Stable Diffusion| Stable Diffusion Pipeline that supports resizing an image and retaining the concepts of the 512 by 512 generation. | [Seed Resizing](#seed-resizing) | - | [Mark Rich](https://github.com/MarkRich) | + +| Multilingual Stable Diffusion| Stable Diffusion Pipeline that supports prompts in 50 different languages. | [Multilingual Stable Diffusion](#multilingual-stable-diffusion) | - | [Juan Carlos Piñeros](https://github.com/juancopi81) | To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines, we will merge them quickly. 
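+
+For example, loading one of the pipelines listed above could look roughly like this (a sketch: `seed_resize_stable_diffusion` refers to `examples/community/seed_resize_stable_diffusion.py`, and you still need access to the Stable Diffusion weights):
+
+```python
+import torch
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4",
+    custom_pipeline="seed_resize_stable_diffusion",  # name of a file in diffusers/examples/community
+    use_auth_token=True,
+).to("cuda" if torch.cuda.is_available() else "cpu")
+
+image = pipe("a photograph of an astronaut riding a horse").images[0]
+image.save("astronaut.png")
+```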
@@ -371,3 +374,141 @@ for i in range(4):
 for i, img in enumerate(images):
     img.save(f"./composable_diffusion/image_{i}.png")
 ```
+### Seed Resizing
+Test seed resizing: first generate an image at 512x512, then generate an image at 512x592 from the same seed using the seed resizing pipeline, and finally generate a 512x592 image with the original Stable Diffusion pipeline for comparison.
+
+```python
+import torch as th
+import numpy as np
+from diffusers import DiffusionPipeline
+
+has_cuda = th.cuda.is_available()
+device = th.device('cpu' if not has_cuda else 'cuda')
+
+pipe = DiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4",
+    use_auth_token=True,
+    custom_pipeline="seed_resize_stable_diffusion"
+).to(device)
+
+def dummy(images, **kwargs):
+    return images, False
+
+pipe.safety_checker = dummy
+
+
+images = []
+th.manual_seed(0)
+generator = th.Generator("cuda").manual_seed(0)
+
+seed = 0
+prompt = "A painting of a futuristic cop"
+
+width = 512
+height = 512
+
+res = pipe(
+    prompt,
+    guidance_scale=7.5,
+    num_inference_steps=50,
+    height=height,
+    width=width,
+    generator=generator)
+image = res.images[0]
+image.save('./seed_resize/seed_resize_{w}_{h}_image.png'.format(w=width, h=height))
+
+
+# same seed, but now at 512x592 with the seed resizing pipeline
+th.manual_seed(0)
+generator = th.Generator("cuda").manual_seed(0)
+
+pipe = DiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4",
+    use_auth_token=True,
+    custom_pipeline="seed_resize_stable_diffusion"
+).to(device)
+
+width = 512
+height = 592
+
+res = pipe(
+    prompt,
+    guidance_scale=7.5,
+    num_inference_steps=50,
+    height=height,
+    width=width,
+    generator=generator)
+image = res.images[0]
+image.save('./seed_resize/seed_resize_{w}_{h}_image.png'.format(w=width, h=height))
+
+# for comparison, the original Stable Diffusion pipeline at the same resolution
+pipe_compare = DiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4",
+    use_auth_token=True
+).to(device)
+
+res = pipe_compare(
+    prompt,
+    guidance_scale=7.5,
+    num_inference_steps=50,
+    height=height,
+    width=width,
+    generator=generator
+)
+
+image = res.images[0]
+image.save('./seed_resize/seed_resize_{w}_{h}_image_compare.png'.format(w=width, h=height))
+```
+
+### Multilingual Stable Diffusion Pipeline
+
+The following code can generate images from texts in different languages using the pre-trained [mBART-50 many-to-one multilingual machine translation model](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt) and Stable Diffusion.
+ +```Python +from PIL import Image +import torch +from diffusers import DiffusionPipeline +from transformers import ( + pipeline, + MBart50TokenizerFast, + MBartForConditionalGeneration, +) +device = "cuda" if torch.cuda.is_available() else "cpu" +device_dict = {"cuda": 0, "cpu": -1} +# helper function taken from: https://huggingface.co/blog/stable_diffusion +def image_grid(imgs, rows, cols): + assert len(imgs) == rows*cols + w, h = imgs[0].size + grid = Image.new('RGB', size=(cols*w, rows*h)) + grid_w, grid_h = grid.size + for i, img in enumerate(imgs): + grid.paste(img, box=(i%cols*w, i//cols*h)) + return grid +# Add language detection pipeline +language_detection_model_ckpt = "papluca/xlm-roberta-base-language-detection" +language_detection_pipeline = pipeline("text-classification", + model=language_detection_model_ckpt, + device=device_dict[device]) +# Add model for language translation +trans_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt") +trans_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device) +diffuser_pipeline = DiffusionPipeline.from_pretrained( + "CompVis/stable-diffusion-v1-4", + custom_pipeline="multilingual_stable_diffusion", + language_detection_pipeline=language_detection_pipeline, + translation_model=trans_model, + translation_tokenizer=trans_tokenizer, + revision="fp16", + torch_dtype=torch.float16, +) +diffuser_pipeline.enable_attention_slicing() +diffuser_pipeline = diffuser_pipeline.to(device) +prompt = ["a photograph of an astronaut riding a horse", + "Una casa en la playa", + "Ein Hund, der Orange isst", + "Un restaurant parisien"] +output = diffuser_pipeline(prompt) +images = output.images +grid = image_grid(images, rows=2, cols=2) +``` +This example produces the following image: +![image](https://user-images.githubusercontent.com/4313860/198328706-295824a4-9856-4ce5-8e66-278ceb42fd29.png) diff --git a/examples/community/multilingual_stable_diffusion.py b/examples/community/multilingual_stable_diffusion.py new file mode 100644 index 000000000000..599a8c8e9c76 --- /dev/null +++ b/examples/community/multilingual_stable_diffusion.py @@ -0,0 +1,356 @@ +import inspect +from typing import Callable, List, Optional, Union + +import torch + +from diffusers import ( + AutoencoderKL, + DDIMScheduler, + DiffusionPipeline, + LMSDiscreteScheduler, + PNDMScheduler, + UNet2DConditionModel, +) +from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion import StableDiffusionPipelineOutput +from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker +from diffusers.utils import logging +from transformers import ( + CLIPFeatureExtractor, + CLIPTextModel, + CLIPTokenizer, + MBart50TokenizerFast, + MBartForConditionalGeneration, + pipeline +) + +logger = logging.get_logger(__name__) # pylint: disable=invalid-name + +def detect_language(language_detection_pipeline: pipeline, + prompt: Union[str, List[str]], + batch_size: int) -> Union[str, List[str]]: + """Detects language(s) of prompt.""" + if batch_size == 1: + preds = language_detection_pipeline(prompt, + top_k=1, + truncation=True, + max_length=128) +<<<<<<< HEAD + return preds[0]["label"] +======= + return preds[0]['label'] +>>>>>>> a1bfa548b7e251212bb5097d3fe788973761d786 + else: + detected_languages = [] + for p in prompt: + preds = language_detection_pipeline(p, + top_k=1, + truncation=True, + max_length=128) +<<<<<<< HEAD + detected_languages.append(preds[0]["label"]) +======= + 
detected_languages.append(preds[0]['label']) +>>>>>>> a1bfa548b7e251212bb5097d3fe788973761d786 + return detected_languages + +def translate_prompt(prompt: str, + translation_tokenizer: MBart50TokenizerFast, + translation_model: MBartForConditionalGeneration, + device: torch.device) -> str: + """Translate prompt to English.""" + encoded_prompt = translation_tokenizer(prompt, return_tensors="pt").to(device) + generated_tokens = translation_model.generate(**encoded_prompt, max_new_tokens=1000) + en_trans = translation_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) + return en_trans[0] + +class MultilingualStableDiffusion(DiffusionPipeline): + r""" + Pipeline for text-to-image generation using Stable Diffusion with multilingual support. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the + library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + + Args: + language_detection_pipeline ([`pipeline`]): + Transformers pipeline to detect prompt's language. + translation_model ([`MBartForConditionalGeneration`]): + Model to translate prompt to English, if necessary. Please refer to the + [model card](https://huggingface.co/docs/transformers/model_doc/mbart) for details. + translation_tokenizer ([`MBart50TokenizerFast`]): + Tokenizer of the translation model. + vae ([`AutoencoderKL`]): + Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + text_encoder ([`CLIPTextModel`]): + Frozen text-encoder. Stable Diffusion uses the text portion of + [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically + the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + tokenizer (`CLIPTokenizer`): + Tokenizer of class + [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + scheduler ([`SchedulerMixin`]): + A scheduler to be used in combination with `unet` to denoise the encoded image latens. Can be one of + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + safety_checker ([`StableDiffusionSafetyChecker`]): + Classification module that estimates whether generated images could be considered offensive or harmful. + Please, refer to the [model card](https://huggingface.co/CompVis/stable-diffusion-v1-4) for details. + feature_extractor ([`CLIPFeatureExtractor`]): + Model that extracts features from generated images to be used as inputs for the `safety_checker`. + """ + def __init__( + self, + language_detection_pipeline: pipeline, + translation_model: MBartForConditionalGeneration, + translation_tokenizer: MBart50TokenizerFast, + vae: AutoencoderKL, + text_encoder: CLIPTextModel, + tokenizer: CLIPTokenizer, + unet: UNet2DConditionModel, + scheduler: Union[DDIMScheduler, PNDMScheduler, LMSDiscreteScheduler], + safety_checker: StableDiffusionSafetyChecker, + feature_extractor: CLIPFeatureExtractor, + ): + super().__init__() + + if safety_checker is None: + logger.warn( + f"You have disabled the safety checker for {self.__class__} by passing `safety_checker=None`. Ensure" + " that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered" + " results in services or applications open to the public. 
Both the diffusers team and Hugging Face" + " strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling" + " it only for use-cases that involve analyzing network behavior or auditing its results. For more" + " information, please have a look at https://github.com/huggingface/diffusers/pull/254 ." + ) + + self.register_modules( + language_detection_pipeline=language_detection_pipeline, + translation_model=translation_model, + translation_tokenizer=translation_tokenizer, + vae=vae, + text_encoder=text_encoder, + tokenizer=tokenizer, + unet=unet, + scheduler=scheduler, + feature_extractor=feature_extractor, + ) + + def enable_attention_slicing(self, slice_size: Optional[Union[str, int]] = "auto"): + if slice_size == "auto": + slice_size = self.unet.config.attention_head_dim // 2 + self.unet.set_attention_slice(slice_size) + + def disable_attention_slicing(self): + self.enable_attention_slicing(None) + + @torch.no_grad() + def __call__( + self, + prompt: Union[str, List[str]], + height: int = 512, + width: int = 512, + num_inference_steps: int = 50, + guidance_scale: float = 7.5, + negative_prompt: Optional[Union[str, List[str]]] = None, + num_images_per_prompt: Optional[int] = 1, + eta: float = 0.0, + generator: Optional[torch.Generator] = None, + latents: Optional[torch.FloatTensor] = None, + output_type: Optional[str] = "pil", + return_dict: bool = True, + callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None, + callback_steps: Optional[int] = 1, + **kwargs, + ): + + if isinstance(prompt, str): + batch_size = 1 + elif isinstance(prompt, list): + batch_size = len(prompt) + else: + raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}") + + if height % 8 != 0 or width % 8 != 0: + raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.") + + if (callback_steps is None) or ( + callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0) + ): + raise ValueError( + f"`callback_steps` has to be a positive integer but is {callback_steps} of type" + f" {type(callback_steps)}." 
+ ) + + # detect language and translate if necessary + prompt_language = detect_language(self.language_detection_pipeline, prompt, batch_size) + if (batch_size == 1) and (prompt_language != "en"): + prompt = translate_prompt(prompt=prompt, + translation_tokenizer=self.translation_tokenizer, + translation_model=self.translation_model, + device=self.device) + if batch_size != 1: + for index in range(batch_size): + if prompt_language[index] != "en": + p = translate_prompt(prompt=prompt[index], + translation_tokenizer=self.translation_tokenizer, + translation_model=self.translation_model, + device=self.device) + prompt[index] = p + + # get prompt text embeddings + text_inputs = self.tokenizer( + prompt, + padding="max_length", + max_length=self.tokenizer.model_max_length, + return_tensors="pt", + ) + text_input_ids = text_inputs.input_ids + + if text_input_ids.shape[-1] > self.tokenizer.model_max_length: + removed_text = self.tokenizer.batch_decode(text_input_ids[:, self.tokenizer.model_max_length :]) + logger.warning( + "The following part of your input was truncated because CLIP can only handle sequences up to" + f" {self.tokenizer.model_max_length} tokens: {removed_text}" + ) + text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length] + text_embeddings = self.text_encoder(text_input_ids.to(self.device))[0] + + # duplicate text embeddings for each generation per prompt, using mps friendly method + bs_embed, seq_len, _ = text_embeddings.shape + text_embeddings = text_embeddings.repeat(1, num_images_per_prompt, 1) + text_embeddings = text_embeddings.view(bs_embed * num_images_per_prompt, seq_len, -1) + + # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) + # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` + # corresponds to doing no classifier free guidance. + do_classifier_free_guidance = guidance_scale > 1.0 + # get unconditional embeddings for classifier free guidance + if do_classifier_free_guidance: + uncond_tokens: List[str] + if negative_prompt is None: + uncond_tokens = [""] + elif type(prompt) is not type(negative_prompt): + raise TypeError( + f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !=" + f" {type(prompt)}." + ) + elif isinstance(negative_prompt, str): + # detect language and translate it if necessary + negative_prompt_language = detect_language(self.language_detection_pipeline, negative_prompt, batch_size) + if negative_prompt_language != "en": + negative_prompt = translate_prompt(prompt=negative_prompt, + translation_tokenizer=self.translation_tokenizer, + translation_model=self.translation_model, + device=self.device) + uncond_tokens = [negative_prompt] + elif batch_size != len(negative_prompt): + raise ValueError( + f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:" + f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches" + " the batch size of `prompt`." 
+ ) + else: + # detect language and translate it if necessary + negative_prompt_languages = detect_language(self.language_detection_pipeline, negative_prompt, batch_size) + for index in range(batch_size): + if negative_prompt_languages[index] != "en": + p = translate_prompt(prompt=negative_prompt[index], + translation_tokenizer = self.translation_tokenizer, + translation_model = self.translation_model, + device=self.device) + negative_prompt[index] = p + uncond_tokens = negative_prompt + + max_length = text_input_ids.shape[-1] + uncond_input = self.tokenizer( + uncond_tokens, + padding="max_length", + max_length=max_length, + truncation=True, + return_tensors="pt", + ) + uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0] + + # duplicate unconditional embeddings for each generation per prompt, using mps friendly method + seq_len = uncond_embeddings.shape[1] + uncond_embeddings = uncond_embeddings.repeat(batch_size, num_images_per_prompt, 1) + uncond_embeddings = uncond_embeddings.view(batch_size * num_images_per_prompt, seq_len, -1) + + # For classifier free guidance, we need to do two forward passes. + # Here we concatenate the unconditional and text embeddings into a single batch + # to avoid doing two forward passes + text_embeddings = torch.cat([uncond_embeddings, text_embeddings]) + + # get the initial random noise unless the user supplied it + + # Unlike in other pipelines, latents need to be generated in the target device + # for 1-to-1 results reproducibility with the CompVis implementation. + # However this currently doesn't work in `mps`. + latents_shape = (batch_size * num_images_per_prompt, self.unet.in_channels, height // 8, width // 8) + latents_dtype = text_embeddings.dtype + if latents is None: + if self.device.type == "mps": + # randn does not exist on mps + latents = torch.randn(latents_shape, generator=generator, device="cpu", dtype=latents_dtype).to( + self.device + ) + else: + latents = torch.randn(latents_shape, generator=generator, device=self.device, dtype=latents_dtype) + else: + if latents.shape != latents_shape: + raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {latents_shape}") + latents = latents.to(self.device) + + # set timesteps + self.scheduler.set_timesteps(num_inference_steps) + + # Some schedulers like PNDM have timesteps as arrays + # It's more optimized to move all timesteps to correct device beforehand + timesteps_tensor = self.scheduler.timesteps.to(self.device) + + # scale the initial noise by the standard deviation required by the scheduler + latents = latents * self.scheduler.init_noise_sigma + + # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature + # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers. 
+ # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502 + # and should be between [0, 1] + accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys()) + extra_step_kwargs = {} + if accepts_eta: + extra_step_kwargs["eta"] = eta + + for i, t in enumerate(self.progress_bar(timesteps_tensor)): + # expand the latents if we are doing classifier free guidance + latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents + latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) + + # predict the noise residual + noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample + + # perform guidance + if do_classifier_free_guidance: + noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) + noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) + + # compute the previous noisy sample x_t -> x_t-1 + latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample + + # call the callback, if provided + if callback is not None and i % callback_steps == 0: + callback(i, t, latents) + + latents = 1 / 0.18215 * latents + image = self.vae.decode(latents).sample + + image = (image / 2 + 0.5).clamp(0, 1) + + # we always cast to float32 as this does not cause significant overhead and is compatible with bfloa16 + image = image.cpu().permute(0, 2, 3, 1).float().numpy() + + if output_type == "pil": + image = self.numpy_to_pil(image) + + if not return_dict: + return image + + return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=None) \ No newline at end of file diff --git a/examples/community/seed_resize_stable_diffusion.py b/examples/community/seed_resize_stable_diffusion.py new file mode 100644 index 000000000000..f912663a687d --- /dev/null +++ b/examples/community/seed_resize_stable_diffusion.py @@ -0,0 +1,366 @@ +""" + modified based on diffusion library from Huggingface: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py +""" +import inspect +from typing import Callable, List, Optional, Union + +import torch + +from diffusers.models import AutoencoderKL, UNet2DConditionModel +from diffusers.pipeline_utils import DiffusionPipeline +from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput +from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker +from diffusers.schedulers import DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler +from diffusers.utils import logging +from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer + + +logger = logging.get_logger(__name__) # pylint: disable=invalid-name + + +class SeedResizeStableDiffusionPipeline(DiffusionPipeline): + r""" + Pipeline for text-to-image generation using Stable Diffusion. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the + library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + + Args: + vae ([`AutoencoderKL`]): + Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + text_encoder ([`CLIPTextModel`]): + Frozen text-encoder. 
Stable Diffusion uses the text portion of + [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically + the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + tokenizer (`CLIPTokenizer`): + Tokenizer of class + [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + scheduler ([`SchedulerMixin`]): + A scheduler to be used in combination with `unet` to denoise the encoded image latens. Can be one of + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + safety_checker ([`StableDiffusionSafetyChecker`]): + Classification module that estimates whether generated images could be considered offensive or harmful. + Please, refer to the [model card](https://huggingface.co/CompVis/stable-diffusion-v1-4) for details. + feature_extractor ([`CLIPFeatureExtractor`]): + Model that extracts features from generated images to be used as inputs for the `safety_checker`. + """ + + def __init__( + self, + vae: AutoencoderKL, + text_encoder: CLIPTextModel, + tokenizer: CLIPTokenizer, + unet: UNet2DConditionModel, + scheduler: Union[DDIMScheduler, PNDMScheduler, LMSDiscreteScheduler], + safety_checker: StableDiffusionSafetyChecker, + feature_extractor: CLIPFeatureExtractor, + ): + super().__init__() + self.register_modules( + vae=vae, + text_encoder=text_encoder, + tokenizer=tokenizer, + unet=unet, + scheduler=scheduler, + safety_checker=safety_checker, + feature_extractor=feature_extractor, + ) + + def enable_attention_slicing(self, slice_size: Optional[Union[str, int]] = "auto"): + r""" + Enable sliced attention computation. + + When this option is enabled, the attention module will split the input tensor in slices, to compute attention + in several steps. This is useful to save some memory in exchange for a small speed decrease. + + Args: + slice_size (`str` or `int`, *optional*, defaults to `"auto"`): + When `"auto"`, halves the input to the attention heads, so attention will be computed in two steps. If + a number is provided, uses as many slices as `attention_head_dim // slice_size`. In this case, + `attention_head_dim` must be a multiple of `slice_size`. + """ + if slice_size == "auto": + # half the attention head size is usually a good trade-off between + # speed and memory + slice_size = self.unet.config.attention_head_dim // 2 + self.unet.set_attention_slice(slice_size) + + def disable_attention_slicing(self): + r""" + Disable sliced attention computation. If `enable_attention_slicing` was previously invoked, this method will go + back to computing attention in one step. 
+ """ + # set slice_size = `None` to disable `attention slicing` + self.enable_attention_slicing(None) + + @torch.no_grad() + def __call__( + self, + prompt: Union[str, List[str]], + height: int = 512, + width: int = 512, + num_inference_steps: int = 50, + guidance_scale: float = 7.5, + negative_prompt: Optional[Union[str, List[str]]] = None, + num_images_per_prompt: Optional[int] = 1, + eta: float = 0.0, + generator: Optional[torch.Generator] = None, + latents: Optional[torch.FloatTensor] = None, + output_type: Optional[str] = "pil", + return_dict: bool = True, + callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None, + callback_steps: Optional[int] = 1, + text_embeddings: Optional[torch.FloatTensor] = None, + **kwargs, + ): + r""" + Function invoked when calling the pipeline for generation. + + Args: + prompt (`str` or `List[str]`): + The prompt or prompts to guide the image generation. + height (`int`, *optional*, defaults to 512): + The height in pixels of the generated image. + width (`int`, *optional*, defaults to 512): + The width in pixels of the generated image. + num_inference_steps (`int`, *optional*, defaults to 50): + The number of denoising steps. More denoising steps usually lead to a higher quality image at the + expense of slower inference. + guidance_scale (`float`, *optional*, defaults to 7.5): + Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). + `guidance_scale` is defined as `w` of equation 2. of [Imagen + Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > + 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, + usually at the expense of lower image quality. + negative_prompt (`str` or `List[str]`, *optional*): + The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored + if `guidance_scale` is less than `1`). + num_images_per_prompt (`int`, *optional*, defaults to 1): + The number of images to generate per prompt. + eta (`float`, *optional*, defaults to 0.0): + Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to + [`schedulers.DDIMScheduler`], will be ignored for others. + generator (`torch.Generator`, *optional*): + A [torch generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation + deterministic. + latents (`torch.FloatTensor`, *optional*): + Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + generation. Can be used to tweak the same generation with different prompts. If not provided, a latents + tensor will ge generated by sampling using the supplied random `generator`. + output_type (`str`, *optional*, defaults to `"pil"`): + The output format of the generate image. Choose between + [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + return_dict (`bool`, *optional*, defaults to `True`): + Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a + plain tuple. + callback (`Callable`, *optional*): + A function that will be called every `callback_steps` steps during inference. The function will be + called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + callback_steps (`int`, *optional*, defaults to 1): + The frequency at which the `callback` function will be called. 
If not specified, the callback will be + called at every step. + + Returns: + [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: + [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. + When returning a tuple, the first element is a list with the generated images, and the second element is a + list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" + (nsfw) content, according to the `safety_checker`. + """ + + if isinstance(prompt, str): + batch_size = 1 + elif isinstance(prompt, list): + batch_size = len(prompt) + else: + raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}") + + if height % 8 != 0 or width % 8 != 0: + raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.") + + if (callback_steps is None) or ( + callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0) + ): + raise ValueError( + f"`callback_steps` has to be a positive integer but is {callback_steps} of type" + f" {type(callback_steps)}." + ) + + # get prompt text embeddings + text_inputs = self.tokenizer( + prompt, + padding="max_length", + max_length=self.tokenizer.model_max_length, + return_tensors="pt", + ) + text_input_ids = text_inputs.input_ids + + if text_input_ids.shape[-1] > self.tokenizer.model_max_length: + removed_text = self.tokenizer.batch_decode(text_input_ids[:, self.tokenizer.model_max_length :]) + logger.warning( + "The following part of your input was truncated because CLIP can only handle sequences up to" + f" {self.tokenizer.model_max_length} tokens: {removed_text}" + ) + text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length] + + if text_embeddings is None: + text_embeddings = self.text_encoder(text_input_ids.to(self.device))[0] + + # duplicate text embeddings for each generation per prompt, using mps friendly method + bs_embed, seq_len, _ = text_embeddings.shape + text_embeddings = text_embeddings.repeat(1, num_images_per_prompt, 1) + text_embeddings = text_embeddings.view(bs_embed * num_images_per_prompt, seq_len, -1) + + # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) + # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` + # corresponds to doing no classifier free guidance. + do_classifier_free_guidance = guidance_scale > 1.0 + # get unconditional embeddings for classifier free guidance + if do_classifier_free_guidance: + uncond_tokens: List[str] + if negative_prompt is None: + uncond_tokens = [""] + elif type(prompt) is not type(negative_prompt): + raise TypeError( + f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !=" + f" {type(prompt)}." + ) + elif isinstance(negative_prompt, str): + uncond_tokens = [negative_prompt] + elif batch_size != len(negative_prompt): + raise ValueError( + f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:" + f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches" + " the batch size of `prompt`." 
+ ) + else: + uncond_tokens = negative_prompt + + max_length = text_input_ids.shape[-1] + uncond_input = self.tokenizer( + uncond_tokens, + padding="max_length", + max_length=max_length, + truncation=True, + return_tensors="pt", + ) + uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0] + + # duplicate unconditional embeddings for each generation per prompt, using mps friendly method + seq_len = uncond_embeddings.shape[1] + uncond_embeddings = uncond_embeddings.repeat(batch_size, num_images_per_prompt, 1) + uncond_embeddings = uncond_embeddings.view(batch_size * num_images_per_prompt, seq_len, -1) + + # For classifier free guidance, we need to do two forward passes. + # Here we concatenate the unconditional and text embeddings into a single batch + # to avoid doing two forward passes + text_embeddings = torch.cat([uncond_embeddings, text_embeddings]) + + # get the initial random noise unless the user supplied it + + # Unlike in other pipelines, latents need to be generated in the target device + # for 1-to-1 results reproducibility with the CompVis implementation. + # However this currently doesn't work in `mps`. + latents_shape = (batch_size * num_images_per_prompt, self.unet.in_channels, height // 8, width // 8) + latents_shape_reference = (batch_size * num_images_per_prompt, self.unet.in_channels, 64, 64) + latents_dtype = text_embeddings.dtype + if latents is None: + if self.device.type == "mps": + # randn does not exist on mps + latents_reference = torch.randn( + latents_shape_reference, generator=generator, device="cpu", dtype=latents_dtype + ).to(self.device) + latents = torch.randn(latents_shape, generator=generator, device="cpu", dtype=latents_dtype).to( + self.device + ) + else: + latents_reference = torch.randn( + latents_shape_reference, generator=generator, device=self.device, dtype=latents_dtype + ) + latents = torch.randn(latents_shape, generator=generator, device=self.device, dtype=latents_dtype) + else: + if latents_reference.shape != latents_shape: + raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {latents_shape}") + latents_reference = latents_reference.to(self.device) + latents = latents.to(self.device) + + # This is the key part of the pipeline where we + # try to ensure that the generated images w/ the same seed + # but different sizes actually result in similar images + dx = (latents_shape[3] - latents_shape_reference[3]) // 2 + dy = (latents_shape[2] - latents_shape_reference[2]) // 2 + w = latents_shape_reference[3] if dx >= 0 else latents_shape_reference[3] + 2 * dx + h = latents_shape_reference[2] if dy >= 0 else latents_shape_reference[2] + 2 * dy + tx = 0 if dx < 0 else dx + ty = 0 if dy < 0 else dy + dx = max(-dx, 0) + dy = max(-dy, 0) + # import pdb + # pdb.set_trace() + latents[:, :, ty : ty + h, tx : tx + w] = latents_reference[:, :, dy : dy + h, dx : dx + w] + + # set timesteps + self.scheduler.set_timesteps(num_inference_steps) + + # Some schedulers like PNDM have timesteps as arrays + # It's more optimized to move all timesteps to correct device beforehand + timesteps_tensor = self.scheduler.timesteps.to(self.device) + + # scale the initial noise by the standard deviation required by the scheduler + latents = latents * self.scheduler.init_noise_sigma + + # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature + # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers. 
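The latent-alignment block a few lines above centre-pastes a fixed 64x64 reference latent into the latent tensor for the requested resolution, so the same seed produces a similar composition at different image sizes. A minimal standalone sketch of that centre paste, assuming `(B, C, H, W)` latents (the helper name and example shapes are illustrative, not part of the pipeline):

```python
import torch

def paste_reference_latents(latents: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    # latents:   (B, C, H, W) noise for the requested size
    # reference: (B, C, 64, 64) noise drawn from the same seed at the canonical 512x512 size
    dx = (latents.shape[3] - reference.shape[3]) // 2
    dy = (latents.shape[2] - reference.shape[2]) // 2
    # extent of the overlapping region (the reference may be larger than the target)
    w = reference.shape[3] if dx >= 0 else reference.shape[3] + 2 * dx
    h = reference.shape[2] if dy >= 0 else reference.shape[2] + 2 * dy
    tx, ty = max(dx, 0), max(dy, 0)    # offsets into the target latents
    sx, sy = max(-dx, 0), max(-dy, 0)  # offsets into the reference latents
    latents[:, :, ty : ty + h, tx : tx + w] = reference[:, :, sy : sy + h, sx : sx + w]
    return latents

# A 768x512 image uses 64x96 latents; the shared 64x64 block ends up horizontally centred.
latents = paste_reference_latents(torch.randn(1, 4, 64, 96), torch.randn(1, 4, 64, 64))
```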
+ # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502 + # and should be between [0, 1] + accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys()) + extra_step_kwargs = {} + if accepts_eta: + extra_step_kwargs["eta"] = eta + + for i, t in enumerate(self.progress_bar(timesteps_tensor)): + # expand the latents if we are doing classifier free guidance + latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents + latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) + + # predict the noise residual + noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample + + # perform guidance + if do_classifier_free_guidance: + noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) + noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) + + # compute the previous noisy sample x_t -> x_t-1 + latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample + + # call the callback, if provided + if callback is not None and i % callback_steps == 0: + callback(i, t, latents) + + latents = 1 / 0.18215 * latents + image = self.vae.decode(latents).sample + + image = (image / 2 + 0.5).clamp(0, 1) + + # we always cast to float32 as this does not cause significant overhead and is compatible with bfloa16 + image = image.cpu().permute(0, 2, 3, 1).float().numpy() + + if self.safety_checker is not None: + safety_checker_input = self.feature_extractor(self.numpy_to_pil(image), return_tensors="pt").to( + self.device + ) + image, has_nsfw_concept = self.safety_checker( + images=image, clip_input=safety_checker_input.pixel_values.to(text_embeddings.dtype) + ) + else: + has_nsfw_concept = None + + if output_type == "pil": + image = self.numpy_to_pil(image) + + if not return_dict: + return (image, has_nsfw_concept) + + return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept) diff --git a/examples/text_to_image/README.md b/examples/text_to_image/README.md index f7c8cdcb4ccb..170ed384f13c 100644 --- a/examples/text_to_image/README.md +++ b/examples/text_to_image/README.md @@ -4,7 +4,7 @@ The `train_text_to_image.py` script shows how to fine-tune stable diffusion mode ___Note___: -___This script is experimental. The script fine-tunes the whole model and often times the model overifits and runs into issues like catastrophic forgetting. It's recommended to try different hyperparamters to get the best result on your dataset.___ +___This script is experimental. The script fine-tunes the whole model and often times the model overfits and runs into issues like catastrophic forgetting. 
It's recommended to try different hyperparamters to get the best result on your dataset.___ ## Running locally with PyTorch diff --git a/examples/text_to_image/train_text_to_image_flax.py b/examples/text_to_image/train_text_to_image_flax.py index ab8d2b7ee2a3..cacfacef498b 100644 --- a/examples/text_to_image/train_text_to_image_flax.py +++ b/examples/text_to_image/train_text_to_image_flax.py @@ -371,11 +371,11 @@ def collate_fn(examples): train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=total_train_batch_size, drop_last=True ) - weight_dtype = torch.float32 + weight_dtype = jnp.float32 if args.mixed_precision == "fp16": - weight_dtype = torch.float16 + weight_dtype = jnp.float16 elif args.mixed_precision == "bf16": - weight_dtype = torch.bfloat16 + weight_dtype = jnp.bfloat16 # Load models and create wrapper for stable diffusion tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer") diff --git a/setup.py b/setup.py index b4c1613ba115..6f0742e83ee5 100644 --- a/setup.py +++ b/setup.py @@ -94,6 +94,7 @@ "modelcards>=0.1.4", "numpy", "onnxruntime", + "parameterized", "pytest", "pytest-timeout", "pytest-xdist", @@ -181,6 +182,7 @@ def run(self): "accelerate", "datasets", "onnxruntime", + "parameterized", "pytest", "pytest-timeout", "pytest-xdist", diff --git a/src/diffusers/dependency_versions_table.py b/src/diffusers/dependency_versions_table.py index 8b10d70a26f7..64e55e932c1c 100644 --- a/src/diffusers/dependency_versions_table.py +++ b/src/diffusers/dependency_versions_table.py @@ -18,6 +18,7 @@ "modelcards": "modelcards>=0.1.4", "numpy": "numpy", "onnxruntime": "onnxruntime", + "parameterized": "parameterized", "pytest": "pytest", "pytest-timeout": "pytest-timeout", "pytest-xdist": "pytest-xdist", diff --git a/src/diffusers/models/README.md b/src/diffusers/models/README.md index e786fe518f03..80fe0bc38140 100644 --- a/src/diffusers/models/README.md +++ b/src/diffusers/models/README.md @@ -1,11 +1,3 @@ # Models -- Models: Neural network that models $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ (see image below) and is trained end-to-end to denoise a noisy input to an image. Examples: UNet, Conditioned UNet, 3D UNet, Transformer UNet - -## API - -TODO(Suraj, Patrick) - -## Examples - -TODO(Suraj, Patrick) +For more detail on the models, please refer to the [docs](https://huggingface.co/docs/diffusers/api/models). \ No newline at end of file diff --git a/src/diffusers/models/resnet.py b/src/diffusers/models/resnet.py index fbd78b512a6b..7bb5416adf24 100644 --- a/src/diffusers/models/resnet.py +++ b/src/diffusers/models/resnet.py @@ -48,6 +48,10 @@ def forward(self, hidden_states, output_size=None): if dtype == torch.bfloat16: hidden_states = hidden_states.to(torch.float32) + # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984 + if hidden_states.shape[0] >= 64: + hidden_states = hidden_states.contiguous() + # if `output_size` is passed we force the interpolation output # size and do not make use of `scale_factor=2` if output_size is None: @@ -376,6 +380,10 @@ def forward(self, input_tensor, temb): hidden_states = self.nonlinearity(hidden_states) if self.upsample is not None: + # upsample_nearest_nhwc fails with large batch sizes. 
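The workaround applied in these `resnet.py` hunks simply forces a contiguous layout before the nearest-neighbour upsample once the batch dimension gets large. A minimal sketch of the same pattern in isolation, assuming the 64-sample threshold used by the patch (the helper name and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def safe_upsample_nearest(hidden_states: torch.Tensor) -> torch.Tensor:
    # upsample_nearest_nhwc can fail for channels_last inputs with large batches
    # (https://github.com/huggingface/diffusers/issues/984), so fall back to a
    # contiguous layout before interpolating.
    if hidden_states.shape[0] >= 64:
        hidden_states = hidden_states.contiguous()
    return F.interpolate(hidden_states, scale_factor=2.0, mode="nearest")

x = torch.randn(64, 320, 32, 32).to(memory_format=torch.channels_last)
y = safe_upsample_nearest(x)  # (64, 320, 64, 64)
```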
see https://github.com/huggingface/diffusers/issues/984 + if hidden_states.shape[0] >= 64: + input_tensor = input_tensor.contiguous() + hidden_states = hidden_states.contiguous() input_tensor = self.upsample(input_tensor) hidden_states = self.upsample(hidden_states) elif self.downsample is not None: diff --git a/src/diffusers/pipeline_flax_utils.py b/src/diffusers/pipeline_flax_utils.py index 3c23693b40ef..e96c0c7467f3 100644 --- a/src/diffusers/pipeline_flax_utils.py +++ b/src/diffusers/pipeline_flax_utils.py @@ -444,7 +444,11 @@ def numpy_to_pil(images): if images.ndim == 3: images = images[None, ...] images = (images * 255).round().astype("uint8") - pil_images = [Image.fromarray(image) for image in images] + if images.shape[-1] == 1: + # special case for grayscale (single channel) images + pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images] + else: + pil_images = [Image.fromarray(image) for image in images] return pil_images diff --git a/src/diffusers/pipeline_utils.py b/src/diffusers/pipeline_utils.py index c9c58a748831..c0a44363a2f9 100644 --- a/src/diffusers/pipeline_utils.py +++ b/src/diffusers/pipeline_utils.py @@ -625,7 +625,11 @@ def numpy_to_pil(images): if images.ndim == 3: images = images[None, ...] images = (images * 255).round().astype("uint8") - pil_images = [Image.fromarray(image) for image in images] + if images.shape[-1] == 1: + # special case for grayscale (single channel) images + pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images] + else: + pil_images = [Image.fromarray(image) for image in images] return pil_images diff --git a/src/diffusers/pipelines/README.md b/src/diffusers/pipelines/README.md index 66d3a8815d4e..86048eb5a052 100644 --- a/src/diffusers/pipelines/README.md +++ b/src/diffusers/pipelines/README.md @@ -30,19 +30,20 @@ If you are looking for *official* training examples, please have a look at [exam The following table summarizes all officially supported pipelines, their corresponding paper, and if available a colab notebook to directly try them out. 
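The `numpy_to_pil` helpers above gain a grayscale branch so single-channel arrays round-trip to PIL correctly. A small standalone illustration of that conversion (the function is reproduced in simplified form; input shapes are illustrative):

```python
import numpy as np
from PIL import Image

def numpy_to_pil(images: np.ndarray):
    # images: (N, H, W, C) floats in [0, 1]
    if images.ndim == 3:
        images = images[None, ...]
    images = (images * 255).round().astype("uint8")
    if images.shape[-1] == 1:
        # single-channel arrays are squeezed to (H, W) and converted as mode "L"
        return [Image.fromarray(img.squeeze(), mode="L") for img in images]
    return [Image.fromarray(img) for img in images]

grayscale = np.random.rand(2, 64, 64, 1)  # two single-channel "images"
pil_images = numpy_to_pil(grayscale)      # two 64x64 PIL images in mode "L"
```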
-| Pipeline | Paper | Tasks | Colab -|---|---|:---:|:---:| -| [ddpm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | *Unconditional Image Generation* | -| [ddim](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | *Unconditional Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) -| [latent_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| *Text-to-Image Generation* | -| [latent_diffusion_uncond](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | *Unconditional Image Generation* | -| [pndm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | *Unconditional Image Generation* | -| [score_sde_ve](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | *Unconditional Image Generation* | -| [score_sde_vp](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | *Unconditional Image Generation* | -| [stable_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) -| [stable_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | *Image-to-Image Text-Guided Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb) -| [stable_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | *Text-Guided Image Inpainting* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) -| [stochastic_karras_ve](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | *Unconditional Image Generation* | +| Pipeline | Source | Tasks | 
Colab +|-------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|:---:|:---:| +| [dance diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/dance_diffusion) | [**Dance Diffusion**](https://github.com/Harmonai-org/sample-generator) | *Unconditional Audio Generation* | +| [ddpm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | *Unconditional Image Generation* | +| [ddim](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | *Unconditional Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) +| [latent_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | *Text-to-Image Generation* | +| [latent_diffusion_uncond](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | *Unconditional Image Generation* | +| [pndm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | *Unconditional Image Generation* | +| [score_sde_ve](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | *Unconditional Image Generation* | +| [score_sde_vp](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | *Unconditional Image Generation* | +| [stable_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) +| [stable_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | *Image-to-Image Text-Guided Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb) +| [stable_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | *Text-Guided Image Inpainting* | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) +| [stochastic_karras_ve](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | *Unconditional Image Generation* | **Note**: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers. However, most of them can be adapted to use different scheduler components or even different model components. Some pipeline examples are shown in the [Examples](#examples) below. diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py index 62a5785beb2f..d8948862845d 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py @@ -662,6 +662,8 @@ def custom_forward(*inputs): class LDMBertModel(LDMBertPreTrainedModel): + _no_split_modules = [] + def __init__(self, config: LDMBertConfig): super().__init__(config) self.model = LDMBertEncoder(config) diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py index e80cc1360b33..2a33074a2473 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py @@ -208,7 +208,6 @@ def __call__( list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the `safety_checker`. """ - if isinstance(prompt, str): batch_size = 1 elif isinstance(prompt, list): diff --git a/src/diffusers/schedulers/README.md b/src/diffusers/schedulers/README.md index 6a01c503a909..9494e357fd43 100644 --- a/src/diffusers/schedulers/README.md +++ b/src/diffusers/schedulers/README.md @@ -1,17 +1,3 @@ # Schedulers -- Schedulers are the algorithms to use diffusion models in inference as well as for training. They include the noise schedules and define algorithm-specific diffusion steps. -- Schedulers can be used interchangeable between diffusion models in inference to find the preferred trade-off between speed and generation quality. -- Schedulers are available in PyTorch and Jax. - -## API - -- Schedulers should provide one or more `def step(...)` functions that should be called iteratively to unroll the diffusion loop during -the forward pass. -- Schedulers should be framework specific. - -## Examples - -- The DDPM scheduler was proposed in [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) and can be found in [scheduling_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddpm.py). An example of how to use this scheduler can be found in [pipeline_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddpm/pipeline_ddpm.py). -- The DDIM scheduler was proposed in [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) and can be found in [scheduling_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddim.py). 
An example of how to use this scheduler can be found in [pipeline_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddim/pipeline_ddim.py). -- The PNDM scheduler was proposed in [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) and can be found in [scheduling_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py). An example of how to use this scheduler can be found in [pipeline_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pndm/pipeline_pndm.py). +For more information on the schedulers, please refer to the [docs](https://huggingface.co/docs/diffusers/api/schedulers). \ No newline at end of file diff --git a/src/diffusers/utils/__init__.py b/src/diffusers/utils/__init__.py index 51798e2ad765..12d731128385 100644 --- a/src/diffusers/utils/__init__.py +++ b/src/diffusers/utils/__init__.py @@ -40,7 +40,16 @@ if is_torch_available(): - from .testing_utils import floats_tensor, load_image, parse_flag_from_env, slow, torch_device + from .testing_utils import ( + floats_tensor, + load_image, + load_numpy, + parse_flag_from_env, + require_torch_gpu, + slow, + torch_all_close, + torch_device, + ) logger = get_logger(__name__) diff --git a/src/diffusers/utils/testing_utils.py b/src/diffusers/utils/testing_utils.py index 700faf209926..bd3b08d54a1c 100644 --- a/src/diffusers/utils/testing_utils.py +++ b/src/diffusers/utils/testing_utils.py @@ -4,11 +4,14 @@ import random import re import unittest +import urllib.parse from distutils.util import strtobool -from io import StringIO +from io import BytesIO, StringIO from pathlib import Path from typing import Union +import numpy as np + import PIL.Image import PIL.ImageOps import requests @@ -34,6 +37,14 @@ torch_device = "mps" if (mps_backend_registered and torch.backends.mps.is_available()) else torch_device +def torch_all_close(a, b, *args, **kwargs): + if not is_torch_available(): + raise ValueError("PyTorch needs to be installed to use this function.") + if not torch.allclose(a, b, *args, **kwargs): + assert False, f"Max diff is absolute {(a - b).abs().max()}. Diff tensor is {(a - b).abs()}." 
+ return True + + def get_tests_dir(append_path=None): """ Args: @@ -157,6 +168,19 @@ def load_image(image: Union[str, PIL.Image.Image]) -> PIL.Image.Image: return image +def load_numpy(path) -> np.ndarray: + if not path.startswith("http://") or path.startswith("https://"): + path = os.path.join( + "https://huggingface.co/datasets/fusing/diffusers-testing/resolve/main", urllib.parse.quote(path) + ) + + response = requests.get(path) + response.raise_for_status() + array = np.load(BytesIO(response.content)) + + return array + + # --- pytest conf functions --- # # to avoid multiple invocation from tests/conftest.py and examples/conftest.py - make sure it's called only once diff --git a/tests/models/__init__.py b/tests/models/__init__.py new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/tests/test_models_unet_1d.py b/tests/models/test_models_unet_1d.py similarity index 98% rename from tests/test_models_unet_1d.py rename to tests/models/test_models_unet_1d.py index c274ce419211..286c7525e2ca 100644 --- a/tests/test_models_unet_1d.py +++ b/tests/models/test_models_unet_1d.py @@ -28,7 +28,7 @@ class UnetModel1DTests(unittest.TestCase): @slow def test_unet_1d_maestro(self): model_id = "harmonai/maestro-150k" - model = UNet1DModel.from_pretrained(model_id, subfolder="unet") + model = UNet1DModel.from_pretrained(model_id, subfolder="unet", device_map="auto") model.to(torch_device) sample_size = 65536 diff --git a/tests/test_models_unet_2d.py b/tests/models/test_models_unet_2d.py similarity index 58% rename from tests/test_models_unet_2d.py rename to tests/models/test_models_unet_2d.py index 393d47887e6d..548588918c88 100644 --- a/tests/test_models_unet_2d.py +++ b/tests/models/test_models_unet_2d.py @@ -21,11 +21,13 @@ import torch from diffusers import UNet2DConditionModel, UNet2DModel -from diffusers.utils import floats_tensor, slow, torch_device +from diffusers.utils import floats_tensor, load_numpy, logging, require_torch_gpu, slow, torch_all_close, torch_device +from parameterized import parameterized -from .test_modeling_common import ModelTesterMixin +from ..test_modeling_common import ModelTesterMixin +logger = logging.get_logger(__name__) torch.backends.cuda.matmul.allow_tf32 = False @@ -66,28 +68,6 @@ def prepare_init_args_and_inputs_for_common(self): return init_dict, inputs_dict -# TODO(Patrick) - Re-add this test after having correctly added the final VE checkpoints -# def test_output_pretrained(self): -# model = UNet2DModel.from_pretrained("fusing/ddpm_dummy_update", subfolder="unet") -# model.eval() -# -# torch.manual_seed(0) -# if torch.cuda.is_available(): -# torch.cuda.manual_seed_all(0) -# -# noise = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) -# time_step = torch.tensor([10]) -# -# with torch.no_grad(): -# output = model(noise, time_step).sample -# -# output_slice = output[0, -1, -3:, -3:].flatten() -# fmt: off -# expected_output_slice = torch.tensor([0.2891, -0.1899, 0.2595, -0.6214, 0.0968, -0.2622, 0.4688, 0.1311, 0.0053]) -# fmt: on -# self.assertTrue(torch.allclose(output_slice, expected_output_slice, rtol=1e-2)) - - class UNetLDMModelTests(ModelTesterMixin, unittest.TestCase): model_class = UNet2DModel @@ -170,12 +150,14 @@ def test_from_pretrained_accelerate_wont_change_results(self): torch.cuda.empty_cache() gc.collect() - model_normal_load, _ = UNet2DModel.from_pretrained("fusing/unet-ldm-dummy-update", output_loading_info=True) + model_normal_load, _ = UNet2DModel.from_pretrained( + 
"fusing/unet-ldm-dummy-update", output_loading_info=True, device_map="auto" + ) model_normal_load.to(torch_device) model_normal_load.eval() arr_normal_load = model_normal_load(noise, time_step)["sample"] - assert torch.allclose(arr_accelerate, arr_normal_load, rtol=1e-3) + assert torch_all_close(arr_accelerate, arr_normal_load, rtol=1e-3) @unittest.skipIf(torch_device != "cuda", "This test is supposed to run on GPU") def test_memory_footprint_gets_reduced(self): @@ -226,7 +208,7 @@ def test_output_pretrained(self): expected_output_slice = torch.tensor([-13.3258, -20.1100, -15.9873, -17.6617, -23.0596, -17.9419, -13.3675, -16.1889, -12.3800]) # fmt: on - self.assertTrue(torch.allclose(output_slice, expected_output_slice, rtol=1e-3)) + self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-3)) class UNet2DConditionModelTests(ModelTesterMixin, unittest.TestCase): @@ -306,32 +288,7 @@ def test_gradient_checkpointing(self): named_params = dict(model.named_parameters()) named_params_2 = dict(model_2.named_parameters()) for name, param in named_params.items(): - self.assertTrue(torch.allclose(param.grad.data, named_params_2[name].grad.data, atol=5e-5)) - - -# TODO(Patrick) - Re-add this test after having cleaned up LDM -# def test_output_pretrained_spatial_transformer(self): -# model = UNetLDMModel.from_pretrained("fusing/unet-ldm-dummy-spatial") -# model.eval() -# -# torch.manual_seed(0) -# if torch.cuda.is_available(): -# torch.cuda.manual_seed_all(0) -# -# noise = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) -# context = torch.ones((1, 16, 64), dtype=torch.float32) -# time_step = torch.tensor([10] * noise.shape[0]) -# -# with torch.no_grad(): -# output = model(noise, time_step, context=context) -# -# output_slice = output[0, -1, -3:, -3:].flatten() -# fmt: off -# expected_output_slice = torch.tensor([61.3445, 56.9005, 29.4339, 59.5497, 60.7375, 34.1719, 48.1951, 42.6569, 25.0890]) -# fmt: on -# -# self.assertTrue(torch.allclose(output_slice, expected_output_slice, atol=1e-3)) -# + self.assertTrue(torch_all_close(param.grad.data, named_params_2[name].grad.data, atol=5e-5)) class NCSNppModelTests(ModelTesterMixin, unittest.TestCase): @@ -383,7 +340,9 @@ def prepare_init_args_and_inputs_for_common(self): @slow def test_from_pretrained_hub(self): - model, loading_info = UNet2DModel.from_pretrained("google/ncsnpp-celebahq-256", output_loading_info=True) + model, loading_info = UNet2DModel.from_pretrained( + "google/ncsnpp-celebahq-256", output_loading_info=True, device_map="auto" + ) self.assertIsNotNone(model) self.assertEqual(len(loading_info["missing_keys"]), 0) @@ -397,7 +356,7 @@ def test_from_pretrained_hub(self): @slow def test_output_pretrained_ve_mid(self): - model = UNet2DModel.from_pretrained("google/ncsnpp-celebahq-256") + model = UNet2DModel.from_pretrained("google/ncsnpp-celebahq-256", device_map="auto") model.to(torch_device) torch.manual_seed(0) @@ -419,7 +378,7 @@ def test_output_pretrained_ve_mid(self): expected_output_slice = torch.tensor([-4836.2231, -6487.1387, -3816.7969, -7964.9253, -10966.2842, -20043.6016, 8137.0571, 2340.3499, 544.6114]) # fmt: on - self.assertTrue(torch.allclose(output_slice, expected_output_slice, rtol=1e-2)) + self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-2)) def test_output_pretrained_ve_large(self): model = UNet2DModel.from_pretrained("fusing/ncsnpp-ffhq-ve-dummy-update") @@ -444,8 +403,194 @@ def test_output_pretrained_ve_large(self): 
expected_output_slice = torch.tensor([-0.0325, -0.0900, -0.0869, -0.0332, -0.0725, -0.0270, -0.0101, 0.0227, 0.0256]) # fmt: on - self.assertTrue(torch.allclose(output_slice, expected_output_slice, rtol=1e-2)) + self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-2)) def test_forward_with_norm_groups(self): # not required for this model pass + + +@slow +class UNet2DConditionModelIntegrationTests(unittest.TestCase): + def get_file_format(self, seed, shape): + return f"gaussian_noise_s={seed}_shape={'_'.join([str(s) for s in shape])}.npy" + + def tearDown(self): + # clean up the VRAM after each test + super().tearDown() + gc.collect() + torch.cuda.empty_cache() + + def get_latents(self, seed=0, shape=(4, 4, 64, 64), fp16=False): + dtype = torch.float16 if fp16 else torch.float32 + image = torch.from_numpy(load_numpy(self.get_file_format(seed, shape))).to(torch_device).to(dtype) + return image + + def get_unet_model(self, fp16=False, model_id="CompVis/stable-diffusion-v1-4"): + revision = "fp16" if fp16 else None + torch_dtype = torch.float16 if fp16 else torch.float32 + + model = UNet2DConditionModel.from_pretrained( + model_id, subfolder="unet", torch_dtype=torch_dtype, revision=revision, device_map="auto" + ) + model.to(torch_device).eval() + + return model + + def get_encoder_hidden_states(self, seed=0, shape=(4, 77, 768), fp16=False): + dtype = torch.float16 if fp16 else torch.float32 + hidden_states = torch.from_numpy(load_numpy(self.get_file_format(seed, shape))).to(torch_device).to(dtype) + return hidden_states + + @parameterized.expand( + [ + # fmt: off + [33, 4, [-0.4424, 0.1510, -0.1937, 0.2118, 0.3746, -0.3957, 0.0160, -0.0435]], + [47, 0.55, [-0.1508, 0.0379, -0.3075, 0.2540, 0.3633, -0.0821, 0.1719, -0.0207]], + [21, 0.89, [-0.6479, 0.6364, -0.3464, 0.8697, 0.4443, -0.6289, -0.0091, 0.1778]], + [9, 1000, [0.8888, -0.5659, 0.5834, -0.7469, 1.1912, -0.3923, 1.1241, -0.4424]], + # fmt: on + ] + ) + def test_compvis_sd_v1_4(self, seed, timestep, expected_slice): + model = self.get_unet_model(model_id="CompVis/stable-diffusion-v1-4") + latents = self.get_latents(seed) + encoder_hidden_states = self.get_encoder_hidden_states(seed) + + with torch.no_grad(): + sample = model(latents, timestep=timestep, encoder_hidden_states=encoder_hidden_states).sample + + assert sample.shape == latents.shape + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-3) + + @parameterized.expand( + [ + # fmt: off + [83, 4, [-0.2323, -0.1304, 0.0813, -0.3093, -0.0919, -0.1571, -0.1125, -0.5806]], + [17, 0.55, [-0.0831, -0.2443, 0.0901, -0.0919, 0.3396, 0.0103, -0.3743, 0.0701]], + [8, 0.89, [-0.4863, 0.0859, 0.0875, -0.1658, 0.9199, -0.0114, 0.4839, 0.4639]], + [3, 1000, [-0.5649, 0.2402, -0.5518, 0.1248, 1.1328, -0.2443, -0.0325, -1.0078]], + # fmt: on + ] + ) + @require_torch_gpu + def test_compvis_sd_v1_4_fp16(self, seed, timestep, expected_slice): + model = self.get_unet_model(model_id="CompVis/stable-diffusion-v1-4", fp16=True) + latents = self.get_latents(seed, fp16=True) + encoder_hidden_states = self.get_encoder_hidden_states(seed, fp16=True) + + with torch.no_grad(): + sample = model(latents, timestep=timestep, encoder_hidden_states=encoder_hidden_states).sample + + assert sample.shape == latents.shape + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert 
torch_all_close(output_slice, expected_output_slice, atol=5e-3) + + @parameterized.expand( + [ + # fmt: off + [33, 4, [-0.4430, 0.1570, -0.1867, 0.2376, 0.3205, -0.3681, 0.0525, -0.0722]], + [47, 0.55, [-0.1415, 0.0129, -0.3136, 0.2257, 0.3430, -0.0536, 0.2114, -0.0436]], + [21, 0.89, [-0.7091, 0.6664, -0.3643, 0.9032, 0.4499, -0.6541, 0.0139, 0.1750]], + [9, 1000, [0.8878, -0.5659, 0.5844, -0.7442, 1.1883, -0.3927, 1.1192, -0.4423]], + # fmt: on + ] + ) + def test_compvis_sd_v1_5(self, seed, timestep, expected_slice): + model = self.get_unet_model(model_id="runwayml/stable-diffusion-v1-5") + latents = self.get_latents(seed) + encoder_hidden_states = self.get_encoder_hidden_states(seed) + + with torch.no_grad(): + sample = model(latents, timestep=timestep, encoder_hidden_states=encoder_hidden_states).sample + + assert sample.shape == latents.shape + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-3) + + @parameterized.expand( + [ + # fmt: off + [83, 4, [-0.2695, -0.1669, 0.0073, -0.3181, -0.1187, -0.1676, -0.1395, -0.5972]], + [17, 0.55, [-0.1290, -0.2588, 0.0551, -0.0916, 0.3286, 0.0238, -0.3669, 0.0322]], + [8, 0.89, [-0.5283, 0.1198, 0.0870, -0.1141, 0.9189, -0.0150, 0.5474, 0.4319]], + [3, 1000, [-0.5601, 0.2411, -0.5435, 0.1268, 1.1338, -0.2427, -0.0280, -1.0020]], + # fmt: on + ] + ) + @require_torch_gpu + def test_compvis_sd_v1_5_fp16(self, seed, timestep, expected_slice): + model = self.get_unet_model(model_id="runwayml/stable-diffusion-v1-5", fp16=True) + latents = self.get_latents(seed, fp16=True) + encoder_hidden_states = self.get_encoder_hidden_states(seed, fp16=True) + + with torch.no_grad(): + sample = model(latents, timestep=timestep, encoder_hidden_states=encoder_hidden_states).sample + + assert sample.shape == latents.shape + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=5e-3) + + @parameterized.expand( + [ + # fmt: off + [33, 4, [-0.7639, 0.0106, -0.1615, -0.3487, -0.0423, -0.7972, 0.0085, -0.4858]], + [47, 0.55, [-0.6564, 0.0795, -1.9026, -0.6258, 1.8235, 1.2056, 1.2169, 0.9073]], + [21, 0.89, [0.0327, 0.4399, -0.6358, 0.3417, 0.4120, -0.5621, -0.0397, -1.0430]], + [9, 1000, [0.1600, 0.7303, -1.0556, -0.3515, -0.7440, -1.2037, -1.8149, -1.8931]], + # fmt: on + ] + ) + def test_compvis_sd_inpaint(self, seed, timestep, expected_slice): + model = self.get_unet_model(model_id="runwayml/stable-diffusion-inpainting") + latents = self.get_latents(seed, shape=(4, 9, 64, 64)) + encoder_hidden_states = self.get_encoder_hidden_states(seed) + + with torch.no_grad(): + sample = model(latents, timestep=timestep, encoder_hidden_states=encoder_hidden_states).sample + + assert sample.shape == (4, 4, 64, 64) + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-3) + + @parameterized.expand( + [ + # fmt: off + [83, 4, [-0.1047, -1.7227, 0.1067, 0.0164, -0.5698, -0.4172, -0.1388, 1.1387]], + [17, 0.55, [0.0975, -0.2856, -0.3508, -0.4600, 0.3376, 0.2930, -0.2747, -0.7026]], + [8, 0.89, [-0.0952, 0.0183, -0.5825, -0.1981, 0.1131, 0.4668, -0.0395, -0.3486]], + [3, 1000, [0.4790, 0.4949, -1.0732, -0.7158, 0.7959, -0.9478, 0.1105, -0.9741]], + # fmt: on + ] + ) + 
@require_torch_gpu + def test_compvis_sd_inpaint_fp16(self, seed, timestep, expected_slice): + model = self.get_unet_model(model_id="runwayml/stable-diffusion-inpainting", fp16=True) + latents = self.get_latents(seed, shape=(4, 9, 64, 64), fp16=True) + encoder_hidden_states = self.get_encoder_hidden_states(seed, fp16=True) + + with torch.no_grad(): + sample = model(latents, timestep=timestep, encoder_hidden_states=encoder_hidden_states).sample + + assert sample.shape == (4, 4, 64, 64) + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=5e-3) diff --git a/tests/models/test_models_vae.py b/tests/models/test_models_vae.py new file mode 100644 index 000000000000..f6333d6cd906 --- /dev/null +++ b/tests/models/test_models_vae.py @@ -0,0 +1,303 @@ +# coding=utf-8 +# Copyright 2022 HuggingFace Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import gc +import unittest + +import torch + +from diffusers import AutoencoderKL +from diffusers.modeling_utils import ModelMixin +from diffusers.utils import floats_tensor, load_numpy, require_torch_gpu, slow, torch_all_close, torch_device +from parameterized import parameterized + +from ..test_modeling_common import ModelTesterMixin + + +torch.backends.cuda.matmul.allow_tf32 = False + + +class AutoencoderKLTests(ModelTesterMixin, unittest.TestCase): + model_class = AutoencoderKL + + @property + def dummy_input(self): + batch_size = 4 + num_channels = 3 + sizes = (32, 32) + + image = floats_tensor((batch_size, num_channels) + sizes).to(torch_device) + + return {"sample": image} + + @property + def input_shape(self): + return (3, 32, 32) + + @property + def output_shape(self): + return (3, 32, 32) + + def prepare_init_args_and_inputs_for_common(self): + init_dict = { + "block_out_channels": [32, 64], + "in_channels": 3, + "out_channels": 3, + "down_block_types": ["DownEncoderBlock2D", "DownEncoderBlock2D"], + "up_block_types": ["UpDecoderBlock2D", "UpDecoderBlock2D"], + "latent_channels": 4, + } + inputs_dict = self.dummy_input + return init_dict, inputs_dict + + def test_forward_signature(self): + pass + + def test_training(self): + pass + + def test_from_pretrained_hub(self): + model, loading_info = AutoencoderKL.from_pretrained("fusing/autoencoder-kl-dummy", output_loading_info=True) + self.assertIsNotNone(model) + self.assertEqual(len(loading_info["missing_keys"]), 0) + + model.to(torch_device) + image = model(**self.dummy_input) + + assert image is not None, "Make sure output is not None" + + def test_output_pretrained(self): + model = AutoencoderKL.from_pretrained("fusing/autoencoder-kl-dummy") + model = model.to(torch_device) + model.eval() + + # One-time warmup pass (see #372) + if torch_device == "mps" and isinstance(model, ModelMixin): + image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) + image = image.to(torch_device) + with torch.no_grad(): + _ = 
model(image, sample_posterior=True).sample + generator = torch.manual_seed(0) + else: + generator = torch.Generator(device=torch_device).manual_seed(0) + + image = torch.randn( + 1, + model.config.in_channels, + model.config.sample_size, + model.config.sample_size, + generator=torch.manual_seed(0), + ) + image = image.to(torch_device) + with torch.no_grad(): + output = model(image, sample_posterior=True, generator=generator).sample + + output_slice = output[0, -1, -3:, -3:].flatten().cpu() + + # Since the VAE Gaussian prior's generator is seeded on the appropriate device, + # the expected output slices are not the same for CPU and GPU. + if torch_device == "mps": + expected_output_slice = torch.tensor( + [ + -4.0078e-01, + -3.8323e-04, + -1.2681e-01, + -1.1462e-01, + 2.0095e-01, + 1.0893e-01, + -8.8247e-02, + -3.0361e-01, + -9.8644e-03, + ] + ) + elif torch_device == "cpu": + expected_output_slice = torch.tensor( + [-0.1352, 0.0878, 0.0419, -0.0818, -0.1069, 0.0688, -0.1458, -0.4446, -0.0026] + ) + else: + expected_output_slice = torch.tensor( + [-0.2421, 0.4642, 0.2507, -0.0438, 0.0682, 0.3160, -0.2018, -0.0727, 0.2485] + ) + + self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-2)) + + +@slow +class AutoencoderKLIntegrationTests(unittest.TestCase): + def get_file_format(self, seed, shape): + return f"gaussian_noise_s={seed}_shape={'_'.join([str(s) for s in shape])}.npy" + + def tearDown(self): + # clean up the VRAM after each test + super().tearDown() + gc.collect() + torch.cuda.empty_cache() + + def get_sd_image(self, seed=0, shape=(4, 3, 512, 512), fp16=False): + dtype = torch.float16 if fp16 else torch.float32 + image = torch.from_numpy(load_numpy(self.get_file_format(seed, shape))).to(torch_device).to(dtype) + return image + + def get_sd_vae_model(self, model_id="CompVis/stable-diffusion-v1-4", fp16=False): + revision = "fp16" if fp16 else None + torch_dtype = torch.float16 if fp16 else torch.float32 + + model = AutoencoderKL.from_pretrained( + model_id, subfolder="vae", torch_dtype=torch_dtype, revision=revision, device_map="auto" + ) + model.to(torch_device).eval() + + return model + + def get_generator(self, seed=0): + return torch.Generator(device=torch_device).manual_seed(seed) + + @parameterized.expand( + [ + # fmt: off + [33, [-0.1603, 0.9878, -0.0495, -0.0790, -0.2709, 0.8375, -0.2060, -0.0824]], + [47, [-0.2376, 0.1168, 0.1332, -0.4840, -0.2508, -0.0791, -0.0493, -0.4089]], + # fmt: on + ] + ) + def test_stable_diffusion(self, seed, expected_slice): + model = self.get_sd_vae_model() + image = self.get_sd_image(seed) + generator = self.get_generator(seed) + + with torch.no_grad(): + sample = model(image, generator=generator, sample_posterior=True).sample + + assert sample.shape == image.shape + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-3) + + @parameterized.expand( + [ + # fmt: off + [33, [-0.0513, 0.0289, 1.3799, 0.2166, -0.2573, -0.0871, 0.5103, -0.0999]], + [47, [-0.4128, -0.1320, -0.3704, 0.1965, -0.4116, -0.2332, -0.3340, 0.2247]], + # fmt: on + ] + ) + @require_torch_gpu + def test_stable_diffusion_fp16(self, seed, expected_slice): + model = self.get_sd_vae_model(fp16=True) + image = self.get_sd_image(seed, fp16=True) + generator = self.get_generator(seed) + + with torch.no_grad(): + sample = model(image, generator=generator, sample_posterior=True).sample + + assert sample.shape == image.shape + + 
output_slice = sample[-1, -2:, :2, -2:].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-2) + + @parameterized.expand( + [ + # fmt: off + [33, [-0.1609, 0.9866, -0.0487, -0.0777, -0.2716, 0.8368, -0.2055, -0.0814]], + [47, [-0.2377, 0.1147, 0.1333, -0.4841, -0.2506, -0.0805, -0.0491, -0.4085]], + # fmt: on + ] + ) + def test_stable_diffusion_mode(self, seed, expected_slice): + model = self.get_sd_vae_model() + image = self.get_sd_image(seed) + + with torch.no_grad(): + sample = model(image).sample + + assert sample.shape == image.shape + + output_slice = sample[-1, -2:, -2:, :2].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-3) + + @parameterized.expand( + [ + # fmt: off + [13, [-0.2051, -0.1803, -0.2311, -0.2114, -0.3292, -0.3574, -0.2953, -0.3323]], + [37, [-0.2632, -0.2625, -0.2199, -0.2741, -0.4539, -0.4990, -0.3720, -0.4925]], + # fmt: on + ] + ) + @require_torch_gpu + def test_stable_diffusion_decode(self, seed, expected_slice): + model = self.get_sd_vae_model() + encoding = self.get_sd_image(seed, shape=(3, 4, 64, 64)) + + with torch.no_grad(): + sample = model.decode(encoding).sample + + assert list(sample.shape) == [3, 3, 512, 512] + + output_slice = sample[-1, -2:, :2, -2:].flatten().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-3) + + @parameterized.expand( + [ + # fmt: off + [27, [-0.0369, 0.0207, -0.0776, -0.0682, -0.1747, -0.1930, -0.1465, -0.2039]], + [16, [-0.1628, -0.2134, -0.2747, -0.2642, -0.3774, -0.4404, -0.3687, -0.4277]], + # fmt: on + ] + ) + def test_stable_diffusion_decode_fp16(self, seed, expected_slice): + model = self.get_sd_vae_model(fp16=True) + encoding = self.get_sd_image(seed, shape=(3, 4, 64, 64), fp16=True) + + with torch.no_grad(): + sample = model.decode(encoding).sample + + assert list(sample.shape) == [3, 3, 512, 512] + + output_slice = sample[-1, -2:, :2, -2:].flatten().float().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=5e-3) + + @parameterized.expand( + [ + # fmt: off + [33, [-0.3001, 0.0918, -2.6984, -3.9720, -3.2099, -5.0353, 1.7338, -0.2065, 3.4267]], + [47, [-1.5030, -4.3871, -6.0355, -9.1157, -1.6661, -2.7853, 2.1607, -5.0823, 2.5633]], + # fmt: on + ] + ) + def test_stable_diffusion_encode_sample(self, seed, expected_slice): + model = self.get_sd_vae_model() + image = self.get_sd_image(seed) + generator = self.get_generator(seed) + + with torch.no_grad(): + dist = model.encode(image).latent_dist + sample = dist.sample(generator=generator) + + assert list(sample.shape) == [image.shape[0], 4] + [i // 8 for i in image.shape[2:]] + + output_slice = sample[0, -1, -3:, -3:].flatten().cpu() + expected_output_slice = torch.tensor(expected_slice) + + assert torch_all_close(output_slice, expected_output_slice, atol=1e-3) diff --git a/tests/test_models_vae_flax.py b/tests/models/test_models_vae_flax.py similarity index 94% rename from tests/test_models_vae_flax.py rename to tests/models/test_models_vae_flax.py index e5c56b61a5a4..8fedb85eccfc 100644 --- a/tests/test_models_vae_flax.py +++ b/tests/models/test_models_vae_flax.py @@ -4,7 +4,7 @@ from diffusers.utils import is_flax_available from diffusers.utils.testing_utils import require_flax -from .test_modeling_common_flax import 
FlaxModelTesterMixin +from ..test_modeling_common_flax import FlaxModelTesterMixin if is_flax_available(): diff --git a/tests/test_models_vq.py b/tests/models/test_models_vq.py similarity index 98% rename from tests/test_models_vq.py rename to tests/models/test_models_vq.py index 9a2094d46cb4..f58e90469885 100644 --- a/tests/test_models_vq.py +++ b/tests/models/test_models_vq.py @@ -20,7 +20,7 @@ from diffusers import VQModel from diffusers.utils import floats_tensor, torch_device -from .test_modeling_common import ModelTesterMixin +from ..test_modeling_common import ModelTesterMixin torch.backends.cuda.matmul.allow_tf32 = False diff --git a/tests/pipelines/dance_diffusion/test_dance_diffusion.py b/tests/pipelines/dance_diffusion/test_dance_diffusion.py index 72e67e4479d2..737d1c57d154 100644 --- a/tests/pipelines/dance_diffusion/test_dance_diffusion.py +++ b/tests/pipelines/dance_diffusion/test_dance_diffusion.py @@ -86,7 +86,7 @@ def tearDown(self): def test_dance_diffusion(self): device = torch_device - pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k") + pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k", device_map="auto") pipe = pipe.to(device) pipe.set_progress_bar_config(disable=None) @@ -103,7 +103,9 @@ def test_dance_diffusion(self): def test_dance_diffusion_fp16(self): device = torch_device - pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k", torch_dtype=torch.float16) + pipe = DanceDiffusionPipeline.from_pretrained( + "harmonai/maestro-150k", torch_dtype=torch.float16, device_map="auto" + ) pipe = pipe.to(device) pipe.set_progress_bar_config(disable=None) diff --git a/tests/pipelines/ddim/test_ddim.py b/tests/pipelines/ddim/test_ddim.py index 4445fe7feecf..41280a9bb4f0 100644 --- a/tests/pipelines/ddim/test_ddim.py +++ b/tests/pipelines/ddim/test_ddim.py @@ -78,7 +78,7 @@ class DDIMPipelineIntegrationTests(unittest.TestCase): def test_inference_ema_bedroom(self): model_id = "google/ddpm-ema-bedroom-256" - unet = UNet2DModel.from_pretrained(model_id) + unet = UNet2DModel.from_pretrained(model_id, device_map="auto") scheduler = DDIMScheduler.from_config(model_id) ddpm = DDIMPipeline(unet=unet, scheduler=scheduler) @@ -97,7 +97,7 @@ def test_inference_ema_bedroom(self): def test_inference_cifar10(self): model_id = "google/ddpm-cifar10-32" - unet = UNet2DModel.from_pretrained(model_id) + unet = UNet2DModel.from_pretrained(model_id, device_map="auto") scheduler = DDIMScheduler() ddim = DDIMPipeline(unet=unet, scheduler=scheduler) diff --git a/tests/pipelines/ddpm/test_ddpm.py b/tests/pipelines/ddpm/test_ddpm.py index c58e2db38f21..c6b24309ced0 100644 --- a/tests/pipelines/ddpm/test_ddpm.py +++ b/tests/pipelines/ddpm/test_ddpm.py @@ -38,7 +38,7 @@ class DDPMPipelineIntegrationTests(unittest.TestCase): def test_inference_cifar10(self): model_id = "google/ddpm-cifar10-32" - unet = UNet2DModel.from_pretrained(model_id) + unet = UNet2DModel.from_pretrained(model_id, device_map="auto") scheduler = DDPMScheduler.from_config(model_id) ddpm = DDPMPipeline(unet=unet, scheduler=scheduler) diff --git a/tests/pipelines/karras_ve/test_karras_ve.py b/tests/pipelines/karras_ve/test_karras_ve.py index 1fafa1cb40fa..caca35f693cf 100644 --- a/tests/pipelines/karras_ve/test_karras_ve.py +++ b/tests/pipelines/karras_ve/test_karras_ve.py @@ -70,7 +70,7 @@ def test_inference(self): class KarrasVePipelineIntegrationTests(unittest.TestCase): def test_inference(self): model_id = "google/ncsnpp-celebahq-256" - model = 
UNet2DModel.from_pretrained(model_id) + model = UNet2DModel.from_pretrained(model_id, device_map="auto") scheduler = KarrasVeScheduler() pipe = KarrasVePipeline(unet=model, scheduler=scheduler) diff --git a/tests/pipelines/latent_diffusion/test_latent_diffusion.py b/tests/pipelines/latent_diffusion/test_latent_diffusion.py index 085cdb4e766b..beb30a24f6d0 100644 --- a/tests/pipelines/latent_diffusion/test_latent_diffusion.py +++ b/tests/pipelines/latent_diffusion/test_latent_diffusion.py @@ -121,7 +121,7 @@ def test_inference_text2img(self): @require_torch class LDMTextToImagePipelineIntegrationTests(unittest.TestCase): def test_inference_text2img(self): - ldm = LDMTextToImagePipeline.from_pretrained("CompVis/ldm-text2im-large-256") + ldm = LDMTextToImagePipeline.from_pretrained("CompVis/ldm-text2im-large-256", device_map="auto") ldm.to(torch_device) ldm.set_progress_bar_config(disable=None) @@ -138,7 +138,7 @@ def test_inference_text2img(self): assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2 def test_inference_text2img_fast(self): - ldm = LDMTextToImagePipeline.from_pretrained("CompVis/ldm-text2im-large-256") + ldm = LDMTextToImagePipeline.from_pretrained("CompVis/ldm-text2im-large-256", device_map="auto") ldm.to(torch_device) ldm.set_progress_bar_config(disable=None) diff --git a/tests/pipelines/pndm/test_pndm.py b/tests/pipelines/pndm/test_pndm.py index 5d9212223e6e..02f5b1fe4fb9 100644 --- a/tests/pipelines/pndm/test_pndm.py +++ b/tests/pipelines/pndm/test_pndm.py @@ -71,7 +71,7 @@ class PNDMPipelineIntegrationTests(unittest.TestCase): def test_inference_cifar10(self): model_id = "google/ddpm-cifar10-32" - unet = UNet2DModel.from_pretrained(model_id) + unet = UNet2DModel.from_pretrained(model_id, device_map="auto") scheduler = PNDMScheduler() pndm = PNDMPipeline(unet=unet, scheduler=scheduler) diff --git a/tests/pipelines/score_sde_ve/test_score_sde_ve.py b/tests/pipelines/score_sde_ve/test_score_sde_ve.py index 55dcc1cea1e0..e2f7bc22eca2 100644 --- a/tests/pipelines/score_sde_ve/test_score_sde_ve.py +++ b/tests/pipelines/score_sde_ve/test_score_sde_ve.py @@ -72,7 +72,7 @@ def test_inference(self): class ScoreSdeVePipelineIntegrationTests(unittest.TestCase): def test_inference(self): model_id = "google/ncsnpp-church-256" - model = UNet2DModel.from_pretrained(model_id) + model = UNet2DModel.from_pretrained(model_id, device_map="auto") scheduler = ScoreSdeVeScheduler.from_config(model_id) diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion.py b/tests/pipelines/stable_diffusion/test_stable_diffusion.py index a81710987814..6b23aca33bc6 100644 --- a/tests/pipelines/stable_diffusion/test_stable_diffusion.py +++ b/tests/pipelines/stable_diffusion/test_stable_diffusion.py @@ -528,7 +528,7 @@ def tearDown(self): def test_stable_diffusion(self): # make sure here that pndm scheduler skips prk - sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-1") + sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-1", device_map="auto") sd_pipe = sd_pipe.to(torch_device) sd_pipe.set_progress_bar_config(disable=None) @@ -548,7 +548,7 @@ def test_stable_diffusion(self): assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2 def test_stable_diffusion_fast_ddim(self): - sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-1") + sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-1", device_map="auto") sd_pipe = sd_pipe.to(torch_device) 
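Most of these integration tests now pass `device_map="auto"` to `from_pretrained`, letting `accelerate` place the weights at load time. A sketch of the same loading pattern outside the test suite, assuming `accelerate` is installed and a GPU is available (checkpoint and prompt are the ones used in the tests):

```python
import torch
from diffusers import StableDiffusionPipeline

# device_map="auto" lets accelerate dispatch the weights at load time instead of
# materialising everything on CPU first.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    device_map="auto",
).to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
```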
sd_pipe.set_progress_bar_config(disable=None) @@ -576,7 +576,7 @@ def test_stable_diffusion_fast_ddim(self): def test_lms_stable_diffusion_pipeline(self): model_id = "CompVis/stable-diffusion-v1-1" - pipe = StableDiffusionPipeline.from_pretrained(model_id).to(torch_device) + pipe = StableDiffusionPipeline.from_pretrained(model_id, device_map="auto").to(torch_device) pipe.set_progress_bar_config(disable=None) scheduler = LMSDiscreteScheduler.from_config(model_id, subfolder="scheduler") pipe.scheduler = scheduler @@ -595,9 +595,10 @@ def test_lms_stable_diffusion_pipeline(self): def test_stable_diffusion_memory_chunking(self): torch.cuda.reset_peak_memory_stats() model_id = "CompVis/stable-diffusion-v1-4" - pipe = StableDiffusionPipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16).to( - torch_device + pipe = StableDiffusionPipeline.from_pretrained( + model_id, revision="fp16", torch_dtype=torch.float16, device_map="auto" ) + pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) prompt = "a photograph of an astronaut riding a horse" @@ -633,9 +634,10 @@ def test_stable_diffusion_memory_chunking(self): def test_stable_diffusion_text2img_pipeline_fp16(self): torch.cuda.reset_peak_memory_stats() model_id = "CompVis/stable-diffusion-v1-4" - pipe = StableDiffusionPipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16).to( - torch_device + pipe = StableDiffusionPipeline.from_pretrained( + model_id, revision="fp16", device_map="auto", torch_dtype=torch.float16 ) + pipe = pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) prompt = "a photograph of an astronaut riding a horse" @@ -670,6 +672,7 @@ def test_stable_diffusion_text2img_pipeline(self): pipe = StableDiffusionPipeline.from_pretrained( model_id, safety_checker=None, + device_map="auto", ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -711,7 +714,7 @@ def test_callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> No test_callback_fn.has_been_called = False pipe = StableDiffusionPipeline.from_pretrained( - "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16 + "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, device_map="auto" ) pipe = pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -737,7 +740,7 @@ def test_stable_diffusion_accelerate_auto_device(self): start_time = time.time() pipeline_normal_load = StableDiffusionPipeline.from_pretrained( - pipeline_id, revision="fp16", torch_dtype=torch.float16, use_auth_token=True + pipeline_id, revision="fp16", torch_dtype=torch.float16 ) pipeline_normal_load.to(torch_device) normal_load_time = time.time() - start_time diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py b/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py index 409bb225567a..7cbb9bc51e81 100644 --- a/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py +++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py @@ -488,6 +488,7 @@ def test_stable_diffusion_img2img_pipeline(self): pipe = StableDiffusionImg2ImgPipeline.from_pretrained( model_id, safety_checker=None, + device_map="auto", ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -529,6 +530,7 @@ def test_stable_diffusion_img2img_pipeline_k_lms(self): model_id, scheduler=lms, safety_checker=None, + device_map="auto", ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -580,7 +582,7 @@ def 
test_callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> No init_image = init_image.resize((768, 512)) pipe = StableDiffusionImg2ImgPipeline.from_pretrained( - "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16 + "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, device_map="auto" ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py b/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py index 14b4990784e4..b577436f4e5d 100644 --- a/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py +++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py @@ -288,6 +288,7 @@ def test_stable_diffusion_inpaint_pipeline(self): pipe = StableDiffusionInpaintPipeline.from_pretrained( model_id, safety_checker=None, + device_map="auto", ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -329,6 +330,7 @@ def test_stable_diffusion_inpaint_pipeline_fp16(self): revision="fp16", torch_dtype=torch.float16, safety_checker=None, + device_map="auto", ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -366,7 +368,9 @@ def test_stable_diffusion_inpaint_pipeline_pndm(self): pndm = PNDMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", skip_prk_steps=True) model_id = "runwayml/stable-diffusion-inpainting" - pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, safety_checker=None, scheduler=pndm) + pipe = StableDiffusionInpaintPipeline.from_pretrained( + model_id, safety_checker=None, scheduler=pndm, device_map="auto" + ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) pipe.enable_attention_slicing() diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint_legacy.py b/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint_legacy.py index 697da4c921df..90dbf3b63604 100644 --- a/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint_legacy.py +++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint_legacy.py @@ -368,6 +368,7 @@ def test_stable_diffusion_inpaint_legacy_pipeline(self): pipe = StableDiffusionInpaintPipeline.from_pretrained( model_id, safety_checker=None, + device_map="auto", ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -413,6 +414,7 @@ def test_stable_diffusion_inpaint_legacy_pipeline_k_lms(self): model_id, scheduler=lms, safety_checker=None, + device_map="auto", ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -469,7 +471,7 @@ def test_callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> No ) pipe = StableDiffusionInpaintPipeline.from_pretrained( - "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16 + "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, device_map="auto" ) pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) diff --git a/tests/test_models_vae.py b/tests/test_models_vae.py deleted file mode 100644 index 49610f848364..000000000000 --- a/tests/test_models_vae.py +++ /dev/null @@ -1,132 +0,0 @@ -# coding=utf-8 -# Copyright 2022 HuggingFace Inc. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import unittest - -import torch - -from diffusers import AutoencoderKL -from diffusers.modeling_utils import ModelMixin -from diffusers.utils import floats_tensor, torch_device - -from .test_modeling_common import ModelTesterMixin - - -torch.backends.cuda.matmul.allow_tf32 = False - - -class AutoencoderKLTests(ModelTesterMixin, unittest.TestCase): - model_class = AutoencoderKL - - @property - def dummy_input(self): - batch_size = 4 - num_channels = 3 - sizes = (32, 32) - - image = floats_tensor((batch_size, num_channels) + sizes).to(torch_device) - - return {"sample": image} - - @property - def input_shape(self): - return (3, 32, 32) - - @property - def output_shape(self): - return (3, 32, 32) - - def prepare_init_args_and_inputs_for_common(self): - init_dict = { - "block_out_channels": [32, 64], - "in_channels": 3, - "out_channels": 3, - "down_block_types": ["DownEncoderBlock2D", "DownEncoderBlock2D"], - "up_block_types": ["UpDecoderBlock2D", "UpDecoderBlock2D"], - "latent_channels": 4, - } - inputs_dict = self.dummy_input - return init_dict, inputs_dict - - def test_forward_signature(self): - pass - - def test_training(self): - pass - - def test_from_pretrained_hub(self): - model, loading_info = AutoencoderKL.from_pretrained("fusing/autoencoder-kl-dummy", output_loading_info=True) - self.assertIsNotNone(model) - self.assertEqual(len(loading_info["missing_keys"]), 0) - - model.to(torch_device) - image = model(**self.dummy_input) - - assert image is not None, "Make sure output is not None" - - def test_output_pretrained(self): - model = AutoencoderKL.from_pretrained("fusing/autoencoder-kl-dummy") - model = model.to(torch_device) - model.eval() - - # One-time warmup pass (see #372) - if torch_device == "mps" and isinstance(model, ModelMixin): - image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) - image = image.to(torch_device) - with torch.no_grad(): - _ = model(image, sample_posterior=True).sample - generator = torch.manual_seed(0) - else: - generator = torch.Generator(device=torch_device).manual_seed(0) - - image = torch.randn( - 1, - model.config.in_channels, - model.config.sample_size, - model.config.sample_size, - generator=torch.manual_seed(0), - ) - image = image.to(torch_device) - with torch.no_grad(): - output = model(image, sample_posterior=True, generator=generator).sample - - output_slice = output[0, -1, -3:, -3:].flatten().cpu() - - # Since the VAE Gaussian prior's generator is seeded on the appropriate device, - # the expected output slices are not the same for CPU and GPU. 
- if torch_device == "mps": - expected_output_slice = torch.tensor( - [ - -4.0078e-01, - -3.8323e-04, - -1.2681e-01, - -1.1462e-01, - 2.0095e-01, - 1.0893e-01, - -8.8247e-02, - -3.0361e-01, - -9.8644e-03, - ] - ) - elif torch_device == "cpu": - expected_output_slice = torch.tensor( - [-0.1352, 0.0878, 0.0419, -0.0818, -0.1069, 0.0688, -0.1458, -0.4446, -0.0026] - ) - else: - expected_output_slice = torch.tensor( - [-0.2421, 0.4642, 0.2507, -0.0438, 0.0682, 0.3160, -0.2018, -0.0727, 0.2485] - ) - - self.assertTrue(torch.allclose(output_slice, expected_output_slice, rtol=1e-2)) diff --git a/tests/test_pipelines.py b/tests/test_pipelines.py index 3cc94962c38f..e355a19493ed 100644 --- a/tests/test_pipelines.py +++ b/tests/test_pipelines.py @@ -77,6 +77,7 @@ def test_load_custom_pipeline(self): pipeline = DiffusionPipeline.from_pretrained( "google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline" ) + pipeline = pipeline.to(torch_device) # NOTE that `"CustomPipeline"` is not a class that is defined in this library, but solely on the Hub # under https://huggingface.co/hf-internal-testing/diffusers-dummy-pipeline/blob/main/pipeline.py#L24 assert pipeline.__class__.__name__ == "CustomPipeline" @@ -85,6 +86,7 @@ def test_run_custom_pipeline(self): pipeline = DiffusionPipeline.from_pretrained( "google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline" ) + pipeline = pipeline.to(torch_device) images, output_str = pipeline(num_inference_steps=2, output_type="np") assert images[0].shape == (1, 32, 32, 3) @@ -96,6 +98,7 @@ def test_local_custom_pipeline(self): pipeline = DiffusionPipeline.from_pretrained( "google/ddpm-cifar10-32", custom_pipeline=local_custom_pipeline_path ) + pipeline = pipeline.to(torch_device) images, output_str = pipeline(num_inference_steps=2, output_type="np") assert pipeline.__class__.__name__ == "CustomLocalPipeline" @@ -108,7 +111,7 @@ def test_local_custom_pipeline(self): def test_load_pipeline_from_git(self): clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K" - feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id) + feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id, device_map="auto") clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16) pipeline = DiffusionPipeline.from_pretrained( @@ -118,6 +121,7 @@ def test_load_pipeline_from_git(self): feature_extractor=feature_extractor, torch_dtype=torch.float16, revision="fp16", + device_map="auto", ) pipeline.enable_attention_slicing() pipeline = pipeline.to(torch_device) @@ -312,7 +316,9 @@ def tearDown(self): def test_smart_download(self): model_id = "hf-internal-testing/unet-pipeline-dummy" with tempfile.TemporaryDirectory() as tmpdirname: - _ = DiffusionPipeline.from_pretrained(model_id, cache_dir=tmpdirname, force_download=True) + _ = DiffusionPipeline.from_pretrained( + model_id, cache_dir=tmpdirname, force_download=True, device_map="auto" + ) local_repo_name = "--".join(["models"] + model_id.split("/")) snapshot_dir = os.path.join(tmpdirname, local_repo_name, "snapshots") snapshot_dir = os.path.join(snapshot_dir, os.listdir(snapshot_dir)[0]) @@ -335,7 +341,9 @@ def test_warning_unused_kwargs(self): logger = logging.get_logger("diffusers.pipeline_utils") with tempfile.TemporaryDirectory() as tmpdirname: with CaptureLogger(logger) as cap_logger: - DiffusionPipeline.from_pretrained(model_id, not_used=True, cache_dir=tmpdirname, force_download=True) + DiffusionPipeline.from_pretrained( + model_id, 
not_used=True, cache_dir=tmpdirname, force_download=True, device_map="auto" + ) assert cap_logger.out == "Keyword arguments {'not_used': True} not recognized.\n" @@ -358,7 +366,7 @@ def test_from_pretrained_save_pretrained(self): with tempfile.TemporaryDirectory() as tmpdirname: ddpm.save_pretrained(tmpdirname) - new_ddpm = DDPMPipeline.from_pretrained(tmpdirname) + new_ddpm = DDPMPipeline.from_pretrained(tmpdirname, device_map="auto") new_ddpm.to(torch_device) generator = torch.manual_seed(0) @@ -374,11 +382,12 @@ def test_from_pretrained_hub(self): scheduler = DDPMScheduler(num_train_timesteps=10) - ddpm = DDPMPipeline.from_pretrained(model_path, scheduler=scheduler) - ddpm.to(torch_device) + ddpm = DDPMPipeline.from_pretrained(model_path, scheduler=scheduler, device_map="auto") + ddpm = ddpm.to(torch_device) ddpm.set_progress_bar_config(disable=None) - ddpm_from_hub = DiffusionPipeline.from_pretrained(model_path, scheduler=scheduler) - ddpm_from_hub.to(torch_device) + + ddpm_from_hub = DiffusionPipeline.from_pretrained(model_path, scheduler=scheduler, device_map="auto") + ddpm_from_hub = ddpm_from_hub.to(torch_device) ddpm_from_hub.set_progress_bar_config(disable=None) generator = torch.manual_seed(0) @@ -395,13 +404,15 @@ def test_from_pretrained_hub_pass_model(self): scheduler = DDPMScheduler(num_train_timesteps=10) # pass unet into DiffusionPipeline - unet = UNet2DModel.from_pretrained(model_path) - ddpm_from_hub_custom_model = DiffusionPipeline.from_pretrained(model_path, unet=unet, scheduler=scheduler) - ddpm_from_hub_custom_model.to(torch_device) + unet = UNet2DModel.from_pretrained(model_path, device_map="auto") + ddpm_from_hub_custom_model = DiffusionPipeline.from_pretrained( + model_path, unet=unet, scheduler=scheduler, device_map="auto" + ) + ddpm_from_hub_custom_model = ddpm_from_hub_custom_model.to(torch_device) ddpm_from_hub_custom_model.set_progress_bar_config(disable=None) - ddpm_from_hub = DiffusionPipeline.from_pretrained(model_path, scheduler=scheduler) - ddpm_from_hub.to(torch_device) + ddpm_from_hub = DiffusionPipeline.from_pretrained(model_path, scheduler=scheduler, device_map="auto") + ddpm_from_hub = ddpm_from_hub.to(torch_device) ddpm_from_hub_custom_model.set_progress_bar_config(disable=None) generator = torch.manual_seed(0) @@ -415,7 +426,7 @@ def test_from_pretrained_hub_pass_model(self): def test_output_format(self): model_path = "google/ddpm-cifar10-32" - pipe = DDIMPipeline.from_pretrained(model_path) + pipe = DDIMPipeline.from_pretrained(model_path, device_map="auto") pipe.to(torch_device) pipe.set_progress_bar_config(disable=None) @@ -437,7 +448,7 @@ def test_output_format(self): def test_ddpm_ddim_equality(self): model_id = "google/ddpm-cifar10-32" - unet = UNet2DModel.from_pretrained(model_id) + unet = UNet2DModel.from_pretrained(model_id, device_map="auto") ddpm_scheduler = DDPMScheduler() ddim_scheduler = DDIMScheduler() @@ -461,7 +472,7 @@ def test_ddpm_ddim_equality(self): def test_ddpm_ddim_equality_batched(self): model_id = "google/ddpm-cifar10-32" - unet = UNet2DModel.from_pretrained(model_id) + unet = UNet2DModel.from_pretrained(model_id, device_map="auto") ddpm_scheduler = DDPMScheduler() ddim_scheduler = DDIMScheduler()
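
All of the test updates above exercise the same loading path: pass `device_map="auto"` to `from_pretrained` and move the pipeline to the target device afterwards. The sketch below is illustrative and not part of the patch; it assumes a CUDA GPU and an `accelerate` installation, and reuses the model id, dtype, prompt, and calls that appear in the tests.

```python
# Illustrative sketch only (not part of this diff): load Stable Diffusion with
# device_map="auto", as the updated tests do, then run a single prompt.
# Assumes `accelerate` is installed and a CUDA device is available.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    device_map="auto",  # weights are placed by accelerate as they are loaded
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()  # optional, lowers peak memory as in the memory-chunking test

image = pipe("a photograph of an astronaut riding a horse").images[0]
```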