diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index a2a08b52db42..b56d9c094dab 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -113,27 +113,35 @@ - sections: - local: optimization/opt_overview title: Overview - - local: optimization/fp16 - title: Memory and Speed - - local: optimization/torch2.0 - title: Torch2.0 support - - local: using-diffusers/stable_diffusion_jax_how_to - title: Stable Diffusion in JAX/Flax - - local: optimization/xformers - title: xFormers - - local: optimization/onnx - title: ONNX - - local: optimization/open_vino - title: OpenVINO - - local: optimization/coreml - title: Core ML - - local: optimization/mps - title: MPS - - local: optimization/habana - title: Habana Gaudi - - local: optimization/tome - title: Token Merging - title: Optimization/Special Hardware + - sections: + - local: optimization/fp16 + title: Speed up inference + - local: optimization/memory + title: Reduce memory usage + - local: optimization/torch2.0 + title: Torch 2.0 + - local: optimization/xformers + title: xFormers + - local: optimization/tome + title: Token merging + title: General optimizations + - sections: + - local: using-diffusers/stable_diffusion_jax_how_to + title: JAX/Flax + - local: optimization/onnx + title: ONNX + - local: optimization/open_vino + title: OpenVINO + - local: optimization/coreml + title: Core ML + title: Optimized model types + - sections: + - local: optimization/mps + title: Metal Performance Shaders (MPS) + - local: optimization/habana + title: Habana Gaudi + title: Optimized hardware + title: Optimization - sections: - local: conceptual/philosophy title: Philosophy diff --git a/docs/source/en/optimization/fp16.md b/docs/source/en/optimization/fp16.md index 063ad8b305ae..2ac16786eb46 100644 --- a/docs/source/en/optimization/fp16.md +++ b/docs/source/en/optimization/fp16.md @@ -10,13 +10,19 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Memory and speed +# Speed up inference -We present some techniques and ideas to optimize πŸ€— Diffusers _inference_ for memory or speed. As a general rule, we recommend the use of [xFormers](https://github.com/facebookresearch/xformers) for memory efficient attention, please see the recommended [installation instructions](xformers). +There are several ways to optimize πŸ€— Diffusers for inference speed. As a general rule of thumb, we recommend using either [xFormers](xformers) or `torch.nn.functional.scaled_dot_product_attention` in PyTorch 2.0 for their memory-efficient attention. -We'll discuss how the following settings impact performance and memory. + + +In many cases, optimizing for speed or memory leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about preserving memory in the [Reduce memory usage](memory) guide. + + + +The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect. -| | Latency | Speedup | +| | latency | speed-up | | ---------------- | ------- | ------- | | original | 9.50s | x1 | | fp16 | 3.61s | x2.63 | @@ -24,15 +30,9 @@ We'll discuss how the following settings impact performance and memory. 
| traced UNet | 3.21s | x2.96 | | memory efficient attention | 2.63s | x3.61 | - - obtained on NVIDIA TITAN RTX by generating a single image of size 512x512 from - the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM - steps. - - -### Use tf32 instead of fp32 (on Ampere and later CUDA devices) +## Use TensorFloat-32 -On Ampere and later CUDA devices matrix multiplications and convolutions can use the TensorFloat32 (TF32) mode for faster but slightly less accurate computations. By default PyTorch enables TF32 mode for convolutions but not matrix multiplications, and unless a network requires full float32 precision we recommend enabling this setting for matrix multiplications, too. It can significantly speed up computations with typically negligible loss of numerical accuracy. You can read more about it [here](https://huggingface.co/docs/transformers/v4.18.0/en/performance#tf32). All you need to do is to add this before your inference: +On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speeds up computations with typically negligible loss in numerical accuracy. ```python import torch @@ -40,9 +40,11 @@ import torch torch.backends.cuda.matmul.allow_tf32 = True ``` -## Half precision weights +You can learn more about TF32 in the [Mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32) guide. -To save more GPU memory and get more speed, you can load and run the model weights directly in half precision. This involves loading the float16 version of the weights, which was saved to a branch named `fp16`, and telling PyTorch to use the `float16` type when loading them: +## Half-precision weights + +To save GPU memory and get more speed, try loading and running the model weights directly in half-precision or float16: ```Python import torch @@ -61,351 +63,6 @@ image = pipe(prompt).images[0] - It is strongly discouraged to make use of [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than using pure - float16 precision. +Don't use [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than pure float16 precision. - - -## Sliced VAE decode for larger batches - -To decode large batches of images with limited VRAM, or to enable batches with 32 images or more, you can use sliced VAE decode that decodes the batch latents one image at a time. - -You likely want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use. - -To perform the VAE decode one image at a time, invoke [`~StableDiffusionPipeline.enable_vae_slicing`] in your pipeline before inference. 
For example: - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) -pipe = pipe.to("cuda") - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_vae_slicing() -images = pipe([prompt] * 32).images -``` - -You may see a small performance boost in VAE decode on multi-image batches. There should be no performance impact on single-image batches. - - -## Tiled VAE decode and encode for large images - -Tiled VAE processing makes it possible to work with large images on limited VRAM. For example, generating 4k images in 8GB of VRAM. Tiled VAE decoder splits the image into overlapping tiles, decodes the tiles, and blends the outputs to make the final image. - -You want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use. - -To use tiled VAE processing, invoke [`~StableDiffusionPipeline.enable_vae_tiling`] in your pipeline before inference. For example: - -```python -import torch -from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) -pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") -prompt = "a beautiful landscape photograph" -pipe.enable_vae_tiling() -pipe.enable_xformers_memory_efficient_attention() - -image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0] -``` - -The output image will have some tile-to-tile tone variation from the tiles having separate decoders, but you shouldn't see sharp seams between the tiles. The tiling is turned off for images that are 512x512 or smaller. - - - -## Offloading to CPU with accelerate for memory savings - -For additional memory savings, you can offload the weights to CPU and only load them to GPU when performing the forward pass. - -To perform CPU offloading, all you have to do is invoke [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]: - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_sequential_cpu_offload() -image = pipe(prompt).images[0] -``` - -And you can get the memory consumption to < 3GB. - -Note that this method works at the submodule level, not on whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different submodules of the UNet are sequentially onloaded and then offloaded as they are needed, so the number of memory transfers is large. - - -Consider using model offloading as another point in the optimization space: it will be much faster, but memory savings won't be as large. - - -It is also possible to chain offloading with attention slicing for minimal memory consumption (< 2GB). 
- -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_sequential_cpu_offload() - -image = pipe(prompt).images[0] -``` - -**Note**: When using `enable_sequential_cpu_offload()`, it is important to **not** move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal. See [this issue](https://github.com/huggingface/diffusers/issues/1934) for more information. - -**Note**: `enable_sequential_cpu_offload()` is a stateful operation that installs hooks on the models. - - - -## Model offloading for fast inference and memory savings - -[Sequential CPU offloading](#sequential_offloading), as discussed in the previous section, preserves a lot of memory but makes inference slower, because submodules are moved to GPU as needed, and immediately returned to CPU when a new module runs. - -Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent _modules_. This results in a negligible impact on inference time (compared with moving the pipeline to `cuda`), while still providing some memory savings. - -In this scenario, only one of the main components of the pipeline (typically: text encoder, unet and vae) -will be in the GPU while the others wait in the CPU. Components like the UNet that run for multiple iterations will stay on GPU until they are no longer needed. - -This feature can be enabled by invoking `enable_model_cpu_offload()` on the pipeline, as shown below. - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_model_cpu_offload() -image = pipe(prompt).images[0] -``` - -This is also compatible with attention slicing for additional memory savings. - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_model_cpu_offload() - -image = pipe(prompt).images[0] -``` - - -This feature requires `accelerate` version 0.17.0 or larger. - - -**Note**: `enable_model_cpu_offload()` is a stateful operation that installs hooks on the models and state on the pipeline. In order to properly offload -models after they are called, it is required that the entire pipeline is run and models are called in the order the pipeline expects them to be. Exercise caution -if models are re-used outside the context of the pipeline after hooks have been installed. See [accelerate](https://huggingface.co/docs/accelerate/v0.18.0/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) -for further docs on removing hooks. - -## Using Channels Last memory format - -Channels last memory format is an alternative way of ordering NCHW tensors in memory preserving dimensions ordering. Channels last tensors ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel). 
Since not all operators currently support channels last format it may result in a worst performance, so it's better to try it and see if it works for your model. - -For example, in order to set the UNet model in our pipeline to use channels last format, we can use the following: - -```python -print(pipe.unet.conv_out.state_dict()["weight"].stride()) # (2880, 9, 3, 1) -pipe.unet.to(memory_format=torch.channels_last) # in-place operation -print( - pipe.unet.conv_out.state_dict()["weight"].stride() -) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works -``` - -## Tracing - -Tracing runs an example input tensor through your model, and captures the operations that are invoked as that input makes its way through the model's layers so that an executable or `ScriptFunction` is returned that will be optimized using just-in-time compilation. - -To trace our UNet model, we can use the following: - -```python -import time -import torch -from diffusers import StableDiffusionPipeline -import functools - -# torch disable grad -torch.set_grad_enabled(False) - -# set variables -n_experiments = 2 -unet_runs_per_experiment = 50 - - -# load inputs -def generate_inputs(): - sample = torch.randn(2, 4, 64, 64).half().cuda() - timestep = torch.rand(1).half().cuda() * 999 - encoder_hidden_states = torch.randn(2, 77, 768).half().cuda() - return sample, timestep, encoder_hidden_states - - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") -unet = pipe.unet -unet.eval() -unet.to(memory_format=torch.channels_last) # use channels_last memory format -unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default - -# warmup -for _ in range(3): - with torch.inference_mode(): - inputs = generate_inputs() - orig_output = unet(*inputs) - -# trace -print("tracing..") -unet_traced = torch.jit.trace(unet, inputs) -unet_traced.eval() -print("done tracing") - - -# warmup and optimize graph -for _ in range(5): - with torch.inference_mode(): - inputs = generate_inputs() - orig_output = unet_traced(*inputs) - - -# benchmarking -with torch.inference_mode(): - for _ in range(n_experiments): - torch.cuda.synchronize() - start_time = time.time() - for _ in range(unet_runs_per_experiment): - orig_output = unet_traced(*inputs) - torch.cuda.synchronize() - print(f"unet traced inference took {time.time() - start_time:.2f} seconds") - for _ in range(n_experiments): - torch.cuda.synchronize() - start_time = time.time() - for _ in range(unet_runs_per_experiment): - orig_output = unet(*inputs) - torch.cuda.synchronize() - print(f"unet inference took {time.time() - start_time:.2f} seconds") - -# save the model -unet_traced.save("unet_traced.pt") -``` - -Then we can replace the `unet` attribute of the pipeline with the traced model like the following - -```python -from diffusers import StableDiffusionPipeline -import torch -from dataclasses import dataclass - - -@dataclass -class UNet2DConditionOutput: - sample: torch.FloatTensor - - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") - -# use jitted unet -unet_traced = torch.jit.load("unet_traced.pt") - - -# del pipe.unet -class TracedUNet(torch.nn.Module): - def __init__(self): - super().__init__() - self.in_channels = pipe.unet.in_channels - self.device = pipe.unet.device - - def forward(self, latent_model_input, t, 
encoder_hidden_states): - sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0] - return UNet2DConditionOutput(sample=sample) - - -pipe.unet = TracedUNet() - -with torch.inference_mode(): - image = pipe([prompt] * 1, num_inference_steps=50).images[0] -``` - - -## Memory Efficient Attention - -Recent work on optimizing the bandwitdh in the attention block has generated huge speed ups and gains in GPU memory usage. The most recent being Flash Attention from @tridao: [code](https://github.com/HazyResearch/flash-attention), [paper](https://arxiv.org/pdf/2205.14135.pdf). - -Here are the speedups we obtain on a few Nvidia GPUs when running the inference at 512x512 with a batch size of 1 (one prompt): - -| GPU | Base Attention FP16 | Memory Efficient Attention FP16 | -|------------------ |--------------------- |--------------------------------- | -| NVIDIA Tesla T4 | 3.5it/s | 5.5it/s | -| NVIDIA 3060 RTX | 4.6it/s | 7.8it/s | -| NVIDIA A10G | 8.88it/s | 15.6it/s | -| NVIDIA RTX A6000 | 11.7it/s | 21.09it/s | -| NVIDIA TITAN RTX | 12.51it/s | 18.22it/s | -| A100-SXM4-40GB | 18.6it/s | 29.it/s | -| A100-SXM-80GB | 18.7it/s | 29.5it/s | - -To leverage it just make sure you have: - - - -If you have PyTorch 2.0 installed, you shouldn't use xFormers! - - - - - PyTorch > 1.12 - - Cuda available - - [Installed the xformers library](xformers). -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") - -pipe.enable_xformers_memory_efficient_attention() - -with torch.inference_mode(): - sample = pipe("a small cat") - -# optional: You can disable it via -# pipe.disable_xformers_memory_efficient_attention() -``` + \ No newline at end of file diff --git a/docs/source/en/optimization/habana.md b/docs/source/en/optimization/habana.md index 24846615c95c..c78c8ca3a1be 100644 --- a/docs/source/en/optimization/habana.md +++ b/docs/source/en/optimization/habana.md @@ -10,25 +10,22 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# How to use Stable Diffusion on Habana Gaudi +# Habana Gaudi -πŸ€— Diffusers is compatible with Habana Gaudi through πŸ€— [Optimum Habana](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). +πŸ€— Diffusers is compatible with Habana Gaudi through πŸ€— [Optimum](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). Follow the [installation](https://docs.habana.ai/en/latest/Installation_Guide/index.html) guide to install the SynapseAI and Gaudi drivers, and then install Optimum Habana: -## Requirements +```bash +python -m pip install --upgrade-strategy eager optimum[habana] +``` -- Optimum Habana 1.6 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it. -- SynapseAI 1.10. +To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances: +- [`~optimum.habana.diffusers.GaudiStableDiffusionPipeline`], a pipeline for text-to-image generation. +- [`~optimum.habana.diffusers.GaudiDDIMScheduler`], a Gaudi-optimized scheduler. -## Inference Pipeline +When you initialize the pipeline, you have to specify `use_habana=True` to deploy it on HPUs and to get the fastest possible generation, you should enable **HPU graphs** with `use_hpu_graphs=True`. 
-To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances: -- A pipeline with [`GaudiStableDiffusionPipeline`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline). This pipeline supports *text-to-image generation*. -- A scheduler with [`GaudiDDIMScheduler`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline#optimum.habana.diffusers.GaudiDDIMScheduler). This scheduler has been optimized for Habana Gaudi. - -When initializing the pipeline, you have to specify `use_habana=True` to deploy it on HPUs. -Furthermore, in order to get the fastest possible generations you should enable **HPU graphs** with `use_hpu_graphs=True`. -Finally, you will need to specify a [Gaudi configuration](https://huggingface.co/docs/optimum/habana/package_reference/gaudi_config) which can be downloaded from the [Hugging Face Hub](https://huggingface.co/Habana). +Finally, specify a [`~optimum.habana.GaudiConfig`] which can be downloaded from the [Habana](https://huggingface.co/Habana) organization on the Hub. ```python from optimum.habana import GaudiConfig @@ -45,7 +42,8 @@ pipeline = GaudiStableDiffusionPipeline.from_pretrained( ) ``` -You can then call the pipeline to generate images by batches from one or several prompts: +Now you can call the pipeline to generate images by batches from one or several prompts: + ```python outputs = pipeline( prompt=[ @@ -57,21 +55,21 @@ outputs = pipeline( ) ``` -For more information, check out Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository. +For more information, check out πŸ€— Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository. ## Benchmark -Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32): +We benchmarked Habana's first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32) to demonstrate their performance. 
-- [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) (512x512 resolution): +For [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) on 512x512 images: -| | Latency (batch size = 1) | Throughput (batch size = 8) | +| | Latency (batch size = 1) | Throughput | | ---------------------- |:------------------------:|:---------------------------:| -| first-generation Gaudi | 3.80s | 0.308 images/s | -| Gaudi2 | 1.33s | 1.081 images/s | +| first-generation Gaudi | 3.80s | 0.308 images/s (batch size = 8) | +| Gaudi2 | 1.33s | 1.081 images/s (batch size = 8) | -- [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) (768x768 resolution): +For [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on 768x768 images: | | Latency (batch size = 1) | Throughput | | ---------------------- |:------------------------:|:-------------------------------:| diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md new file mode 100644 index 000000000000..25c621231dcd --- /dev/null +++ b/docs/source/en/optimization/memory.md @@ -0,0 +1,367 @@ +# Reduce memory usage + +A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can even be combined to further reduce memory usage. + + + +In many cases, optimizing for memory or speed leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on minimizing memory usage, but you can also learn more about how to [Speed up inference](fp16). + + + +The results below are obtained from generating a single 512x512 image from the prompt a photo of an astronaut riding a horse on mars with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect as a result of reduced memory consumption. + +| | latency | speed-up | +| ---------------- | ------- | ------- | +| original | 9.50s | x1 | +| fp16 | 3.61s | x2.63 | +| channels last | 3.30s | x2.88 | +| traced UNet | 3.21s | x2.96 | +| memory-efficient attention | 2.63s | x3.61 | + + +## Sliced VAE + +Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use. + +To use sliced VAE, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference: + +```python +import torch +from diffusers import StableDiffusionPipeline + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) +pipe = pipe.to("cuda") + +prompt = "a photo of an astronaut riding a horse on mars" +pipe.enable_vae_slicing() +images = pipe([prompt] * 32).images +``` + +You may see a small performance boost in VAE decoding on multi-image batches, and there should be no performance impact on single-image batches. + +## Tiled VAE + +Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. 
You should also use tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use.
+
+To use tiled VAE processing, call [`~StableDiffusionPipeline.enable_vae_tiling`] on your pipeline before inference:
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+pipe = pipe.to("cuda")
+prompt = "a beautiful landscape photograph"
+pipe.enable_vae_tiling()
+pipe.enable_xformers_memory_efficient_attention()
+
+image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0]
+```
+
+The output image has some tile-to-tile tone variation because the tiles are decoded separately, but you shouldn't see any sharp and obvious seams between the tiles. Tiling is turned off for images that are 512x512 or smaller.
+
+## CPU offloading
+
+Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can also save memory. Often, this technique can reduce memory consumption to less than 3GB.
+
+To perform CPU offloading, call [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_sequential_cpu_offload()
+image = pipe(prompt).images[0]
+```
+
+CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different UNet submodules are sequentially onloaded and offloaded as needed, resulting in a large number of memory transfers.
+
+Consider using [model offloading](#model-offloading) if you want to optimize for speed because it is much faster. The tradeoff is your memory savings won't be as large.
+
+CPU offloading can also be chained with attention slicing to reduce memory consumption to less than 2GB:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_sequential_cpu_offload()
+# compute attention in slices instead of all at once to save additional memory
+pipe.enable_attention_slicing(1)
+
+image = pipe(prompt).images[0]
+```
+
+When using [`~StableDiffusionPipeline.enable_sequential_cpu_offload`], don't move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal (see this [issue](https://github.com/huggingface/diffusers/issues/1934) for more information).
+
+[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.
+
+## Model offloading
+
+Model offloading requires πŸ€— Accelerate version 0.17.0 or higher.
+
+[Sequential CPU offloading](#cpu-offloading) preserves a lot of memory but it makes inference slower because submodules are moved to GPU as needed, and they're immediately returned to the CPU when a new module runs.
+
+Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent *submodules*. There is a negligible impact on inference time (compared with moving the pipeline to `cuda`), and it still provides some memory savings.
+
+During model offloading, only one of the main components of the pipeline (typically the text encoder, UNet and VAE)
+is placed on the GPU while the others wait on the CPU. Components like the UNet that run for multiple iterations stay on the GPU until they're no longer needed.
+
+Enable model offloading by calling [`~StableDiffusionPipeline.enable_model_cpu_offload`] on the pipeline:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_model_cpu_offload()
+image = pipe(prompt).images[0]
+```
+
+Model offloading can also be combined with attention slicing for additional memory savings:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_model_cpu_offload()
+# compute attention in slices instead of all at once to save additional memory
+pipe.enable_attention_slicing(1)
+
+image = pipe(prompt).images[0]
+```
+
+In order to properly offload models after they're called, the entire pipeline must be run and the models must be called in the pipeline's expected order. Exercise caution if models are reused outside the context of the pipeline after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module)
+for more information.
+
+[`~StableDiffusionPipeline.enable_model_cpu_offload`] is a stateful operation that installs hooks on the models and state on the pipeline.
+
+## Channels-last memory format
+
+The channels-last memory format is an alternative way of ordering NCHW tensors in memory to preserve dimension ordering. Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). Since not all operators currently support the channels-last format, it may result in worse performance, but you should still try it and see if it works for your model.
+
+For example, to set the pipeline's UNet to use the channels-last format:
+
+```python
+print(pipe.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
+pipe.unet.to(memory_format=torch.channels_last)  # in-place operation
+print(
+    pipe.unet.conv_out.state_dict()["weight"].stride()
+)  # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
+```
+
+## Tracing
+
+Tracing runs an example input tensor through the model and captures the operations that are performed on it as that input makes its way through the model's layers. The executable or `ScriptFunction` that is returned is optimized with just-in-time compilation.
+ +To trace a UNet: + +```python +import time +import torch +from diffusers import StableDiffusionPipeline +import functools + +# torch disable grad +torch.set_grad_enabled(False) + +# set variables +n_experiments = 2 +unet_runs_per_experiment = 50 + + +# load inputs +def generate_inputs(): + sample = torch.randn(2, 4, 64, 64).half().cuda() + timestep = torch.rand(1).half().cuda() * 999 + encoder_hidden_states = torch.randn(2, 77, 768).half().cuda() + return sample, timestep, encoder_hidden_states + + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +).to("cuda") +unet = pipe.unet +unet.eval() +unet.to(memory_format=torch.channels_last) # use channels_last memory format +unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default + +# warmup +for _ in range(3): + with torch.inference_mode(): + inputs = generate_inputs() + orig_output = unet(*inputs) + +# trace +print("tracing..") +unet_traced = torch.jit.trace(unet, inputs) +unet_traced.eval() +print("done tracing") + + +# warmup and optimize graph +for _ in range(5): + with torch.inference_mode(): + inputs = generate_inputs() + orig_output = unet_traced(*inputs) + + +# benchmarking +with torch.inference_mode(): + for _ in range(n_experiments): + torch.cuda.synchronize() + start_time = time.time() + for _ in range(unet_runs_per_experiment): + orig_output = unet_traced(*inputs) + torch.cuda.synchronize() + print(f"unet traced inference took {time.time() - start_time:.2f} seconds") + for _ in range(n_experiments): + torch.cuda.synchronize() + start_time = time.time() + for _ in range(unet_runs_per_experiment): + orig_output = unet(*inputs) + torch.cuda.synchronize() + print(f"unet inference took {time.time() - start_time:.2f} seconds") + +# save the model +unet_traced.save("unet_traced.pt") +``` + +Replace the `unet` attribute of the pipeline with the traced model: + +```python +from diffusers import StableDiffusionPipeline +import torch +from dataclasses import dataclass + + +@dataclass +class UNet2DConditionOutput: + sample: torch.FloatTensor + + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +).to("cuda") + +# use jitted unet +unet_traced = torch.jit.load("unet_traced.pt") + + +# del pipe.unet +class TracedUNet(torch.nn.Module): + def __init__(self): + super().__init__() + self.in_channels = pipe.unet.in_channels + self.device = pipe.unet.device + + def forward(self, latent_model_input, t, encoder_hidden_states): + sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0] + return UNet2DConditionOutput(sample=sample) + + +pipe.unet = TracedUNet() + +with torch.inference_mode(): + image = pipe([prompt] * 1, num_inference_steps=50).images[0] +``` + +## Memory-efficient attention + +Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/pdf/2205.14135.pdf) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)). 
+ +The table below details the speed-ups from a few different Nvidia GPUs when running inference on image sizes of 512x512 and a batch size of 1 (one prompt): + +| GPU | base attention (fp16) | memory-efficient attention (fp16) | +|------------------|-----------------------|-----------------------------------| +| NVIDIA Tesla T4 | 3.5it/s | 5.5it/s | +| NVIDIA 3060 RTX | 4.6it/s | 7.8it/s | +| NVIDIA A10G | 8.88it/s | 15.6it/s | +| NVIDIA RTX A6000 | 11.7it/s | 21.09it/s | +| NVIDIA TITAN RTX | 12.51it/s | 18.22it/s | +| A100-SXM4-40GB | 18.6it/s | 29.it/s | +| A100-SXM-80GB | 18.7it/s | 29.5it/s | + + + +If you have PyTorch 2.0 installed, you shouldn't use xFormers! + + + +To use Flash Attention, install the following: + +- PyTorch > 1.12 +- CUDA available +- [xFormers](xformers) + +Then call [`~ModelMixin.enable_xformers_memory_efficient_attention`] on the pipeline: + +```python +from diffusers import DiffusionPipeline +import torch + +pipe = DiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +).to("cuda") + +pipe.enable_xformers_memory_efficient_attention() + +with torch.inference_mode(): + sample = pipe("a small cat") + +# optional: You can disable it via +# pipe.disable_xformers_memory_efficient_attention() +``` diff --git a/docs/source/en/optimization/mps.md b/docs/source/en/optimization/mps.md index 3be8c621ee3e..138c85b51184 100644 --- a/docs/source/en/optimization/mps.md +++ b/docs/source/en/optimization/mps.md @@ -10,29 +10,16 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# How to use Stable Diffusion in Apple Silicon (M1/M2) +# Metal Performance Shaders (MPS) -πŸ€— Diffusers is compatible with Apple silicon for Stable Diffusion inference, using the PyTorch `mps` device. These are the steps you need to follow to use your M1 or M2 computer with Stable Diffusion. +πŸ€— Diffusers is compatible with Apple silicon (M1/M2 chips) using the PyTorch [`mps`](https://pytorch.org/docs/stable/notes/mps.html) device, which uses the Metal framework to leverage the GPU on MacOS devices. You'll need to have: -## Requirements +- macOS computer with Apple silicon (M1/M2) hardware +- macOS 12.6 or later (13.0 or later recommended) +- arm64 version of Python +- [PyTorch 2.0](https://pytorch.org/get-started/locally/) (recommended) or 1.13 (minimum version supported for `mps`) -- Mac computer with Apple silicon (M1/M2) hardware. -- macOS 12.6 or later (13.0 or later recommended). -- arm64 version of Python. -- PyTorch 2.0 (recommended) or 1.13 (minimum version supported for `mps`). You can install it with `pip` or `conda` using the instructions in https://pytorch.org/get-started/locally/. - - -## Inference Pipeline - -The snippet below demonstrates how to use the `mps` backend using the familiar `to()` interface to move the Stable Diffusion pipeline to your M1 or M2 device. - - - -**If you are using PyTorch 1.13** you need to "prime" the pipeline using an additional one-time pass through it. This is a temporary workaround for a weird issue we detected: the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and it's ok to use just one inference step and discard the result. - - - -We strongly recommend you use PyTorch 2 or better, as it solves a number of problems like the one described in the previous tip. 
+The `mps` backend uses PyTorch's `.to()` interface to move the Stable Diffusion pipeline on to your M1 or M2 device: ```python from diffusers import DiffusionPipeline @@ -44,24 +31,41 @@ pipe = pipe.to("mps") pipe.enable_attention_slicing() prompt = "a photo of an astronaut riding a horse on mars" +``` + + + +Generating multiple prompts in a batch can [crash](https://github.com/huggingface/diffusers/issues/363) or fail to work reliably. We believe this is related to the [`mps`](https://github.com/pytorch/pytorch/issues/84039) backend in PyTorch. While this is being investigated, you should iterate instead of batching. + + + +If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an additional one-time pass through it. This is a temporary workaround for an issue where the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and after just one inference step you can discard the result. + +```diff + from diffusers import DiffusionPipeline -# First-time "warmup" pass if PyTorch version is 1.13 (see explanation above) -_ = pipe(prompt, num_inference_steps=1) + pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("mps") + pipe.enable_attention_slicing() + + prompt = "a photo of an astronaut riding a horse on mars" +# First-time "warmup" pass if PyTorch version is 1.13 ++ _ = pipe(prompt, num_inference_steps=1) # Results match those from the CPU device after the warmup pass. -image = pipe(prompt).images[0] + image = pipe(prompt).images[0] ``` -## Performance Recommendations +## Troubleshoot -M1/M2 performance is very sensitive to memory pressure. The system will automatically swap if it needs to, but performance will degrade significantly when it does. +M1/M2 performance is very sensitive to memory pressure. When this occurs, the system automatically swaps if it needs to which significantly degrades performance. -We recommend you use _attention slicing_ to reduce memory pressure during inference and prevent swapping, particularly if your computer has less than 64 GB of system RAM, or if you generate images at non-standard resolutions larger than 512 Γ— 512 pixels. Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually has a performance impact of ~20% in computers without universal memory, but we have observed _better performance_ in most Apple Silicon computers, unless you have 64 GB or more. +To prevent this from happening, we recommend *attention slicing* to reduce memory pressure during inference and prevent swapping. This is especially relevant if your computer has less than 64GB of system RAM, or if you generate images at non-standard resolutions larger than 512Γ—512 pixels. Call the [`~DiffusionPipeline.enable_attention_slicing`] function on your pipeline: -```python +```py +from diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("mps") pipeline.enable_attention_slicing() ``` -## Known Issues - -- Generating multiple prompts in a batch [crashes or doesn't work reliably](https://github.com/huggingface/diffusers/issues/363). We believe this is related to the [`mps` backend in PyTorch](https://github.com/pytorch/pytorch/issues/84039). This is being resolved, but for now we recommend to iterate instead of batching. 
+Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually improves performance by ~20% in computers without universal memory, but we've observed *better performance* in most Apple silicon computers unless you have 64GB of RAM or more. diff --git a/docs/source/en/optimization/onnx.md b/docs/source/en/optimization/onnx.md index 2e39fabd1e27..20104b555543 100644 --- a/docs/source/en/optimization/onnx.md +++ b/docs/source/en/optimization/onnx.md @@ -11,23 +11,19 @@ specific language governing permissions and limitations under the License. --> -# How to use ONNX Runtime for inference +# ONNX Runtime -πŸ€— [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. +πŸ€— [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install πŸ€— Optimum with the following command for ONNX Runtime support: -## Installation - -Install πŸ€— Optimum with the following command for ONNX Runtime support: - -``` +```bash pip install optimum["onnxruntime"] ``` -## Stable Diffusion +This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime. -### Inference +## Stable Diffusion -To load an ONNX model and run inference with ONNX Runtime, you need to replace [`StableDiffusionPipeline`] with `ORTStableDiffusionPipeline`. In case you want to load a PyTorch model and convert it to the ONNX format on-the-fly, you can set `export=True`. +To load and run inference, use the [`~optimum.onnxruntime.ORTStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`: ```python from optimum.onnxruntime import ORTStableDiffusionPipeline @@ -39,14 +35,20 @@ image = pipeline(prompt).images[0] pipeline.save_pretrained("./onnx-stable-diffusion-v1-5") ``` -If you want to export the pipeline in the ONNX format offline and later use it for inference, -you can use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command: + + +Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching. + + + +To export the pipeline in the ONNX format offline and use it later for inference, +use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command: ```bash optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/ ``` -Then perform inference: +Then to perform inference (you don't have to specify `export=True` again): ```python from optimum.onnxruntime import ORTStableDiffusionPipeline @@ -57,36 +59,15 @@ prompt = "sailing ship in storm by Leonardo da Vinci" image = pipeline(prompt).images[0] ``` -Notice that we didn't have to specify `export=True` above. -
-You can find more examples in [optimum documentation](https://huggingface.co/docs/optimum/). - - -### Supported tasks - -| Task | Loading Class | -|--------------------------------------|--------------------------------------| -| `text-to-image` | `ORTStableDiffusionPipeline` | -| `image-to-image` | `ORTStableDiffusionImg2ImgPipeline` | -| `inpaint` | `ORTStableDiffusionInpaintPipeline` | +You can find more examples in πŸ€— Optimum [documentation](https://huggingface.co/docs/optimum/), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. ## Stable Diffusion XL -### Export - -To export your model to ONNX, you can use the [Optimum CLI](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) as follows : - -```bash -optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/ -``` - -### Inference - -Here is an example of how you can load a SDXL ONNX model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference with ONNX Runtime : +To load and run inference with SDXL, use the [`~optimum.onnxruntime.ORTStableDiffusionXLPipeline`]: ```python from optimum.onnxruntime import ORTStableDiffusionXLPipeline @@ -97,13 +78,10 @@ prompt = "sailing ship in storm by Leonardo da Vinci" image = pipeline(prompt).images[0] ``` -### Supported tasks - -| Task | Loading Class | -|--------------------------------------|--------------------------------------| -| `text-to-image` | `ORTStableDiffusionXLPipeline` | -| `image-to-image` | `ORTStableDiffusionXLImg2ImgPipeline`| +To export the pipeline in the ONNX format and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command: -## Known Issues +```bash +optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/ +``` -- Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching. +SDXL in the ONNX format is supported for text-to-image and image-to-image. diff --git a/docs/source/en/optimization/open_vino.md b/docs/source/en/optimization/open_vino.md index a820fb62f949..606c2207bcda 100644 --- a/docs/source/en/optimization/open_vino.md +++ b/docs/source/en/optimization/open_vino.md @@ -11,26 +11,21 @@ specific language governing permissions and limitations under the License. --> -# How to use OpenVINO for inference +# OpenVINO -πŸ€— [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO. You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). +πŸ€— [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list]((https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html)) of supported devices). 
-## Installation - -Install πŸ€— Optimum Intel with the following command: +You'll need to install πŸ€— Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version: ``` pip install --upgrade-strategy eager optimum["openvino"] ``` -The `--upgrade-strategy eager` option is needed to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is upgraded to its latest version. - +This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with OpenVINO. ## Stable Diffusion -### Inference - -To load an OpenVINO model and run inference with OpenVINO Runtime, you need to replace `StableDiffusionPipeline` with `OVStableDiffusionPipeline`. In case you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, you can set `export=True`. +To load and run inference, use the [`~optimum.intel.OVStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, set `export=True`: ```python from optimum.intel import OVStableDiffusionPipeline @@ -44,7 +39,7 @@ image = pipeline(prompt).images[0] pipeline.save_pretrained("openvino-sd-v1-5") ``` -To further speed up inference, the model can be statically reshaped : +To further speed-up inference, statically reshape the model. If you change any parameters such as the outputs height or width, you’ll need to statically reshape your model again. ```python # Define the shapes related to the inputs and desired outputs @@ -62,30 +57,15 @@ image = pipeline( num_images_per_prompt=num_images, ).images[0] ``` - -In case you want to change any parameters such as the outputs height or width, you’ll need to statically reshape your model once again. -
- -### Supported tasks - -| Task | Loading Class | -|--------------------------------------|--------------------------------------| -| `text-to-image` | `OVStableDiffusionPipeline` | -| `image-to-image` | `OVStableDiffusionImg2ImgPipeline` | -| `inpaint` | `OVStableDiffusionInpaintPipeline` | - -You can find more examples in the optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion). - +You can find more examples in the πŸ€— Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. ## Stable Diffusion XL -### Inference - -Here is an example of how you can load a SDXL OpenVINO model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference with OpenVINO Runtime : +To load and run inference with SDXL, use the [`~optimum.intel.OVStableDiffusionXLPipeline`]: ```python from optimum.intel import OVStableDiffusionXLPipeline @@ -96,15 +76,6 @@ prompt = "sailing ship in storm by Rembrandt" image = pipeline(prompt).images[0] ``` -To further speed up inference, the model can be statically reshaped as showed above. -You can find more examples in the optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl). - -### Supported tasks - -| Task | Loading Class | -|--------------------------------------|--------------------------------------| -| `text-to-image` | `OVStableDiffusionXLPipeline` | -| `image-to-image` | `OVStableDiffusionXLImg2ImgPipeline` | - - +To further speed-up inference, [statically reshape](#stable-diffusion) the model as shown in the Stable Diffusion section. +You can find more examples in the πŸ€— Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl), and running SDXL in OpenVINO is supported for text-to-image and image-to-image. diff --git a/docs/source/en/optimization/opt_overview.md b/docs/source/en/optimization/opt_overview.md index 8d8386f85f43..1f809bb011ce 100644 --- a/docs/source/en/optimization/opt_overview.md +++ b/docs/source/en/optimization/opt_overview.md @@ -12,6 +12,6 @@ specific language governing permissions and limitations under the License. # Overview -Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of 🧨 Diffuser's goal is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware. +Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of πŸ€— Diffuser's goal is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware. -This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory-consumption. You can also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors. 
\ No newline at end of file +This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory-consumption. You'll also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors. \ No newline at end of file diff --git a/docs/source/en/optimization/tome.md b/docs/source/en/optimization/tome.md index c2158f539a65..66d69c6900cc 100644 --- a/docs/source/en/optimization/tome.md +++ b/docs/source/en/optimization/tome.md @@ -10,35 +10,39 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Token Merging +# Token merging -Token Merging (introduced in [Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461)) works by merging the redundant tokens / patches progressively in the forward pass of a Transformer-based network. It can speed up the inference latency of the underlying network. +[Token merging](https://huggingface.co/papers/2303.17604) (ToMe) merges redundant tokens/patches progressively in the forward pass of a Transformer-based network which can speed-up the inference latency of [`StableDiffusionPipeline`]. -After Token Merging (ToMe) was released, the authors released [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604), which introduced a version of ToMe which is more compatible with Stable Diffusion. We can use ToMe to gracefully speed up the inference latency of a [`DiffusionPipeline`]. This doc discusses how to apply ToMe to the [`StableDiffusionPipeline`], the expected speedups, and the qualitative aspects of using ToMe on the [`StableDiffusionPipeline`]. - -## Using ToMe - -The authors of ToMe released a convenient Python library called [`tomesd`](https://github.com/dbolya/tomesd) that lets us apply ToMe to a [`DiffusionPipeline`] like so: +You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function: ```diff from diffusers import StableDiffusionPipeline import tomesd pipeline = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16 + "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, ).to("cuda") + tomesd.apply_patch(pipeline, ratio=0.5) image = pipeline("a photo of an astronaut riding a horse on mars").images[0] ``` -And that’s it! +The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass. + +As reported in the [paper](https://huggingface.co/papers/2303.17604), ToMe can greatly preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed-up inference even further, but at the cost of some degraded image quality. 
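+
+For example, here is a minimal sketch of how you might increase `ratio` to trade some image quality for additional speed (the `0.6` value below is only illustrative, not a benchmarked recommendation):
+
+```python
+import torch
+import tomesd
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
+).to("cuda")
+
+# merge a larger fraction of tokens for a bigger speed-up, at some cost in image quality
+tomesd.apply_patch(pipeline, ratio=0.6)
+
+image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
+```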
+ +To test the quality of the generated images, we sampled a few prompts from [Parti Prompts](https://parti.research.google/) and performed inference with the [`StableDiffusionPipeline`] with the following settings: + +
+ +
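+
+If you want to run a similar check yourself, one way to sample a handful of Parti Prompts is with the πŸ€— Datasets library; the `nateraw/parti-prompts` dataset name and its `Prompt` column below are assumptions based on the mirror commonly used on the Hub:
+
+```py
+from datasets import load_dataset
+
+# sample a few prompts to compare generations with and without ToMe
+parti_prompts = load_dataset("nateraw/parti-prompts", split="train")
+sample_prompts = parti_prompts.shuffle(seed=0)["Prompt"][:4]
+print(sample_prompts)
+```
+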
-`tomesd.apply_patch()` exposes [a number of arguments](https://github.com/dbolya/tomesd#usage) to let us strike a balance between the pipeline inference speed and the quality of the generated tokens. Amongst those arguments, the most important one is `ratio`. `ratio` controls the number of tokens that will be merged during the forward pass. For more details on `tomesd`, please refer to the original repository https://github.com/dbolya/tomesd and [the paper](https://arxiv.org/abs/2303.17604). +We didn’t notice any significant decrease in the quality of the generated samples, and you can check out the generated samples in this [WandB report](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). If you're interested in reproducing this experiment, use this [script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd). -## Benchmarking `tomesd` with `StableDiffusionPipeline` +## Benchmarks -We benchmarked the impact of using `tomesd` on [`StableDiffusionPipeline`] along with [xformers](https://huggingface.co/docs/diffusers/optimization/xformers) across different image resolutions. We used A100 and V100 as our test GPU devices with the following development environment (with Python 3.8.5): +We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with [xFormers](https://huggingface.co/docs/diffusers/optimization/xformers) enabled across several image resolutions. The results are obtained from A100 and V100 GPUs in the following development environment: ```bash - `diffusers` version: 0.15.1 @@ -51,66 +55,35 @@ We benchmarked the impact of using `tomesd` on [`StableDiffusionPipeline`] along - tomesd version: 0.1.2 ``` -We used this script for benchmarking: [https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). Following are our findings: - -### A100 - -| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe speedup (%) | ToMe + xFormers speedup (%) | -| --- | --- | --- | --- | --- | --- | --- | -| 512 | 10 | 6.88 | 5.26 | 4.69 | 23.54651163 | 31.83139535 | -| | | | | | | | -| 768 | 10 | OOM | 14.71 | 11 | | | -| | 8 | OOM | 11.56 | 8.84 | | | -| | 4 | OOM | 5.98 | 4.66 | | | -| | 2 | 4.99 | 3.24 | 3.1 | 35.07014028 | 37.8757515 | -| | 1 | 3.29 | 2.24 | 2.03 | 31.91489362 | 38.29787234 | -| | | | | | | | -| 1024 | 10 | OOM | OOM | OOM | | | -| | 8 | OOM | OOM | OOM | | | -| | 4 | OOM | 12.51 | 9.09 | | | -| | 2 | OOM | 6.52 | 4.96 | | | -| | 1 | 6.4 | 3.61 | 2.81 | 43.59375 | 56.09375 | - -***The timings reported here are in seconds. 
Speedups are calculated over the `Vanilla` timings.*** - -### V100 - -| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe speedup (%) | ToMe + xFormers speedup (%) | -| --- | --- | --- | --- | --- | --- | --- | -| 512 | 10 | OOM | 10.03 | 9.29 | | | -| | 8 | OOM | 8.05 | 7.47 | | | -| | 4 | 5.7 | 4.3 | 3.98 | 24.56140351 | 30.1754386 | -| | 2 | 3.14 | 2.43 | 2.27 | 22.61146497 | 27.70700637 | -| | 1 | 1.88 | 1.57 | 1.57 | 16.4893617 | 16.4893617 | -| | | | | | | | -| 768 | 10 | OOM | OOM | 23.67 | | | -| | 8 | OOM | OOM | 18.81 | | | -| | 4 | OOM | 11.81 | 9.7 | | | -| | 2 | OOM | 6.27 | 5.2 | | | -| | 1 | 5.43 | 3.38 | 2.82 | 37.75322284 | 48.06629834 | -| | | | | | | | -| 1024 | 10 | OOM | OOM | OOM | | | -| | 8 | OOM | OOM | OOM | | | -| | 4 | OOM | OOM | 19.35 | | | -| | 2 | OOM | 13 | 10.78 | | | -| | 1 | OOM | 6.66 | 5.54 | | | - -As seen in the tables above, the speedup with `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it becomes possible to run the pipeline on a higher resolution, like 1024x1024. - -It might be possible to speed up inference even further with [`torch.compile()`](https://huggingface.co/docs/diffusers/optimization/torch2.0). - -## Quality - -As reported in [the paper](https://arxiv.org/abs/2303.17604), ToMe can preserve the quality of the generated images to a great extent while speeding up inference. By increasing the `ratio`, it is possible to further speed up inference, but that might come at the cost of a deterioration in the image quality. - -To test the quality of the generated samples using our setup, we sampled a few prompts from the β€œParti Prompts” (introduced in [Parti](https://parti.research.google/)) and performed inference with the [`StableDiffusionPipeline`] in the following settings: - -- Vanilla [`StableDiffusionPipeline`] -- [`StableDiffusionPipeline`] + ToMe -- [`StableDiffusionPipeline`] + ToMe + xformers - -We didn’t notice any significant decrease in the quality of the generated samples. Here are samples: - -![tome-samples](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/tome/tome_samples.png) - -You can check out the generated samples [here](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). We used [this script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd) for conducting this experiment. \ No newline at end of file +To reproduce this benchmark, feel free to use this [script](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers. 
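+
+The speed-up percentages are measured relative to the vanilla timing; for example, the first A100 row below works out to:
+
+```py
+# timings in seconds for A100, 512x512, batch size 10
+vanilla, tome = 6.88, 5.26
+speedup = (vanilla - tome) / vanilla * 100
+print(f"{speedup:.2f}%")  # 23.55%
+```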
+
+| **GPU** | **Resolution** | **Batch size** | **Vanilla** | **ToMe** | **ToMe + xFormers** |
+|----------|----------------|----------------|-------------|----------------|---------------------|
+| **A100** | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) |
+| | 768 | 10 | OOM | 14.71 | 11 |
+| | | 8 | OOM | 11.56 | 8.84 |
+| | | 4 | OOM | 5.98 | 4.66 |
+| | | 2 | 4.99 | 3.24 (+35.07%) | 3.1 (+37.88%) |
+| | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) |
+| | 1024 | 10 | OOM | OOM | OOM |
+| | | 8 | OOM | OOM | OOM |
+| | | 4 | OOM | 12.51 | 9.09 |
+| | | 2 | OOM | 6.52 | 4.96 |
+| | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) |
+| **V100** | 512 | 10 | OOM | 10.03 | 9.29 |
+| | | 8 | OOM | 8.05 | 7.47 |
+| | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) |
+| | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) |
+| | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) |
+| | 768 | 10 | OOM | OOM | 23.67 |
+| | | 8 | OOM | OOM | 18.81 |
+| | | 4 | OOM | 11.81 | 9.7 |
+| | | 2 | OOM | 6.27 | 5.2 |
+| | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) |
+| | 1024 | 10 | OOM | OOM | OOM |
+| | | 8 | OOM | OOM | OOM |
+| | | 4 | OOM | OOM | 19.35 |
+| | | 2 | OOM | 13 | 10.78 |
+| | | 1 | OOM | 6.66 | 5.54 |
+
+As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed up inference even more with [`torch.compile`](torch2.0).
diff --git a/docs/source/en/optimization/torch2.0.md b/docs/source/en/optimization/torch2.0.md
index e1de432f6590..4984f385a3f5 100644
--- a/docs/source/en/optimization/torch2.0.md
+++ b/docs/source/en/optimization/torch2.0.md
@@ -10,96 +10,83 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Accelerated PyTorch 2.0 support in Diffusers
+# Torch 2.0

-Starting from version `0.13.0`, Diffusers supports the latest optimization from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). These include:
-1. Support for accelerated transformers implementation with memory-efficient attention – no extra dependencies (such as `xformers`) required.
-2. [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) support for extra performance boost when individual models are compiled.
+πŸ€— Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include:
+1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers.
+2. [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled.

-## Installation
-
-To benefit from the accelerated attention implementation and `torch.compile()`, you just need to install the latest versions of PyTorch 2.0 from pip, and make sure you are on diffusers 0.13.0 or later. As explained below, diffusers automatically uses the optimized attention processor ([`AttnProcessor2_0`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L798)) (but not `torch.compile()`)
-when PyTorch 2.0 is available.
+Both of these optimizations require PyTorch 2.0 or later and πŸ€— Diffusers v0.13.0 or later.
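+
+If you aren't sure which versions you already have installed, you can quickly check them from Python:
+
+```py
+import torch
+import diffusers
+
+# SDPA and torch.compile both need PyTorch 2.0 or later
+print(torch.__version__)
+print(diffusers.__version__)
+```
+
+If either package is older than required, upgrade them both:
+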
```bash pip install --upgrade torch diffusers ``` -## Using accelerated transformers and `torch.compile`. - - -1. **Accelerated Transformers implementation** - - PyTorch 2.0 includes an optimized and memory-efficient attention implementation through the [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) function, which automatically enables several optimizations depending on the inputs and the GPU type. This is similar to the `memory_efficient_attention` from [xFormers](https://github.com/facebookresearch/xformers), but built natively into PyTorch. +## Scaled dot product attention - These optimizations will be enabled by default in Diffusers if PyTorch 2.0 is installed and if `torch.nn.functional.scaled_dot_product_attention` is available. To use it, just install `torch 2.0` as suggested above and simply use the pipeline. For example: +[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of πŸ€— Diffusers, so you don't need to add anything to your code. - ```Python - import torch - from diffusers import DiffusionPipeline +However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]: - pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True) - pipe = pipe.to("cuda") +```diff + import torch + from diffusers import DiffusionPipeline ++ from diffusers.models.attention_processor import AttnProcessor2_0 - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] - ``` + pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") ++ pipe.unet.set_attn_processor(AttnProcessor2_0()) - If you want to enable it explicitly (which is not required), you can do so as shown below. - - ```diff - import torch - from diffusers import DiffusionPipeline - + from diffusers.models.attention_processor import AttnProcessor2_0 + prompt = "a photo of an astronaut riding a horse on mars" + image = pipe(prompt).images[0] +``` - pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") - + pipe.unet.set_attn_processor(AttnProcessor2_0()) +SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details. - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] - ``` +In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline: - This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark). 
+```diff
+  import torch
+  from diffusers import DiffusionPipeline
+  from diffusers.models.attention_processor import AttnProcessor

-    It is possible to revert to the vanilla attention processor ([`AttnProcessor`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L402)), which can be helpful to make the pipeline more deterministic, or if you need to convert a fine-tuned model to other formats such as [Core ML](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/coreml#how-to-run-stable-diffusion-with-core-ml). To use the normal attention processor you can use the [`~diffusers.UNet2DConditionModel.set_default_attn_processor`] function:
+  pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
++ pipe.unet.set_default_attn_processor()

-    ```Python
-    import torch
-    from diffusers import DiffusionPipeline
-    from diffusers.models.attention_processor import AttnProcessor
+  prompt = "a photo of an astronaut riding a horse on mars"
+  image = pipe(prompt).images[0]
+```

-    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
-    pipe.unet.set_default_attn_processor()
+## torch.compile

-    prompt = "a photo of an astronaut riding a horse on mars"
-    image = pipe(prompt).images[0]
-    ```
+The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In πŸ€— Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline.

-2. **torch.compile**
+```python
+from diffusers import DiffusionPipeline
+import torch

-    To get an additional speedup, we can use the new `torch.compile` feature. Since the UNet of the pipeline is usually the most computationally expensive, we wrap the `unet` with `torch.compile` leaving rest of the sub-models (text encoder and VAE) as is. For more information and different options, refer to the
-    [torch compile docs](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
+pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+image = pipe(prompt).images[0]
+```

-    ```python
-    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-    images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
-    ```
+Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300%** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs.

-    Depending on the type of GPU, `compile()` can yield between **5% - 300%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
-
-    Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. 
Calling the compiled pipeline on a different image size will re-trigger compilation which can be expensive. +Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive. +For more information and different options about `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial. ## Benchmark -We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. We used `diffusers 0.17.0.dev0`, which [makes sure `torch.compile()` is leveraged optimally](https://github.com/huggingface/diffusers/pull/3313). +We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code is benchmarked on πŸ€— Diffusers v0.17.0.dev0 to optimize `torch.compile` usage (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details). -### Benchmarking code +Expand the dropdown below to find the code used to benchmark each pipeline: -#### Stable Diffusion text-to-image +
-```python +### Stable Diffusion text-to-image + +```python from diffusers import DiffusionPipeline import torch @@ -121,7 +108,7 @@ for _ in range(3): images = pipe(prompt=prompt).images ``` -#### Stable Diffusion image-to-image +### Stable Diffusion image-to-image ```python from diffusers import StableDiffusionImg2ImgPipeline @@ -154,7 +141,7 @@ for _ in range(3): image = pipe(prompt=prompt, image=init_image).images[0] ``` -#### Stable Diffusion - inpainting +### Stable Diffusion inpainting ```python from diffusers import StableDiffusionInpaintPipeline @@ -194,7 +181,7 @@ for _ in range(3): image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0] ``` -#### ControlNet +### ControlNet ```python from diffusers import StableDiffusionControlNetPipeline, ControlNetModel @@ -232,7 +219,7 @@ for _ in range(3): image = pipe(prompt=prompt, image=init_image).images[0] ``` -#### IF text-to-image + upscaling +### DeepFloyd IF text-to-image + upscaling ```python from diffusers import DiffusionPipeline @@ -267,24 +254,18 @@ for _ in range(3): image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images ``` +
-To give you a pictorial overview of the possible speed-ups that can be obtained with PyTorch 2.0 and `torch.compile()`, -here is a plot that shows relative speed-ups for the [Stable Diffusion text-to-image pipeline](StableDiffusionPipeline) across five -different GPU families (with a batch size of 4): +The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*. ![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png) -To give you an even better idea of how this speed-up holds for the other pipelines presented above, consider the following -plot that shows the benchmarking numbers from an A100 across three different batch sizes -(with PyTorch 2.0 nightly and `torch.compile()`): +To give you an even better idea of how this speed-up holds for the other pipelines, consider the following +graph for an A100 with PyTorch 2.0 and `torch.compile`: ![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png) -_(Our benchmarking metric for the plots above is **number of iterations/second**)_ - -But we reveal all the benchmarking numbers in the interest of transparency! - -In the following tables, we report our findings in terms of the number of **_iterations processed per second_**. +In the following tables, we report our findings in terms of the *number of iterations/second*. ### A100 (batch size: 1) @@ -438,7 +419,7 @@ In the following tables, we report our findings in terms of the number of **_ite ## Notes -* Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks. -* For the IF pipeline and batch sizes > 1, we only used a batch size of >1 in the first IF pipeline for text-to-image generation and NOT for upscaling. So, that means the two upscaling pipelines received a batch size of 1. +* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks. +* For the DeepFloyd IF pipeline where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1. *Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.* \ No newline at end of file diff --git a/docs/source/en/optimization/xformers.md b/docs/source/en/optimization/xformers.md index ede074a59fa9..e5aa4d106ad2 100644 --- a/docs/source/en/optimization/xformers.md +++ b/docs/source/en/optimization/xformers.md @@ -10,11 +10,11 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Installing xFormers +# xFormers -We recommend the use of [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption. +We recommend [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. 
In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption. -Starting from version `0.0.16` of xFormers, released on January 2023, installation can be easily performed using pre-built pip wheels: +Install xFormers from `pip`: ```bash pip install xformers @@ -22,14 +22,14 @@ pip install xformers -The xFormers PIP package requires the latest version of PyTorch (1.13.1 as of xFormers 0.0.16). If you need to use a previous version of PyTorch, then we recommend you install xFormers from source using [the project instructions](https://github.com/facebookresearch/xformers#installing-xformers). +The xFormers `pip` package requires the latest version of PyTorch. If you need to use a previous version of PyTorch, then we recommend [installing xFormers from the source](https://github.com/facebookresearch/xformers#installing-xformers). -After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption, as discussed [here](fp16#memory-efficient-attention). +After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption as shown in this [section](memory#memory-efficient-attention). -According to [this issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or Dreambooth) in some GPUs. If you observe that problem, please install a development version as indicated in that comment. +According to this [issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or DreamBooth) in some GPUs. If you observe this problem, please install a development version as indicated in the issue comments. diff --git a/docs/source/en/using-diffusers/stable_diffusion_jax_how_to.md b/docs/source/en/using-diffusers/stable_diffusion_jax_how_to.md index 91d49a5f6748..d62ce0bf91bf 100644 --- a/docs/source/en/using-diffusers/stable_diffusion_jax_how_to.md +++ b/docs/source/en/using-diffusers/stable_diffusion_jax_how_to.md @@ -1,51 +1,41 @@ -# 🧨 Stable Diffusion in JAX / Flax ! +# JAX/Flax [[open-in-colab]] -πŸ€— Hugging Face [Diffusers](https://github.com/huggingface/diffusers) supports Flax since version `0.5.1`! This allows for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform. +πŸ€— Diffusers supports Flax for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform. This guide shows you how to run inference with Stable Diffusion using JAX/Flax. -This notebook shows how to run inference using JAX / Flax. If you want more details about how Stable Diffusion works or want to run it in GPU, please refer to [this notebook](https://huggingface.co/docs/diffusers/stable_diffusion). - -First, make sure you are using a TPU backend. If you are running this notebook in Colab, select `Runtime` in the menu above, then select the option "Change runtime type" and then select `TPU` under the `Hardware accelerator` setting. - -Note that JAX is not exclusive to TPUs, but it shines on that hardware because each TPU server has 8 TPU accelerators working in parallel. - -## Setup - -First make sure diffusers is installed. 
+Before you begin, make sure you have the necessary libraries installed:

 ```py
 # uncomment to install the necessary libraries in Colab
-#!pip install jax==0.3.25 jaxlib==0.3.25 flax transformers ftfy
-#!pip install diffusers
+#!pip install -q jax==0.3.25 jaxlib==0.3.25 flax transformers ftfy
+#!pip install -q diffusers
 ```

+You should also make sure you're using a TPU backend. While JAX does not run exclusively on TPUs, you'll get the best performance on a TPU because each server has 8 TPU accelerators working in parallel.
+
+If you are running this guide in Colab, select *Runtime* in the menu above, select the option *Change runtime type*, and then select *TPU* under the *Hardware accelerator* setting. Import JAX and quickly check whether you're using a TPU:
+
 ```python
+import jax
 import jax.tools.colab_tpu
-
 jax.tools.colab_tpu.setup_tpu()

-import jax
-```

-```python
 num_devices = jax.device_count()
 device_type = jax.devices()[0].device_kind

 print(f"Found {num_devices} JAX devices of type {device_type}.")
 assert (
     "TPU" in device_type
 ), "Available device is not a TPU, please select TPU from Edit > Notebook settings > Hardware accelerator"
-```
-
-```python out
-Found 8 JAX devices of type Cloud TPU.
+"Found 8 JAX devices of type Cloud TPU."
 ```

-Then we import all the dependencies.
+Great, now you can import the rest of the dependencies you'll need:

 ```python
 import numpy as np
-import jax
 import jax.numpy as jnp

 from pathlib import Path
@@ -58,17 +48,12 @@ from huggingface_hub import notebook_login
 from diffusers import FlaxStableDiffusionPipeline
 ```

-## Model Loading
+## Load a model

-TPU devices support `bfloat16`, an efficient half-float type. We'll use it for our tests, but you can also use `float32` to use full precision instead.
+Flax is a functional framework, so models are stateless and parameters are stored outside of them. Loading a pretrained Flax pipeline returns *both* the pipeline and the model weights (or parameters). In this guide, you'll use `bfloat16`, a more efficient half-float type that is supported by TPUs (you can also use `float32` for full precision if you want).

 ```python
 dtype = jnp.bfloat16
-```
-
-Flax is a functional framework, so models are stateless and parameters are stored outside them. Loading the pre-trained Flax pipeline will return both the pipeline itself and the model weights (or parameters). We are using a `bf16` version of the weights, which leads to type warnings that you can safely ignore.
-
-```python
 pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
     "CompVis/stable-diffusion-v1-4",
     revision="bf16",
@@ -78,95 +63,87 @@ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(

 ## Inference

-Since TPUs usually have 8 devices working in parallel, we'll replicate our prompt as many times as devices we have. Then we'll perform inference on the 8 devices at once, each responsible for generating one image. Thus, we'll get 8 images in the same amount of time it takes for one chip to generate a single one.
+TPUs usually have 8 devices working in parallel, so let's use the same prompt for each device. This means you can perform inference on 8 devices at once, with each device generating one image. As a result, you'll get 8 images in the same amount of time it takes for one chip to generate a single image! 
+ + + +Learn more details in the [How does parallelization work?](#how-does-parallelization-work) section. -After replicating the prompt, we obtain the tokenized text ids by invoking the `prepare_inputs` function of the pipeline. The length of the tokenized text is set to 77 tokens, as required by the configuration of the underlying CLIP Text model. + + +After replicating the prompt, get the tokenized text ids by calling the `prepare_inputs` function on the pipeline. The length of the tokenized text is set to 77 tokens as required by the configuration of the underlying CLIP text model. ```python prompt = "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of field, close up, split lighting, cinematic" prompt = [prompt] * jax.device_count() prompt_ids = pipeline.prepare_inputs(prompt) prompt_ids.shape +"(8, 77)" ``` -```python out -(8, 77) -``` - -### Replication and parallelization - -Model parameters and inputs have to be replicated across the 8 parallel devices we have. The parameters dictionary is replicated using `flax.jax_utils.replicate`, which traverses the dictionary and changes the shape of the weights so they are repeated 8 times. Arrays are replicated using `shard`. +Model parameters and inputs have to be replicated across the 8 parallel devices. The parameters dictionary is replicated with [`flax.jax_utils.replicate`](https://flax.readthedocs.io/en/latest/api_reference/flax.jax_utils.html#flax.jax_utils.replicate) which traverses the dictionary and changes the shape of the weights so they are repeated 8 times. Arrays are replicated using `shard`. ```python +# parameters p_params = replicate(params) -``` -```python +# arrays prompt_ids = shard(prompt_ids) prompt_ids.shape +"(8, 1, 77)" ``` -```python out -(8, 1, 77) -``` - -That shape means that each one of the `8` devices will receive as an input a `jnp` array with shape `(1, 77)`. `1` is therefore the batch size per device. In TPUs with sufficient memory, it could be larger than `1` if we wanted to generate multiple images (per chip) at once. +This shape means each one of the 8 devices receives as an input a `jnp` array with shape `(1, 77)`, where `1` is the batch size per device. On TPUs with sufficient memory, you could have a batch size larger than `1` if you want to generate multiple images (per chip) at once. -We are almost ready to generate images! We just need to create a random number generator to pass to the generation function. This is the standard procedure in Flax, which is very serious and opinionated about random numbers – all functions that deal with random numbers are expected to receive a generator. This ensures reproducibility, even when we are training across multiple distributed devices. +Next, create a random number generator to pass to the generation function. This is standard procedure in Flax, which is very serious and opinionated about random numbers. All functions that deal with random numbers are expected to receive a generator to ensure reproducibility, even when you're training across multiple distributed devices. -The helper function below uses a seed to initialize a random number generator. As long as we use the same seed, we'll get the exact same results. Feel free to use different seeds when exploring results later in the notebook. +The helper function below uses a seed to initialize a random number generator. As long as you use the same seed, you'll get the exact same results. 
Feel free to use different seeds when exploring results later in the guide.

 ```python
 def create_key(seed=0):
     return jax.random.PRNGKey(seed)
 ```

-We obtain a rng and then "split" it 8 times so each device receives a different generator. Therefore, each device will create a different image, and the full process is reproducible.
+The `rng` created by the helper function is split 8 times so each device receives a different generator and generates a different image.

 ```python
 rng = create_key(0)
 rng = jax.random.split(rng, jax.device_count())
 ```

-JAX code can be compiled to an efficient representation that runs very fast. However, we need to ensure that all inputs have the same shape in subsequent calls; otherwise, JAX will have to recompile the code, and we wouldn't be able to take advantage of the optimized speed.
+To take advantage of JAX's optimized speed on a TPU, pass `jit=True` to the pipeline to compile the JAX code into an efficient representation and to ensure the model runs in parallel across the 8 devices.

-The Flax pipeline can compile the code for us if we pass `jit = True` as an argument. It will also ensure that the model runs in parallel in the 8 available devices.
+
-The first time we run the following cell it will take a long time to compile, but subequent calls (even with different inputs) will be much faster. For example, it took more than a minute to compile in a TPU v2-8 when I tested, but then it takes about **`7s`** for future inference runs.
+You need to ensure all your inputs have the same shape in subsequent calls, otherwise JAX needs to recompile the code, which is slower.
+
-```
-%%time
-images = pipeline(prompt_ids, p_params, rng, jit=True)[0]
-```
+
-```python out
-CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s
-Wall time: 1min 29s
-```
+The first inference run takes more time because it needs to compile the code, but subsequent calls (even with different inputs) are much faster. For example, it took more than a minute to compile on a TPU v2-8, but then it takes about **7s** on a future inference run!

-The returned array has shape `(8, 1, 512, 512, 3)`. We reshape it to get rid of the second dimension and obtain 8 images of `512 Γ— 512 Γ— 3` and then convert them to PIL.
+```py
+%%time
+images = pipeline(prompt_ids, p_params, rng, jit=True)[0]

-```python
-images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
-images = pipeline.numpy_to_pil(images)
+"CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s"
+"Wall time: 1min 29s"
 ```

-### Visualization
+The returned array has shape `(8, 1, 512, 512, 3)` which should be reshaped to remove the second dimension and get 8 images of `512 Γ— 512 Γ— 3`. Then you can use the [`~utils.numpy_to_pil`] function to convert the arrays into images.

 ```python
 from diffusers import make_image_grid

+images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
+images = pipeline.numpy_to_pil(images)
 make_image_grid(images, 2, 4)
 ```

 ![img](https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/stable_diffusion_jax_how_to_cell_38_output_0.jpeg)

-
 ## Using different prompts

-We don't have to replicate the _same_ prompt in all the devices. We can do whatever we want: generate 2 prompts 4 times each, or even generate 8 different prompts at once. Let's do that!
-
-First, we'll refactor the input preparation code into a handy function:
+You don't necessarily have to use the same prompt on all devices. 
For example, to generate 8 different prompts: ```python prompts = [ @@ -179,9 +156,7 @@ prompts = [ "Armchair in the shape of an avocado", "Clown astronaut in space, with Earth in the background", ] -``` -```python prompt_ids = pipeline.prepare_inputs(prompts) prompt_ids = shard(prompt_ids) @@ -197,46 +172,41 @@ make_image_grid(images, 2, 4) ## How does parallelization work? -We said before that the `diffusers` Flax pipeline automatically compiles the model and runs it in parallel on all available devices. We'll now briefly look inside that process to show how it works. +The Flax pipeline in πŸ€— Diffusers automatically compiles the model and runs it in parallel on all available devices. Let's take a closer look at how that process works. -JAX parallelization can be done in multiple ways. The easiest one revolves around using the `jax.pmap` function to achieve single-program, multiple-data (SPMD) parallelization. It means we'll run several copies of the same code, each on different data inputs. More sophisticated approaches are possible, we invite you to go over the [JAX documentation](https://jax.readthedocs.io/en/latest/index.html) and the [`pjit` pages](https://jax.readthedocs.io/en/latest/jax-101/08-pjit.html?highlight=pjit) to explore this topic if you are interested! +JAX parallelization can be done in multiple ways. The easiest one revolves around using the [`jax.pmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.pmap.html) function to achieve single-program multiple-data (SPMD) parallelization. It means running several copies of the same code, each on different data inputs. More sophisticated approaches are possible, and you can go over to the JAX [documentation](https://jax.readthedocs.io/en/latest/index.html) to explore this topic in more detail if you are interested! -`jax.pmap` does two things for us: -- Compiles (or `jit`s) the code, as if we had invoked `jax.jit()`. This does not happen when we call `pmap`, but the first time the pmapped function is invoked. -- Ensures the compiled code runs in parallel in all the available devices. +`jax.pmap` does two things: -To show how it works we `pmap` the `_generate` method of the pipeline, which is the private method that runs generates images. Please, note that this method may be renamed or removed in future releases of `diffusers`. +1. Compiles (or "`jit`s") the code which is similar to `jax.jit()`. This does not happen when you call `pmap`, and only the first time the `pmap`ped function is called. +2. Ensures the compiled code runs in parallel on all available devices. + +To demonstrate, call `pmap` on the pipeline's `_generate` method (this is a private method that generates images and may be renamed or removed in future releases of πŸ€— Diffusers): ```python p_generate = pmap(pipeline._generate) ``` -After we use `pmap`, the prepared function `p_generate` will conceptually do the following: -* Invoke a copy of the underlying function `pipeline._generate` in each device. -* Send each device a different portion of the input arguments. That's what sharding is used for. In our case, `prompt_ids` has shape `(8, 1, 77, 768)`. This array will be split in `8` and each copy of `_generate` will receive an input with shape `(1, 77, 768)`. +After calling `pmap`, the prepared function `p_generate` will: -We can code `_generate` completely ignoring the fact that it will be invoked in parallel. 
We just care about our batch size (`1` in this example) and the dimensions that make sense for our code, and don't have to change anything to make it work in parallel.
+1. Make a copy of the underlying function, `pipeline._generate`, on each device.
+2. Send each device a different portion of the input arguments (this is why it's necessary to call the *shard* function). In this case, `prompt_ids` has shape `(8, 1, 77)` so the array is split into 8 and each copy of `_generate` receives an input with shape `(1, 77)`.

-The same way as when we used the pipeline call, the first time we run the following cell it will take a while, but then it will be much faster.
+The most important thing to pay attention to here is the batch size (1 in this example), and the input dimensions that make sense for your code. You don't have to change anything else to make the code work in parallel.

-```
+The first time you call the pipeline takes more time, but the calls afterward are much faster. The `block_until_ready` function is used to correctly measure inference time because JAX uses asynchronous dispatch and returns control to the Python loop as soon as it can. You don't need to use that in your code; blocking occurs automatically when you want to use the result of a computation that has not yet been materialized.
+
+```py
 %%time
 images = p_generate(prompt_ids, p_params, rng)
 images = images.block_until_ready()
-images.shape
+"CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s"
+"Wall time: 1min 15s"
 ```

-```python out
-CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s
-Wall time: 1min 15s
-```
+Check your image dimensions to see if they're correct:

 ```python
 images.shape
-```
-
-```python out
-(8, 1, 512, 512, 3)
-```
-
-We use `block_until_ready()` to correctly measure inference time, because JAX uses asynchronous dispatch and returns control to the Python loop as soon as it can. You don't need to use that in your code; blocking will occur automatically when you want to use the result of a computation that has not yet been materialized.
\ No newline at end of file
+"(8, 1, 512, 512, 3)"
+```
\ No newline at end of file