docs/source/en/optimization/fp16.md (3 additions, 3 deletions)
@@ -20,9 +20,9 @@ In many cases, optimizing for speed or memory leads to improved performance in t

</Tip>

-The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speedup you can expect.
+The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect.

-||Latency|Speedup|
+||latency|speed-up|
| ---------------- | ------- | ------- |
| original | 9.50s | x1 |
| fp16 | 3.61s | x2.63 |
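
For context, the fp16 numbers above come from loading the pipeline weights in half precision. A minimal sketch of how that is typically done in 🤗 Diffusers is shown below; the checkpoint name and device are illustrative assumptions, not taken from the diff:

```python
import torch
from diffusers import DiffusionPipeline

# Load the weights in float16 instead of the default float32
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint, for illustration only
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse on mars", num_inference_steps=50).images[0]
```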
@@ -32,7 +32,7 @@ The results below are obtained from generating a single 512x512 image from the p

## Use TensorFloat-32

-On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speed up computations with typically negligible loss in numerical accuracy.
+On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speeds up computations with typically negligible loss in numerical accuracy.
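
As a minimal sketch of what "enabling TF32 for matrix multiplications" looks like, using PyTorch's standard `torch.backends` flag (convolutions already default to TF32 on Ampere and later GPUs):

```python
import torch

# Opt in to TF32 for matrix multiplications; cuDNN convolutions already use TF32 by default
torch.backends.cuda.matmul.allow_tf32 = True
```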
docs/source/en/optimization/habana.md (19 additions, 21 deletions)
@@ -10,25 +10,22 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->

-# How to use Stable Diffusion on Habana Gaudi
+# Habana Gaudi

-🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum Habana](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion).
+🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). Follow the [installation](https://docs.habana.ai/en/latest/Installation_Guide/index.html) guide to install the SynapseAI and Gaudi drivers, and then install Optimum Habana:
-- Optimum Habana 1.6 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it.
-- SynapseAI 1.10.
+To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances:
+- [`~optimum.habana.diffusers.GaudiStableDiffusionPipeline`], a pipeline for text-to-image generation.
+- [`~optimum.habana.diffusers.GaudiDDIMScheduler`], a Gaudi-optimized scheduler.
-## Inference Pipeline
+When you initialize the pipeline, you have to specify `use_habana=True` to deploy it on HPUs and to get the fastest possible generation, you should enable **HPU graphs** with `use_hpu_graphs=True`.
-To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances:
-- A pipeline with [`GaudiStableDiffusionPipeline`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline). This pipeline supports *text-to-image generation*.
-- A scheduler with [`GaudiDDIMScheduler`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline#optimum.habana.diffusers.GaudiDDIMScheduler). This scheduler has been optimized for Habana Gaudi.
-
-When initializing the pipeline, you have to specify `use_habana=True` to deploy it on HPUs.
-Furthermore, in order to get the fastest possible generations you should enable **HPU graphs** with `use_hpu_graphs=True`.
-Finally, you will need to specify a [Gaudi configuration](https://huggingface.co/docs/optimum/habana/package_reference/gaudi_config) which can be downloaded from the [Hugging Face Hub](https://huggingface.co/Habana).
+Finally, specify a [`~optimum.habana.GaudiConfig`] which can be downloaded from the [Habana](https://huggingface.co/Habana) organization on the Hub.

-You can then call the pipeline to generate images by batches from one or several prompts:
+Now you can call the pipeline to generate images by batches from one or several prompts:

```python
outputs = pipeline(
    prompt=[
@@ -57,21 +55,21 @@ outputs = pipeline(
)
```

-For more information, check out Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository.
+For more information, check out 🤗 Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository.

## Benchmark

-Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32):
+We benchmarked Habana's first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32) to demonstrate their performance.
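
To see how the initialization options described in this diff fit together, here is a minimal sketch assuming `optimum-habana` is installed and using an illustrative checkpoint and prompt; the keyword arguments follow the 🤗 Optimum Habana usage guide linked above:

```python
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

model_name = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint, for illustration only

# Gaudi-optimized scheduler loaded from the same checkpoint
scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,       # deploy on HPUs
    use_hpu_graphs=True,   # enable HPU graphs for the fastest generation
    gaudi_config="Habana/stable-diffusion",  # Gaudi configuration from the Habana organization on the Hub
)

# Generate images by batches from one or several prompts
outputs = pipeline(
    prompt=["a photo of an astronaut riding a horse on mars"],
    num_images_per_prompt=4,
)
image = outputs.images[0]
```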
docs/source/en/optimization/mps.md (34 additions, 30 deletions)
@@ -10,29 +10,16 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->

-# How to use Stable Diffusion in Apple Silicon (M1/M2)
+# Metal Performance Shaders (MPS)

-🤗 Diffusers is compatible with Apple silicon for Stable Diffusion inference, using the PyTorch `mps` device. These are the steps you need to follow to use your M1 or M2 computer with Stable Diffusion.
+🤗 Diffusers is compatible with Apple silicon (M1/M2 chips) using the PyTorch [`mps`](https://pytorch.org/docs/stable/notes/mps.html) device, which uses the Metal framework to leverage the GPU on MacOS devices. You'll need to have:

-## Requirements
+- macOS computer with Apple silicon (M1/M2) hardware
+- macOS 12.6 or later (13.0 or later recommended)
+- arm64 version of Python
+- [PyTorch 2.0](https://pytorch.org/get-started/locally/) (recommended) or 1.13 (minimum version supported for `mps`)

-- Mac computer with Apple silicon (M1/M2) hardware.
-- macOS 12.6 or later (13.0 or later recommended).
-- arm64 version of Python.
-- PyTorch 2.0 (recommended) or 1.13 (minimum version supported for `mps`). You can install it with `pip` or `conda` using the instructions in https://pytorch.org/get-started/locally/.
-
-## Inference Pipeline
-
-The snippet below demonstrates how to use the `mps` backend using the familiar `to()` interface to move the Stable Diffusion pipeline to your M1 or M2 device.
-
-<Tip warning={true}>
-
-**If you are using PyTorch 1.13** you need to "prime" the pipeline using an additional one-time pass through it. This is a temporary workaround for a weird issue we detected: the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and it's ok to use just one inference step and discard the result.
-
-</Tip>
-
-We strongly recommend you use PyTorch 2 or better, as it solves a number of problems like the one described in the previous tip.
+The `mps` backend uses PyTorch's `.to()` interface to move the Stable Diffusion pipeline on to your M1 or M2 device:

```python
from diffusers import DiffusionPipeline
@@ -44,24 +31,41 @@ pipe = pipe.to("mps")
pipe.enable_attention_slicing()

prompt = "a photo of an astronaut riding a horse on mars"
+```
+
+<Tip warning={true}>
+
+Generating multiple prompts in a batch can [crash](https://github.com/huggingface/diffusers/issues/363) or fail to work reliably. We believe this is related to the [`mps`](https://github.com/pytorch/pytorch/issues/84039) backend in PyTorch. While this is being investigated, you should iterate instead of batching.
+
+</Tip>
+
+If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an additional one-time pass through it. This is a temporary workaround for an issue where the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and after just one inference step you can discard the result.
+
+```diff
+from diffusers import DiffusionPipeline

-# First-time "warmup" pass if PyTorch version is 1.13 (see explanation above)
prompt = "a photo of an astronaut riding a horse on mars"
+# First-time "warmup" pass if PyTorch version is 1.13
+ _ = pipe(prompt, num_inference_steps=1)

# Results match those from the CPU device after the warmup pass.
-image = pipe(prompt).images[0]
+image = pipe(prompt).images[0]
```

-## Performance Recommendations
+## Recommendation

-M1/M2 performance is very sensitive to memory pressure. The system will automatically swap if it needs to, but performance will degrade significantly when it does.
+M1/M2 performance is very sensitive to memory pressure. When this occurs, the system automatically swaps if it needs to which significantly degrades performance.

-We recommend you use _attention slicing_ to reduce memory pressure during inference and prevent swapping, particularly if your computer has less than 64 GB of system RAM, or if you generate images at non-standard resolutions larger than 512 × 512 pixels. Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually has a performance impact of ~20% in computers without universal memory, but we have observed _better performance_ in most Apple Silicon computers, unless you have 64 GB or more.
+To prevent this from happening, we recommend *attention slicing* to reduce memory pressure during inference and prevent swapping. This is especially relevant if your computer has less than 64GB of system RAM, or if you generate images at non-standard resolutions larger than 512×512 pixels. Call the [`~DiffusionPipeline.enable_attention_slicing`] function on your pipeline:
-- Generating multiple prompts in a batch [crashes or doesn't work reliably](https://github.com/huggingface/diffusers/issues/363). We believe this is related to the [`mps` backend in PyTorch](https://github.com/pytorch/pytorch/issues/84039). This is being resolved, but for now we recommend to iterate instead of batching.
+Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually improves performance by ~20% in computers without universal memory, but we've observed *better performance* in most Apple silicon computers unless you have 64GB of RAM or more.
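
As a minimal sketch of the attention-slicing recommendation above, assuming an illustrative Stable Diffusion checkpoint:

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")  # assumed checkpoint
pipe = pipe.to("mps")

# Perform the attention computation in slices to lower peak memory usage
pipe.enable_attention_slicing()

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```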
docs/source/en/optimization/onnx.md (22 additions, 44 deletions)
@@ -11,23 +11,19 @@ specific language governing permissions and limitations under the License.
-->

-# How to use ONNX Runtime for inference
+# ONNX Runtime

-🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime.
+🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install 🤗 Optimum with the following command for ONNX Runtime support:

-## Installation
-
-Install 🤗 Optimum with the following command for ONNX Runtime support:
-
-```
+```bash
pip install optimum["onnxruntime"]
```

-## Stable Diffusion
+This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime.

-### Inference
+## Stable Diffusion

-To load an ONNX model and run inference with ONNX Runtime, you need to replace [`StableDiffusionPipeline`] with `ORTStableDiffusionPipeline`. In case you want to load a PyTorch model and convert it to the ONNX format on-the-fly, you can set `export=True`.
+To load and run inference, use the [`~optimum.onnxruntime.ORTStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`:

```python
from optimum.onnxruntime import ORTStableDiffusionPipeline

-If you want to export the pipeline in the ONNX format offline and later use it for inference,
-you can use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
+<Tip warning={true}>
+
+Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching.
+
+</Tip>
+
+To export the pipeline in the ONNX format offline and use it later for inference,
+use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:

You can find more examples in 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting.

## Stable Diffusion XL

-### Export
-
-To export your model to ONNX, you can use the [Optimum CLI](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) as follows :
-Here is an example of how you can load a SDXL ONNX model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference with ONNX Runtime :
+To load and run inference with SDXL, use the [`~optimum.onnxruntime.ORTStableDiffusionXLPipeline`]:

```python
from optimum.onnxruntime import ORTStableDiffusionXLPipeline
@@ -97,13 +78,10 @@ prompt = "sailing ship in storm by Leonardo da Vinci"
To export the pipeline in the ONNX format and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
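
Putting the pieces above together, here is a minimal sketch of on-the-fly export and inference with ONNX Runtime; the checkpoint and output directory are illustrative assumptions, and the prompt is reused from the SDXL snippet above:

```python
from optimum.onnxruntime import ORTStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint, for illustration only

# Convert the PyTorch weights to ONNX on the fly and run them with ONNX Runtime
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)
image = pipeline("sailing ship in storm by Leonardo da Vinci").images[0]

# Save the exported ONNX pipeline so it can be reloaded later without export=True
pipeline.save_pretrained("./onnx-stable-diffusion")  # assumed output directory
```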