docs/source/en/optimization/fp16.md (3 additions, 3 deletions)
@@ -20,9 +20,9 @@ In many cases, optimizing for speed or memory leads to improved performance in t

</Tip>

-The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speedup you can expect.
+The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect.

-||Latency|Speedup|
+||latency|speed-up|
| ---------------- | ------- | ------- |
| original | 9.50s | x1 |
| fp16 | 3.61s | x2.63 |
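
For context, the fp16 numbers above come from loading the pipeline weights in half precision. A minimal sketch of how that is typically done in 🤗 Diffusers is shown below; the checkpoint name and device are illustrative assumptions, not taken from the diff:

```python
import torch
from diffusers import DiffusionPipeline

# Load the weights in float16 instead of the default float32
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint, for illustration only
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse on mars", num_inference_steps=50).images[0]
```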
@@ -32,7 +32,7 @@ The results below are obtained from generating a single 512x512 image from the p

## Use TensorFloat-32

-On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speed up computations with typically negligible loss in numerical accuracy.
+On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speeds up computations with typically negligible loss in numerical accuracy.
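
As a minimal sketch of what "enabling TF32 for matrix multiplications" looks like, using PyTorch's standard `torch.backends` flag (convolutions already default to TF32 on Ampere and later GPUs):

```python
import torch

# Opt in to TF32 for matrix multiplications; cuDNN convolutions already use TF32 by default
torch.backends.cuda.matmul.allow_tf32 = True
```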
docs/source/en/optimization/habana.md (19 additions, 21 deletions)
@@ -10,25 +10,22 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->

-# How to use Stable Diffusion on Habana Gaudi
+# Habana Gaudi

-🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum Habana](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion).
+🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). Follow the [installation](https://docs.habana.ai/en/latest/Installation_Guide/index.html) guide to install the SynapseAI and Gaudi drivers, and then install Optimum Habana:
-- Optimum Habana 1.6 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it.
-- SynapseAI 1.10.
+To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances:
+- [`~optimum.habana.diffusers.GaudiStableDiffusionPipeline`], a pipeline for text-to-image generation.
+- [`~optimum.habana.diffusers.GaudiDDIMScheduler`], a Gaudi-optimized scheduler.
-## Inference Pipeline
+When you initialize the pipeline, you have to specify `use_habana=True` to deploy it on HPUs and to get the fastest possible generation, you should enable **HPU graphs** with `use_hpu_graphs=True`.
-To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances:
-- A pipeline with [`GaudiStableDiffusionPipeline`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline). This pipeline supports *text-to-image generation*.
-- A scheduler with [`GaudiDDIMScheduler`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline#optimum.habana.diffusers.GaudiDDIMScheduler). This scheduler has been optimized for Habana Gaudi.
-
-When initializing the pipeline, you have to specify `use_habana=True` to deploy it on HPUs.
-Furthermore, in order to get the fastest possible generations you should enable **HPU graphs** with `use_hpu_graphs=True`.
-Finally, you will need to specify a [Gaudi configuration](https://huggingface.co/docs/optimum/habana/package_reference/gaudi_config) which can be downloaded from the [Hugging Face Hub](https://huggingface.co/Habana).
+Finally, specify a [`~optimum.habana.GaudiConfig`] which can be downloaded from the [Habana](https://huggingface.co/Habana) organization on the Hub.

-You can then call the pipeline to generate images by batches from one or several prompts:
+Now you can call the pipeline to generate images by batches from one or several prompts:

```python
outputs = pipeline(
    prompt=[
@@ -57,21 +55,21 @@ outputs = pipeline(
)
```

-For more information, check out Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository.
+For more information, check out 🤗 Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository.

## Benchmark

-Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32):
+We benchmarked Habana's first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32) to demonstrate their performance.
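
To see how the initialization options described in this diff fit together, here is a minimal sketch assuming `optimum-habana` is installed and using an illustrative checkpoint and prompt; the keyword arguments follow the 🤗 Optimum Habana usage guide linked above:

```python
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

model_name = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint, for illustration only

# Gaudi-optimized scheduler loaded from the same checkpoint
scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,       # deploy on HPUs
    use_hpu_graphs=True,   # enable HPU graphs for the fastest generation
    gaudi_config="Habana/stable-diffusion",  # Gaudi configuration from the Habana organization on the Hub
)

# Generate images by batches from one or several prompts
outputs = pipeline(
    prompt=["a photo of an astronaut riding a horse on mars"],
    num_images_per_prompt=4,
)
image = outputs.images[0]
```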
docs/source/en/optimization/mps.md (34 additions, 30 deletions)
@@ -10,29 +10,16 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->

-# How to use Stable Diffusion in Apple Silicon (M1/M2)
+# Metal Performance Shaders (MPS)

-🤗 Diffusers is compatible with Apple silicon for Stable Diffusion inference, using the PyTorch `mps` device. These are the steps you need to follow to use your M1 or M2 computer with Stable Diffusion.
+🤗 Diffusers is compatible with Apple silicon (M1/M2 chips) using the PyTorch [`mps`](https://pytorch.org/docs/stable/notes/mps.html) device, which uses the Metal framework to leverage the GPU on MacOS devices. You'll need to have:

-## Requirements
+- macOS computer with Apple silicon (M1/M2) hardware
+- macOS 12.6 or later (13.0 or later recommended)
+- arm64 version of Python
+- [PyTorch 2.0](https://pytorch.org/get-started/locally/) (recommended) or 1.13 (minimum version supported for `mps`)

-- Mac computer with Apple silicon (M1/M2) hardware.
-- macOS 12.6 or later (13.0 or later recommended).
-- arm64 version of Python.
-- PyTorch 2.0 (recommended) or 1.13 (minimum version supported for `mps`). You can install it with `pip` or `conda` using the instructions in https://pytorch.org/get-started/locally/.
-
-## Inference Pipeline
-
-The snippet below demonstrates how to use the `mps` backend using the familiar `to()` interface to move the Stable Diffusion pipeline to your M1 or M2 device.
-
-<Tip warning={true}>
-
-**If you are using PyTorch 1.13** you need to "prime" the pipeline using an additional one-time pass through it. This is a temporary workaround for a weird issue we detected: the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and it's ok to use just one inference step and discard the result.
-
-</Tip>
-
-We strongly recommend you use PyTorch 2 or better, as it solves a number of problems like the one described in the previous tip.
+The `mps` backend uses PyTorch's `.to()` interface to move the Stable Diffusion pipeline on to your M1 or M2 device:

```python
from diffusers import DiffusionPipeline
@@ -44,24 +31,41 @@ pipe = pipe.to("mps")
pipe.enable_attention_slicing()

prompt = "a photo of an astronaut riding a horse on mars"
+```
+
+<Tip warning={true}>
+
+Generating multiple prompts in a batch can [crash](https://github.com/huggingface/diffusers/issues/363) or fail to work reliably. We believe this is related to the [`mps`](https://github.com/pytorch/pytorch/issues/84039) backend in PyTorch. While this is being investigated, you should iterate instead of batching.
+
+</Tip>
+
+If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an additional one-time pass through it. This is a temporary workaround for an issue where the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and after just one inference step you can discard the result.
+
+```diff
+from diffusers import DiffusionPipeline

-# First-time "warmup" pass if PyTorch version is 1.13 (see explanation above)
prompt = "a photo of an astronaut riding a horse on mars"
+# First-time "warmup" pass if PyTorch version is 1.13
+ _ = pipe(prompt, num_inference_steps=1)

# Results match those from the CPU device after the warmup pass.
-image = pipe(prompt).images[0]
+image = pipe(prompt).images[0]
```

-## Performance Recommendations
+## Recommendation

-M1/M2 performance is very sensitive to memory pressure. The system will automatically swap if it needs to, but performance will degrade significantly when it does.
+M1/M2 performance is very sensitive to memory pressure. When this occurs, the system automatically swaps if it needs to which significantly degrades performance.

-We recommend you use _attention slicing_ to reduce memory pressure during inference and prevent swapping, particularly if your computer has less than 64 GB of system RAM, or if you generate images at non-standard resolutions larger than 512 × 512 pixels. Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually has a performance impact of ~20% in computers without universal memory, but we have observed _better performance_ in most Apple Silicon computers, unless you have 64 GB or more.
+To prevent this from happening, we recommend *attention slicing* to reduce memory pressure during inference and prevent swapping. This is especially relevant if your computer has less than 64GB of system RAM, or if you generate images at non-standard resolutions larger than 512×512 pixels. Call the [`~DiffusionPipeline.enable_attention_slicing`] function on your pipeline:
-- Generating multiple prompts in a batch [crashes or doesn't work reliably](https://github.com/huggingface/diffusers/issues/363). We believe this is related to the [`mps` backend in PyTorch](https://github.com/pytorch/pytorch/issues/84039). This is being resolved, but for now we recommend to iterate instead of batching.
+Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually improves performance by ~20% in computers without universal memory, but we've observed *better performance* in most Apple silicon computers unless you have 64GB of RAM or more.
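
As a minimal sketch of the attention-slicing recommendation above, assuming an illustrative Stable Diffusion checkpoint:

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")  # assumed checkpoint
pipe = pipe.to("mps")

# Perform the attention computation in slices to lower peak memory usage
pipe.enable_attention_slicing()

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```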
docs/source/en/optimization/onnx.md (22 additions, 44 deletions)
@@ -11,23 +11,19 @@ specific language governing permissions and limitations under the License.
-->

-# How to use ONNX Runtime for inference
+# ONNX Runtime

-🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime.
+🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install 🤗 Optimum with the following command for ONNX Runtime support:

-## Installation
-
-Install 🤗 Optimum with the following command for ONNX Runtime support:
-
-```
+```bash
pip install optimum["onnxruntime"]
```

-## Stable Diffusion
+This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime.

-### Inference
+## Stable Diffusion

-To load an ONNX model and run inference with ONNX Runtime, you need to replace [`StableDiffusionPipeline`] with `ORTStableDiffusionPipeline`. In case you want to load a PyTorch model and convert it to the ONNX format on-the-fly, you can set `export=True`.
+To load and run inference, use the [`~optimum.onnxruntime.ORTStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`:

```python
from optimum.onnxruntime import ORTStableDiffusionPipeline

-If you want to export the pipeline in the ONNX format offline and later use it for inference,
-you can use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
+<Tip warning={true}>
+
+Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching.
+
+</Tip>
+
+To export the pipeline in the ONNX format offline and use it later for inference,
+use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:

You can find more examples in 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting.

## Stable Diffusion XL

-### Export
-
-To export your model to ONNX, you can use the [Optimum CLI](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) as follows :
-Here is an example of how you can load a SDXL ONNX model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference with ONNX Runtime :
+To load and run inference with SDXL, use the [`~optimum.onnxruntime.ORTStableDiffusionXLPipeline`]:

```python
from optimum.onnxruntime import ORTStableDiffusionXLPipeline
@@ -97,13 +78,10 @@ prompt = "sailing ship in storm by Leonardo da Vinci"
To export the pipeline in the ONNX format and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
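
Putting the pieces above together, here is a minimal sketch of on-the-fly export and inference with ONNX Runtime; the checkpoint and output directory are illustrative assumptions, and the prompt is reused from the SDXL snippet above:

```python
from optimum.onnxruntime import ORTStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint, for illustration only

# Convert the PyTorch weights to ONNX on the fly and run them with ONNX Runtime
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)
image = pipeline("sailing ship in storm by Leonardo da Vinci").images[0]

# Save the exported ONNX pipeline so it can be reloaded later without export=True
pipeline.save_pretrained("./onnx-stable-diffusion")  # assumed output directory
```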