Commit 3ecdab0

sayakpaul and stevhliu committed

add docs on model sharding (#8658)
* add docs on model sharding
* add entry to _toctree.
* Apply suggestions from code review

  Co-authored-by: Steven Liu <[email protected]>

* simplify wording
* add a note on transformer library handling
* move device placement section
* Update docs/source/en/training/distributed_inference.md

  Co-authored-by: Steven Liu <[email protected]>

---------

Co-authored-by: Steven Liu <[email protected]>
1 parent 7c4736c · commit 3ecdab0


3 files changed: +144 -70 lines changed


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -21,6 +21,8 @@
     title: Load LoRAs for inference
   - local: tutorials/fast_diffusion
     title: Accelerate inference of text-to-image diffusion models
+  - local: tutorials/inference_with_big_models
+    title: Working with big models
   title: Tutorials
 - sections:
   - local: using-diffusers/loading
```

docs/source/en/training/distributed_inference.md

Lines changed: 3 additions & 70 deletions

@@ -52,76 +52,6 @@ To learn more, take a look at the [Distributed Inference with 🤗 Accelerate](h

</Tip>

*(The 70 deleted lines are the entire "### Device placement" section: the experimental-feature warning, the `device_map="balanced"` examples, the `max_memory` dictionary, `reset_device_map`, and `hf_device_map`. The section moves unchanged into the new "Working with big models" tutorial below.)*

## PyTorch Distributed

PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables data parallelism.

@@ -176,3 +106,6 @@ Once you've completed the inference script, use the `--nproc_per_node` argument

```bash
torchrun run_distributed.py --nproc_per_node=2
```

+
+> [!TIP]
+> You can use `device_map` within a [`DiffusionPipeline`] to distribute its model-level components on multiple devices. Refer to the [Device placement](../tutorials/inference_with_big_models#device-placement) guide to learn more.
docs/source/en/tutorials/inference_with_big_models.md

Lines changed: 139 additions & 0 deletions (new file)

<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Working with big models

A modern diffusion model, like [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl), is not just a single model, but a collection of multiple models. SDXL has four different model-level components:

* A variational autoencoder (VAE)
* Two text encoders
* A UNet for denoising

Usually, the text encoders and the denoiser are much larger compared to the VAE.
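
If you want to see where the memory goes, the short sketch below iterates over the pipeline's `components` dictionary and prints the parameter count of each model-level component (it assumes the SDXL checkpoint is cached locally or can be downloaded):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True
)

# Print the parameter count of every model-level component; the UNet and the
# two text encoders dominate, while the VAE is comparatively small.
for name, component in pipeline.components.items():
    if isinstance(component, torch.nn.Module):
        num_params = sum(p.numel() for p in component.parameters())
        print(f"{name}: {num_params / 1e6:.0f}M parameters")
```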

As models get bigger and better, it’s possible your model is so big that even a single copy won’t fit in memory. But that doesn’t mean it can’t be loaded. If you have more than one GPU, there is more memory available to store your model. In this case, it’s better to split your model checkpoint into several smaller *checkpoint shards*.

When a text encoder checkpoint has multiple shards, like [T5-xxl for SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers/tree/main/text_encoder_3), it is automatically handled by the [Transformers](https://huggingface.co/docs/transformers/index) library as it is a required dependency of Diffusers when using the [`StableDiffusion3Pipeline`]. More specifically, Transformers will automatically handle the loading of multiple shards within the requested model class and get it ready so that inference can be performed.
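
For instance, the sharded T5-xxl encoder can be loaded on its own through Transformers with a single call; the sketch below assumes you have access to the gated SD3 repository:

```python
import torch
from transformers import T5EncoderModel

# Transformers finds the shard files and their index inside the subfolder and
# assembles the full model before returning it.
text_encoder_3 = T5EncoderModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="text_encoder_3",
    torch_dtype=torch.float16,
)
```

The resulting model can then be passed to [`StableDiffusion3Pipeline`] via its `text_encoder_3` argument.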

The denoiser checkpoint can also have multiple shards and supports inference thanks to the [Accelerate](https://huggingface.co/docs/accelerate/index) library.

> [!TIP]
> Refer to the [Handling big models for inference](https://huggingface.co/docs/accelerate/main/en/concept_guides/big_model_inference) guide for general guidance when working with big models that are hard to fit into memory.

For example, let's save a sharded checkpoint for the [SDXL UNet](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/unet):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.save_pretrained("sdxl-unet-sharded", max_shard_size="5GB")
```
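
To confirm the checkpoint was actually split, you can list the output directory; the shard filenames in the comment below are an assumption and may differ across library versions:

```python
import os

# Expect several shard files plus an index that maps each weight to its shard,
# e.g. diffusion_pytorch_model-00001-of-00003.safetensors and
# diffusion_pytorch_model.safetensors.index.json.
print(sorted(os.listdir("sdxl-unet-sharded")))
```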

The size of the fp32 variant of the SDXL UNet checkpoint is ~10.4GB. Set the `max_shard_size` parameter to 5GB to create 3 shards. After saving, you can load them in [`StableDiffusionXLPipeline`]:

```python
from diffusers import UNet2DConditionModel, StableDiffusionXLPipeline
import torch

# "sayakpaul/sdxl-unet-sharded" is a Hub copy of the checkpoint sharded above;
# a local path such as "sdxl-unet-sharded" works just as well.
unet = UNet2DConditionModel.from_pretrained(
    "sayakpaul/sdxl-unet-sharded", torch_dtype=torch.float16
)
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16
).to("cuda")

image = pipeline("a cute dog running on the grass", num_inference_steps=30).images[0]
image.save("dog.png")
```

If placing all the model-level components on the GPU at once is not feasible, use [`~DiffusionPipeline.enable_model_cpu_offload`] to help you:

```diff
- pipeline.to("cuda")
+ pipeline.enable_model_cpu_offload()
```

In general, we recommend sharding when a checkpoint is more than 5GB (in fp32).
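
If you are unsure whether a checkpoint crosses that threshold, a rough check like the sketch below sums the in-memory size of the parameters (shown for the SDXL UNet used above):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)

# Total fp32 parameter size in GB; above roughly 5GB, consider sharding.
size_gb = sum(p.numel() * p.element_size() for p in unet.parameters()) / 1024**3
print(f"{size_gb:.1f} GB")
```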

## Device placement

On distributed setups, you can run inference across multiple GPUs with Accelerate.

> [!WARNING]
> This feature is experimental and its APIs might change in the future.

With Accelerate, you can use the `device_map` to determine how to distribute the models of a pipeline across multiple devices. This is useful in situations where you have more than one GPU.

For example, if you have two 8GB GPUs, then using [`~DiffusionPipeline.enable_model_cpu_offload`] may not work so well because:

* it only works on a single GPU
* a single model might not fit on a single GPU ([`~DiffusionPipeline.enable_sequential_cpu_offload`] might work but it will be extremely slow and it is also limited to a single GPU)

To make use of both GPUs, you can use the "balanced" device placement strategy which splits the models across all available GPUs.

> [!WARNING]
> Only the "balanced" strategy is supported at the moment, and we plan to support additional mapping strategies in the future.

```diff
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, device_map="balanced"
)
image = pipeline("a dog").images[0]
image
```

You can also pass a dictionary to enforce the maximum GPU memory that can be used on each device:

```diff
from diffusers import DiffusionPipeline
import torch

max_memory = {0:"1GB", 1:"1GB"}
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
    device_map="balanced",
+    max_memory=max_memory
)
image = pipeline("a dog").images[0]
image
```

If a device is not present in `max_memory`, then it will be completely ignored and will not participate in the device placement.
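
For example, listing only GPU 0 keeps GPU 1 out of the placement entirely; a sketch with an illustrative memory limit:

```py
from diffusers import DiffusionPipeline
import torch

# GPU 1 is not listed in max_memory, so it does not participate in placement.
max_memory = {0: "8GB"}
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
    device_map="balanced",
    max_memory=max_memory,
)
```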

By default, Diffusers uses the maximum memory of all devices. If the models don't fit on the GPUs, they are offloaded to the CPU. If the CPU doesn't have enough memory, then you might see an error. In that case, you could fall back to using [`~DiffusionPipeline.enable_sequential_cpu_offload`] and [`~DiffusionPipeline.enable_model_cpu_offload`].

Call [`~DiffusionPipeline.reset_device_map`] to reset the `device_map` of a pipeline. This is also necessary if you want to use methods like `to()`, [`~DiffusionPipeline.enable_sequential_cpu_offload`], and [`~DiffusionPipeline.enable_model_cpu_offload`] on a pipeline that was device-mapped.

```py
pipeline.reset_device_map()
```
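
For example, once the device map is reset, single-GPU workflows such as offloading become available again:

```py
pipeline.reset_device_map()

# With the device map cleared, the pipeline can be moved or offloaded as usual.
pipeline.enable_model_cpu_offload()
```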

Once a pipeline has been device-mapped, you can also access its device map via `hf_device_map`:

```py
print(pipeline.hf_device_map)
```

An example device map would look like so:

```bash
{'unet': 1, 'vae': 1, 'safety_checker': 0, 'text_encoder': 0}
```
