Commit cb1d975

feat: add wan2.1/2.2 support (#778)
* add wan vae suppport
* add wan model support
* add umt5 support
* add wan2.1 t2i support
* make flash attn work with wan
* make wan a little faster
* add wan2.1 t2v support
* add wan gguf support
* add offload params to cpu support
* add wan2.1 i2v support
* crop image before resize
* set default fps to 16
* add diff lora support
* fix wan2.1 i2v
* introduce sd_sample_params_t
* add wan2.2 t2v support
* add wan2.2 14B i2v support
* add wan2.2 ti2v support
* add high noise lora support
* sync: update ggml submodule url
* avoid build failure on linux
* avoid build failure
* update ggml
* update ggml
* fix sd_version_is_wan
* update ggml, fix cpu im2col_3d
* fix ggml_nn_attention_ext mask
* add cache support to ggml runner
* fix the issue of illegal memory access
* unify image loading processing
* add wan2.1/2.2 FLF2V support
* fix end_image mask
* update to latest ggml
* add GGUFReader
* update docs
1 parent 2eb3845 commit cb1d975


46 files changed, 768072 additions and 1411 deletions

.gitmodules

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 [submodule "ggml"]
 path = ggml
-url = https://github.com/ggerganov/ggml.git
+url = https://github.com/ggml-org/ggml.git

README.md

Lines changed: 48 additions & 20 deletions
@@ -4,19 +4,33 @@

 # stable-diffusion.cpp

-Inference of Stable Diffusion and Flux in pure C/C++
+Diffusion model(SD,Flux,Wan,...) inference in pure C/C++
+
+***Note that this project is under active development. \
+API and command-line parameters may change frequently.***

 ## Features

 - Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
 - Super lightweight and without external dependencies
-- SD1.x, SD2.x, SDXL and [SD3/SD3.5](./docs/sd3.md) support
-- !!!The VAE in SDXL encounters NaN issues under FP16, but unfortunately, the ggml_conv_2d only operates under FP16. Hence, a parameter is needed to specify the VAE that has fixed the FP16 NaN issue. You can find it here: [SDXL VAE FP16 Fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors).
-- [Flux-dev/Flux-schnell Support](./docs/flux.md)
-- [FLUX.1-Kontext-dev](./docs/kontext.md)
-- [Chroma](./docs/chroma.md)
-- [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) and [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo) support
-- [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
+- Supported models
+- Image Models
+- SD1.x, SD2.x, [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo)
+- SDXL, [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo)
+- !!!The VAE in SDXL encounters NaN issues under FP16, but unfortunately, the ggml_conv_2d only operates under FP16. Hence, a parameter is needed to specify the VAE that has fixed the FP16 NaN issue. You can find it here: [SDXL VAE FP16 Fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors).
+- [SD3/SD3.5](./docs/sd3.md)
+- [Flux-dev/Flux-schnell](./docs/flux.md)
+- [Chroma](./docs/chroma.md)
+- Image Edit Models
+- [FLUX.1-Kontext-dev](./docs/kontext.md)
+- Video Models
+- [Wan2.1/Wan2.2](./docs/wan.md)
+- [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
+- Control Net support with SD 1.5
+- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
+- Latent Consistency Models support (LCM/LCM-LoRA)
+- Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
+- Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
 - 16-bit, 32-bit float support
 - 2-bit, 3-bit, 4-bit, 5-bit and 8-bit integer quantization support
 - Accelerated memory-efficient CPU inference
@@ -26,15 +40,9 @@ Inference of Stable Diffusion and Flux in pure C/C++
 - Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models
 - No need to convert to `.ggml` or `.gguf` anymore!
 - Flash Attention for memory usage optimization
-- Original `txt2img` and `img2img` mode
 - Negative prompt
 - [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
-- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
-- Latent Consistency Models support (LCM/LCM-LoRA)
-- Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
-- Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
 - VAE tiling processing for reduce memory usage
-- Control Net support with SD 1.5
 - Sampling method
 - `Euler A`
 - `Euler`
@@ -287,8 +295,10 @@ arguments:
 If threads <= 0, then threads will be set to the number of CPU physical cores
 -m, --model [MODEL] path to full model
 --diffusion-model path to the standalone diffusion model
+--high-noise-diffusion-model path to the standalone high noise diffusion model
 --clip_l path to the clip-l text encoder
 --clip_g path to the clip-g text encoder
+--clip_vision path to the clip-vision encoder
 --t5xxl path to the t5xxl text encoder
 --vae [VAE] path to vae
 --taesd [TAESD_PATH] path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
@@ -303,8 +313,9 @@ arguments:
 If not specified, the default is the type of the weight file
 --tensor-type-rules [EXPRESSION] weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
 --lora-model-dir [DIR] lora model directory
--i, --init-img [IMAGE] path to the input image, required by img2img
+-i, --init-img [IMAGE] path to the init image, required by img2img
 --mask [MASK] path to the mask image, required by img2img with mask
+-i, --end-img [IMAGE] path to the end image, required by flf2v
 --control-image [IMAGE] path to image condition, control net
 -r, --ref-image [PATH] reference image for Flux Kontext models (can be used multiple times)
 -o, --output OUTPUT path to write result image to (default: ./output.png)
@@ -319,21 +330,34 @@ arguments:
 --skip-layers LAYERS Layers to skip for SLG steps: (default: [7,8,9])
 --skip-layer-start START SLG enabling point: (default: 0.01)
 --skip-layer-end END SLG disabling point: (default: 0.2)
+--scheduler {discrete, karras, exponential, ays, gits} Denoiser sigma scheduler (default: discrete)
+--sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd}
+sampling method (default: "euler_a")
+--steps STEPS number of sample steps (default: 20)
+--high-noise-cfg-scale SCALE (high noise) unconditional guidance scale: (default: 7.0)
+--high-noise-img-cfg-scale SCALE (high noise) image guidance scale for inpaint or instruct-pix2pix models: (default: same as --cfg-scale)
+--high-noise-guidance SCALE (high noise) distilled guidance scale for models with guidance input (default: 3.5)
+--high-noise-slg-scale SCALE (high noise) skip layer guidance (SLG) scale, only for DiT models: (default: 0)
+0 means disabled, a value of 2.5 is nice for sd3.5 medium
+--high-noise-eta SCALE (high noise) eta in DDIM, only for DDIM and TCD: (default: 0)
+--high-noise-skip-layers LAYERS (high noise) Layers to skip for SLG steps: (default: [7,8,9])
+--high-noise-skip-layer-start (high noise) SLG enabling point: (default: 0.01)
+--high-noise-skip-layer-end END (high noise) SLG disabling point: (default: 0.2)
+--high-noise-scheduler {discrete, karras, exponential, ays, gits} Denoiser sigma scheduler (default: discrete)
+--high-noise-sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd}
+(high noise) sampling method (default: "euler_a")
+--high-noise-steps STEPS (high noise) number of sample steps (default: 20)
 SLG will be enabled at step int([STEPS]*[START]) and disabled at int([STEPS]*[END])
 --strength STRENGTH strength for noising/unnoising (default: 0.75)
 --style-ratio STYLE-RATIO strength for keeping input identity (default: 20)
 --control-strength STRENGTH strength to apply Control Net (default: 0.9)
 1.0 corresponds to full destruction of information in init image
 -H, --height H image height, in pixel space (default: 512)
 -W, --width W image width, in pixel space (default: 512)
---sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd}
-sampling method (default: "euler_a")
---steps STEPS number of sample steps (default: 20)
 --rng {std_default, cuda} RNG (default: cuda)
 -s SEED, --seed SEED RNG seed (default: 42, use random seed for < 0)
 -b, --batch-count COUNT number of images to generate
---schedule {discrete, karras, exponential, ays, gits} Denoiser sigma schedule (default: discrete)
---clip-skip N ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer (default: -1)
+--clip-skip N ignore last_dot_pos layers of CLIP network; 1 ignores none, 2 ignores one layer (default: -1)
 <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
 --vae-tiling process vae in tiles to reduce memory usage
 --vae-on-cpu keep vae in cpu (for low vram)
@@ -351,6 +375,8 @@ arguments:
 --chroma-disable-dit-mask disable dit mask for chroma
 --chroma-enable-t5-mask enable t5 mask for chroma
 --chroma-t5-mask-pad PAD_SIZE t5 mask pad size of chroma
+--video-frames video frames (default: 1)
+--fps fps (default: 24)
 -v, --verbose print extra info
 ```

@@ -438,3 +464,5 @@ Thank you to all the people who have already contributed to stable-diffusion.cpp
 - [latent-consistency-model](https://github.com/luosiallen/latent-consistency-model)
 - [generative-models](https://github.com/Stability-AI/generative-models/)
 - [PhotoMaker](https://github.com/TencentARC/PhotoMaker)
+- [Wan2.1](https://github.com/Wan-Video/Wan2.1)
+- [Wan2.2](https://github.com/Wan-Video/Wan2.2)
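
Reviewer note on the help text in the README diff: it states that SLG is enabled at step `int([STEPS]*[START])` and disabled at `int([STEPS]*[END])`, and the new `--high-noise-*` options repeat the same defaults. The sketch below only works through that arithmetic. The function name `slg_step_window` is hypothetical and not part of stable-diffusion.cpp, and it assumes that `int()` truncates toward zero as a C integer cast would. The help text does not say whether the boundary steps themselves are included, so the sketch returns the two bounds without interpreting them.

```python
from typing import Tuple


def slg_step_window(steps: int, start: float, end: float) -> Tuple[int, int]:
    """Return (enable_step, disable_step) as described by the help text.

    Hypothetical helper for illustration only; it mirrors the formula
    int(STEPS * START) and int(STEPS * END) and nothing else.
    """
    if steps < 0:
        raise ValueError("steps must be non-negative")
    if not 0.0 <= start <= end <= 1.0:
        raise ValueError("expected 0 <= start <= end <= 1")
    # int() truncates toward zero, matching the int(...) in the help text.
    return int(steps * start), int(steps * end)


if __name__ == "__main__":
    # Documented defaults: --steps 20, --skip-layer-start 0.01, --skip-layer-end 0.2
    print(slg_step_window(20, 0.01, 0.2))  # prints (0, 4)
```

With the documented defaults the two bounds are 0 and 4, so under this reading SLG would only be active during the first few of the 20 sampling steps.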

assets/wan/Wan2.1_1.3B_t2v.mp4

238 KB
Binary file not shown.

assets/wan/Wan2.1_14B_flf2v.mp4

674 KB
Binary file not shown.

assets/wan/Wan2.1_14B_i2v.mp4

340 KB
Binary file not shown.

assets/wan/Wan2.1_14B_t2v.mp4

256 KB
Binary file not shown.

assets/wan/Wan2.2_14B_flf2v.mp4

890 KB
Binary file not shown.

assets/wan/Wan2.2_14B_i2v.mp4

309 KB
Binary file not shown.

assets/wan/Wan2.2_14B_t2i.png

595 KB

assets/wan/Wan2.2_14B_t2v.mp4

282 KB
Binary file not shown.
