diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 4879a7bf045e..e5d8d2beec78 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -203,34 +203,6 @@ - local: optimization/neuron title: AWS Neuron -- title: Specific pipeline examples - isExpanded: false - sections: - - local: using-diffusers/consisid - title: ConsisID - - local: using-diffusers/sdxl - title: Stable Diffusion XL - - local: using-diffusers/sdxl_turbo - title: SDXL Turbo - - local: using-diffusers/kandinsky - title: Kandinsky - - local: using-diffusers/omnigen - title: OmniGen - - local: using-diffusers/pag - title: PAG - - local: using-diffusers/inference_with_lcm - title: Latent Consistency Model - - local: using-diffusers/shap-e - title: Shap-E - - local: using-diffusers/diffedit - title: DiffEdit - - local: using-diffusers/inference_with_tcd_lora - title: Trajectory Consistency Distillation-LoRA - - local: using-diffusers/svd - title: Stable Video Diffusion - - local: using-diffusers/marigold_usage - title: Marigold Computer Vision - - title: Resources isExpanded: false sections: diff --git a/docs/source/en/api/pipelines/consisid.md b/docs/source/en/api/pipelines/consisid.md index db6b5e59aca3..54385334121a 100644 --- a/docs/source/en/api/pipelines/consisid.md +++ b/docs/source/en/api/pipelines/consisid.md @@ -40,6 +40,44 @@ There are two official ConsisID checkpoints for identity-preserving text-to-vide | [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 | | [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 | +## Load Model Checkpoints + +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method. + +```python +# !pip install consisid_eva_clip insightface facexlib +import torch +from diffusers import ConsisIDPipeline +from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer +from huggingface_hub import snapshot_download + +# Download ckpts +snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview") + +# Load face helper model to preprocess input face image +face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16) + +# Load consisid base model +pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16) +pipe.to("cuda") +``` + +## Identity-Preserving Text-to-Video + +For identity-preserving text-to-video, pass a text prompt and an image contain clear face (e.g., preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results. + +```python +from diffusers.utils import export_to_video + +prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. 
The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel." +image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true" + +id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True) + +video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42)) +export_to_video(video.frames[0], "output.mp4", fps=8) +``` + ### Memory optimization ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it not possible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script. @@ -52,6 +90,12 @@ ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of vi | vae.enable_slicing | 16 GB | 22 GB | | vae.enable_tiling | 5 GB | 7 GB | +## Resources + +Learn more about ConsisID with the following resources. +- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features. +- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440) for more details. + ## ConsisIDPipeline [[autodoc]] ConsisIDPipeline diff --git a/docs/source/en/api/pipelines/diffedit.md b/docs/source/en/api/pipelines/diffedit.md index 9734ca2eabc3..98e2ffe793fb 100644 --- a/docs/source/en/api/pipelines/diffedit.md +++ b/docs/source/en/api/pipelines/diffedit.md @@ -25,6 +25,262 @@ The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️ +The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated from the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, and includes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then: + +```py +source_prompt = "a bowl of fruits" +target_prompt = "a bowl of pears" +``` + +The partially inverted latents are generated from the [`~StableDiffusionDiffEditPipeline.invert`] function, and it is generally a good idea to include a `prompt` or *caption* describing the image to help guide the inverse latent sampling process. The caption can often be your `source_prompt`, but feel free to experiment with other text descriptions! 
+ +Let's load the pipeline, scheduler, inverse scheduler, and enable some optimizations to reduce memory usage: + +```py +import torch +from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline + +pipeline = StableDiffusionDiffEditPipeline.from_pretrained( + "stabilityai/stable-diffusion-2-1", + torch_dtype=torch.float16, + safety_checker=None, + use_safetensors=True, +) +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config) +pipeline.enable_model_cpu_offload() +pipeline.enable_vae_slicing() +``` + +Load the image to edit: + +```py +from diffusers.utils import load_image, make_image_grid + +img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" +raw_image = load_image(img_url).resize((768, 768)) +raw_image +``` + +Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image: + +```py +from PIL import Image + +source_prompt = "a bowl of fruits" +target_prompt = "a basket of pears" +mask_image = pipeline.generate_mask( + image=raw_image, + source_prompt=source_prompt, + target_prompt=target_prompt, +) +Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) +``` + +Next, create the inverted latents and pass it a caption describing the image: + +```py +inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents +``` + +Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`: + +```py +output_image = pipeline( + prompt=target_prompt, + mask_image=mask_image, + image_latents=inv_latents, + negative_prompt=source_prompt, +).images[0] +mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) +make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3) +``` + +
+<!-- figure: original image (left) and edited image (right) -->
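+
+The mask generation, inversion, and denoising steps are all stochastic, so the edit changes from run to run. One option for reproducible results (a sketch that reuses the `pipeline`, prompts, and `raw_image` from above) is to pass a seeded generator to each call:
+
+```py
+import torch
+
+generator = torch.manual_seed(0)
+
+mask_image = pipeline.generate_mask(
+    image=raw_image,
+    source_prompt=source_prompt,
+    target_prompt=target_prompt,
+    generator=generator,
+)
+inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image, generator=generator).latents
+output_image = pipeline(
+    prompt=target_prompt,
+    mask_image=mask_image,
+    image_latents=inv_latents,
+    negative_prompt=source_prompt,
+    generator=generator,
+).images[0]
+```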
+ +## Generate source and target embeddings + +The source and target embeddings can be automatically generated with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model instead of creating them manually. + +Load the Flan-T5 model and tokenizer from the 🤗 Transformers library: + +```py +import torch +from transformers import AutoTokenizer, T5ForConditionalGeneration + +tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large") +model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16) +``` + +Provide some initial text to prompt the model to generate the source and target prompts. + +```py +source_concept = "bowl" +target_concept = "basket" + +source_text = f"Provide a caption for images containing a {source_concept}. " +"The captions should be in English and should be no longer than 150 characters." + +target_text = f"Provide a caption for images containing a {target_concept}. " +"The captions should be in English and should be no longer than 150 characters." +``` + +Next, create a utility function to generate the prompts: + +```py +@torch.no_grad() +def generate_prompts(input_prompt): + input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda") + + outputs = model.generate( + input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10 + ) + return tokenizer.batch_decode(outputs, skip_special_tokens=True) + +source_prompts = generate_prompts(source_text) +target_prompts = generate_prompts(target_text) +print(source_prompts) +print(target_prompts) +``` + + + +Check out the [generation strategy](https://huggingface.co/docs/transformers/main/en/generation_strategies) guide if you're interested in learning more about strategies for generating different quality text. + + + +Load the text encoder model used by the [`StableDiffusionDiffEditPipeline`] to encode the text. 
You'll use the text encoder to compute the text embeddings: + +```py +import torch +from diffusers import StableDiffusionDiffEditPipeline + +pipeline = StableDiffusionDiffEditPipeline.from_pretrained( + "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True +) +pipeline.enable_model_cpu_offload() +pipeline.enable_vae_slicing() + +@torch.no_grad() +def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"): + embeddings = [] + for sent in sentences: + text_inputs = tokenizer( + sent, + padding="max_length", + max_length=tokenizer.model_max_length, + truncation=True, + return_tensors="pt", + ) + text_input_ids = text_inputs.input_ids + prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0] + embeddings.append(prompt_embeds) + return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0) + +source_embeds = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder) +target_embeds = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder) +``` + +Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_mask`] and [`~StableDiffusionDiffEditPipeline.invert`] functions, and pipeline to generate the image: + +```diff + from diffusers import DDIMInverseScheduler, DDIMScheduler + from diffusers.utils import load_image, make_image_grid + from PIL import Image + + pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) + pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config) + + img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" + raw_image = load_image(img_url).resize((768, 768)) + + mask_image = pipeline.generate_mask( + image=raw_image, +- source_prompt=source_prompt, +- target_prompt=target_prompt, ++ source_prompt_embeds=source_embeds, ++ target_prompt_embeds=target_embeds, + ) + + inv_latents = pipeline.invert( +- prompt=source_prompt, ++ prompt_embeds=source_embeds, + image=raw_image, + ).latents + + output_image = pipeline( + mask_image=mask_image, + image_latents=inv_latents, +- prompt=target_prompt, +- negative_prompt=source_prompt, ++ prompt_embeds=target_embeds, ++ negative_prompt_embeds=source_embeds, + ).images[0] + mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L") + make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3) +``` + +## Generate a caption for inversion + +While you can use the `source_prompt` as a caption to help generate the partially inverted latents, you can also use the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model to automatically generate a caption. 
+ +Load the BLIP model and processor from the 🤗 Transformers library: + +```py +import torch +from transformers import BlipForConditionalGeneration, BlipProcessor + +processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base") +model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16, low_cpu_mem_usage=True) +``` + +Create a utility function to generate a caption from the input image: + +```py +@torch.no_grad() +def generate_caption(images, caption_generator, caption_processor): + text = "a photograph of" + + inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype) + caption_generator.to("cuda") + outputs = caption_generator.generate(**inputs, max_new_tokens=128) + + # offload caption generator + caption_generator.to("cpu") + + caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0] + return caption +``` + +Load an input image and generate a caption for it using the `generate_caption` function: + +```py +from diffusers.utils import load_image + +img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" +raw_image = load_image(img_url).resize((768, 768)) +caption = generate_caption(raw_image, model, processor) +``` + +
+<!-- figure: generated caption: "a photograph of a bowl of fruit on a table" -->
+ +Now you can drop the caption into the [`~StableDiffusionDiffEditPipeline.invert`] function to generate the partially inverted latents! + + ## Tips * The pipeline can generate masks that can be fed into other inpainting pipelines. diff --git a/docs/source/en/api/pipelines/kandinsky_v22.md b/docs/source/en/api/pipelines/kandinsky_v22.md index e68c094e23f0..229fde9e7082 100644 --- a/docs/source/en/api/pipelines/kandinsky_v22.md +++ b/docs/source/en/api/pipelines/kandinsky_v22.md @@ -29,6 +29,694 @@ Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) +## Text-to-image + +To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt `""`. For better results, you can pass an actual `negative_prompt` to the prior pipeline, but this'll increase the effective batch size of the prior pipeline by 2x. + + + + +```py +from diffusers import KandinskyPriorPipeline, KandinskyPipeline +import torch + +prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda") +pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda") + +prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" +negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better +image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple() +``` + +Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate an image: + +```py +image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0] +image +``` + +
+For Kandinsky 2.2, load the [`KandinskyV22PriorPipeline`] and [`KandinskyV22Pipeline`] instead:
+ + +```py +from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline +import torch + +prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda") +pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda") + +prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" +negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better +image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple() +``` + +Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipeline`] to generate an image: + +```py +image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0] +image +``` + +
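+
+The Kandinsky 2.2 snippet above defines a `negative_prompt` but never passes it to the prior, so the prior falls back to the empty negative prompt `""`. As mentioned earlier, you can condition the prior on the actual negative prompt for potentially better results, at the cost of doubling the prior's effective batch size. A sketch of the changed prior call:
+
+```py
+image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt=negative_prompt, guidance_scale=1.0).to_tuple()
+```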
+ +🤗 Diffusers also provides an end-to-end API with the [`KandinskyCombinedPipeline`] and [`KandinskyV22CombinedPipeline`], meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want. + +Use the [`AutoPipelineForText2Image`] to automatically call the combined pipelines under the hood: + + + + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) +pipeline.enable_model_cpu_offload() + +prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" +negative_prompt = "low quality, bad quality" + +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0] +image +``` + + + + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16) +pipeline.enable_model_cpu_offload() + +prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" +negative_prompt = "low quality, bad quality" + +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0] +image +``` + + + + +## Image-to-image + +For image-to-image, pass the initial image and text prompt to condition the image to the pipeline. Start by loading the prior pipeline: + + + + +```py +import torch +from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline + +prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +``` + + + + +```py +import torch +from diffusers import KandinskyV22Img2ImgPipeline, KandinskyPriorPipeline + +prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +``` + + + + +Download an image to condition on: + +```py +from diffusers.utils import load_image + +# download image +url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" +original_image = load_image(url) +original_image = original_image.resize((768, 512)) +``` + +
+ +
+ +Generate the `image_embeds` and `negative_image_embeds` with the prior pipeline: + +```py +prompt = "A fantasy landscape, Cinematic lighting" +negative_prompt = "low quality, bad quality" + +image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple() +``` + +Now pass the original image, and all the prompts and embeddings to the pipeline to generate an image: + + + + +```py +from diffusers.utils import make_image_grid + +image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) +``` + +
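+
+The `strength` parameter controls how much noise is added to the initial image: values closer to 1.0 let the prompt override more of the original content, while lower values stay closer to the input. For example, a sketch of the same call with a higher strength:
+
+```py
+image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.8).images[0]
+image
+```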
+For Kandinsky 2.2, pass only the image and the embeddings, since the decoder doesn't take a text prompt:
+ + +```py +from diffusers.utils import make_image_grid + +image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) +``` + +
+ +🤗 Diffusers also provides an end-to-end API with the [`KandinskyImg2ImgCombinedPipeline`] and [`KandinskyV22Img2ImgCombinedPipeline`], meaning you don't have to separately load the prior and image-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want. + +Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipelines under the hood: + + + + +```py +from diffusers import AutoPipelineForImage2Image +from diffusers.utils import make_image_grid, load_image +import torch + +pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True) +pipeline.enable_model_cpu_offload() + +prompt = "A fantasy landscape, Cinematic lighting" +negative_prompt = "low quality, bad quality" + +url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" +original_image = load_image(url) + +original_image.thumbnail((768, 768)) + +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) +``` + + + + +```py +from diffusers import AutoPipelineForImage2Image +from diffusers.utils import make_image_grid, load_image +import torch + +pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16) +pipeline.enable_model_cpu_offload() + +prompt = "A fantasy landscape, Cinematic lighting" +negative_prompt = "low quality, bad quality" + +url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" +original_image = load_image(url) + +original_image.thumbnail((768, 768)) + +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) +``` + + + + +## Inpainting + + + +⚠️ The Kandinsky models use ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels: + +```py +# For PIL input +import PIL.ImageOps +mask = PIL.ImageOps.invert(mask) + +# For PyTorch and NumPy input +mask = 1 - mask +``` + + + +For inpainting, you'll need the original image, a mask of the area to replace in the original image, and a text prompt of what to inpaint. 
Load the prior pipeline: + + + + +```py +from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline +from diffusers.utils import load_image, make_image_grid +import torch +import numpy as np +from PIL import Image + +prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +``` + + + + +```py +from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline +from diffusers.utils import load_image, make_image_grid +import torch +import numpy as np +from PIL import Image + +prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +``` + + + + +Load an initial image and create a mask: + +```py +init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +mask = np.zeros((768, 768), dtype=np.float32) +# mask area above cat's head +mask[:250, 250:-250] = 1 +``` + +Generate the embeddings with the prior pipeline: + +```py +prompt = "a hat" +prior_output = prior_pipeline(prompt) +``` + +Now pass the initial image, mask, and prompt and embeddings to the pipeline to generate an image: + + + + +```py +output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) +``` + +
+For Kandinsky 2.2, drop the text prompt and pass the image, mask, and embeddings:
+ + +```py +output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) +``` + +
+ +You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`KandinskyV22InpaintCombinedPipeline`] to call the prior and decoder pipelines together under the hood. Use the [`AutoPipelineForInpainting`] for this: + + + + +```py +import torch +import numpy as np +from PIL import Image +from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image, make_image_grid + +pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16) +pipe.enable_model_cpu_offload() + +init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +mask = np.zeros((768, 768), dtype=np.float32) +# mask area above cat's head +mask[:250, 250:-250] = 1 +prompt = "a hat" + +output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) +``` + + + + +```py +import torch +import numpy as np +from PIL import Image +from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image, make_image_grid + +pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16) +pipe.enable_model_cpu_offload() + +init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +mask = np.zeros((768, 768), dtype=np.float32) +# mask area above cat's head +mask[:250, 250:-250] = 1 +prompt = "a hat" + +output_image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) +``` + + + + +## Interpolation + +Interpolation allows you to explore the latent space between the image and text embeddings which is a cool way to see some of the prior model's intermediate outputs. Load the prior pipeline and two images you'd like to interpolate: + + + + +```py +from diffusers import KandinskyPriorPipeline, KandinskyPipeline +from diffusers.utils import load_image, make_image_grid +import torch + +prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") +make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) +``` + + + + +```py +from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline +from diffusers.utils import load_image, make_image_grid +import torch + +prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") +make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) +``` + + + + +
+<!-- figures: "a cat" and Van Gogh's Starry Night painting -->
+ +Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation! + +```py +images_texts = ["a cat", img_1, img_2] +weights = [0.3, 0.3, 0.4] +``` + +Call the `interpolate` function to generate the embeddings, and then pass them to the pipeline to generate the image: + + + + +```py +# prompt can be left empty +prompt = "" +prior_out = prior_pipeline.interpolate(images_texts, weights) + +pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda") + +image = pipeline(prompt, **prior_out, height=768, width=768).images[0] +image +``` + +
+The same interpolation works for Kandinsky 2.2 with the [`KandinskyV22Pipeline`]:
+
+```py
+prior_out = prior_pipeline.interpolate(images_texts, weights)
+
+pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+
+# the Kandinsky 2.2 decoder doesn't take a text prompt, only the image embeddings
+image = pipeline(**prior_out, height=768, width=768).images[0]
+image
+```
+
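+
+You don't have to include a text prompt in the interpolation at all. For example, a sketch that blends just the two images with equal weights, reusing the Kandinsky 2.2 prior and decoder pipelines loaded above:
+
+```py
+prior_out = prior_pipeline.interpolate([img_1, img_2], [0.5, 0.5])
+image = pipeline(**prior_out, height=768, width=768).images[0]
+image
+```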
+ +## ControlNet + + + +⚠️ ControlNet is only supported for Kandinsky 2.2! + + + +ControlNet enables conditioning large pretrained diffusion models with additional inputs such as a depth map or edge detection. For example, you can condition Kandinsky 2.2 with a depth map so the model understands and preserves the structure of the depth image. + +Let's load an image and extract it's depth map: + +```py +from diffusers.utils import load_image + +img = load_image( + "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" +).resize((768, 768)) +img +``` + +
+ +
+ +Then you can use the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to process the image and retrieve the depth map: + +```py +import torch +import numpy as np + +from transformers import pipeline + +def make_hint(image, depth_estimator): + image = depth_estimator(image)["depth"] + image = np.array(image) + image = image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + detected_map = torch.from_numpy(image).float() / 255.0 + hint = detected_map.permute(2, 0, 1) + return hint + +depth_estimator = pipeline("depth-estimation") +hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") +``` + +### Text-to-image [[controlnet-text-to-image]] + +Load the prior pipeline and the [`KandinskyV22ControlnetPipeline`]: + +```py +from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline + +prior_pipeline = KandinskyV22PriorPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True +).to("cuda") + +pipeline = KandinskyV22ControlnetPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16 +).to("cuda") +``` + +Generate the image embeddings from a prompt and negative prompt: + +```py +prompt = "A robot, 4k photo" +negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" + +generator = torch.Generator(device="cuda").manual_seed(43) + +image_emb, zero_image_emb = prior_pipeline( + prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator +).to_tuple() +``` + +Finally, pass the image embeddings and the depth image to the [`KandinskyV22ControlnetPipeline`] to generate an image: + +```py +image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0] +image +``` + +
+ +
+ +### Image-to-image [[controlnet-image-to-image]] + +For image-to-image with ControlNet, you'll need to use the: + +- [`KandinskyV22PriorEmb2EmbPipeline`] to generate the image embeddings from a text prompt and an image +- [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings + +Process and extract a depth map of an initial image of a cat with the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers: + +```py +import torch +import numpy as np + +from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline +from diffusers.utils import load_image +from transformers import pipeline + +img = load_image( + "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" +).resize((768, 768)) + +def make_hint(image, depth_estimator): + image = depth_estimator(image)["depth"] + image = np.array(image) + image = image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + detected_map = torch.from_numpy(image).float() / 255.0 + hint = detected_map.permute(2, 0, 1) + return hint + +depth_estimator = pipeline("depth-estimation") +hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") +``` + +Load the prior pipeline and the [`KandinskyV22ControlnetImg2ImgPipeline`]: + +```py +prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True +).to("cuda") + +pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16 +).to("cuda") +``` + +Pass a text prompt and the initial image to the prior pipeline to generate the image embeddings: + +```py +prompt = "A robot, 4k photo" +negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" + +generator = torch.Generator(device="cuda").manual_seed(43) + +img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator) +negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator) +``` + +Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings: + +```py +image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0] +make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) +``` + +
+ +
+ +## Optimizations + +Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference. + +1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0: + +```diff + from diffusers import DiffusionPipeline + import torch + + pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) ++ pipe.enable_xformers_memory_efficient_attention() +``` + +2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA): + +```diff + pipe.unet.to(memory_format=torch.channels_last) ++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) +``` + +This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]: + +```py +from diffusers.models.attention_processor import AttnAddedKVProcessor2_0 + +pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0()) +``` + +3. Offload the model to the CPU with [`~KandinskyPriorPipeline.enable_model_cpu_offload`] to avoid out-of-memory errors: + +```diff + from diffusers import DiffusionPipeline + import torch + + pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) ++ pipe.enable_model_cpu_offload() +``` + +4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality: + +```py +from diffusers import DDPMScheduler +from diffusers import DiffusionPipeline + +scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler") +pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda") +``` + + ## KandinskyV22PriorPipeline [[autodoc]] KandinskyV22PriorPipeline diff --git a/docs/source/en/api/pipelines/latent_consistency_models.md b/docs/source/en/api/pipelines/latent_consistency_models.md index 54e81fbe2519..da4e01e0d314 100644 --- a/docs/source/en/api/pipelines/latent_consistency_models.md +++ b/docs/source/en/api/pipelines/latent_consistency_models.md @@ -26,6 +26,613 @@ A demo for the [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/L The pipelines were contributed by [luosiallen](https://luosiallen.github.io/), [nagolinc](https://github.com/nagolinc), and [dg845](https://github.com/dg845). +## Text-to-image + + + + +To use LCMs, you need to load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps. + +A couple of notes to keep in mind when using LCMs are: + +* Typically, batch size is doubled inside the pipeline for classifier-free guidance. But LCM applies guidance with guidance embeddings and doesn't need to double the batch size, which leads to faster inference. The downside is that negative prompts don't work with LCM because they don't have any effect on the denoising process. +* The ideal range for `guidance_scale` is [3., 13.] because that is what the UNet was trained with. 
However, disabling `guidance_scale` with a value of 1.0 is also effective in most cases. + +```python +from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler +import torch + +unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16", +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" +generator = torch.manual_seed(0) +image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 +).images[0] +image +``` + +
+ + +To use LCM-LoRAs, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps. + +A couple of notes to keep in mind when using LCM-LoRAs are: + +* Typically, batch size is doubled inside the pipeline for classifier-free guidance. But LCM applies guidance with guidance embeddings and doesn't need to double the batch size, which leads to faster inference. The downside is that negative prompts don't work with LCM because they don't have any effect on the denoising process. +* You could use guidance with LCM-LoRAs, but it is very sensitive to high `guidance_scale` values and can lead to artifacts in the generated image. The best values we've found are between [1.0, 2.0]. +* Replace [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0) with any finetuned model. For example, try using the [animagine-xl](https://huggingface.co/Linaqruf/animagine-xl) checkpoint to generate anime images with SDXL. + +```py +import torch +from diffusers import DiffusionPipeline, LCMScheduler + +pipe = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + variant="fp16", + torch_dtype=torch.float16 +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) +pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl") + +prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" +generator = torch.manual_seed(42) +image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0 +).images[0] +image +``` + +
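+
+As noted in the list above, the base checkpoint can be swapped for a finetuned SDXL model. For example, a sketch using the [animagine-xl](https://huggingface.co/Linaqruf/animagine-xl) checkpoint (assuming it is available in the diffusers format) to generate anime-style images; only the `from_pretrained` call changes:
+
+```py
+import torch
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipe = DiffusionPipeline.from_pretrained("Linaqruf/animagine-xl", torch_dtype=torch.float16).to("cuda")
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
+
+prompt = "a girl in a school uniform under cherry blossoms, anime style, highly detailed"
+generator = torch.manual_seed(0)
+image = pipe(prompt, num_inference_steps=4, guidance_scale=1.0, generator=generator).images[0]
+image
+```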
+ +## Image-to-image + + + + +To use LCMs for image-to-image, you need to load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps. + +> [!TIP] +> Experiment with different values for `num_inference_steps`, `strength`, and `guidance_scale` to get the best results. + +```python +import torch +from diffusers import AutoPipelineForImage2Image, UNet2DConditionModel, LCMScheduler +from diffusers.utils import load_image + +unet = UNet2DConditionModel.from_pretrained( + "SimianLuo/LCM_Dreamshaper_v7", + subfolder="unet", + torch_dtype=torch.float16, +) + +pipe = AutoPipelineForImage2Image.from_pretrained( + "Lykon/dreamshaper-7", + unet=unet, + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png") +prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k" +generator = torch.manual_seed(0) +image = pipe( + prompt, + image=init_image, + num_inference_steps=4, + guidance_scale=7.5, + strength=0.5, + generator=generator +).images[0] +image +``` + +
+<!-- figure: initial image (left) and generated image (right) -->
+ + +To use LCM-LoRAs for image-to-image, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps. + +> [!TIP] +> Experiment with different values for `num_inference_steps`, `strength`, and `guidance_scale` to get the best results. + +```py +import torch +from diffusers import AutoPipelineForImage2Image, LCMScheduler +from diffusers.utils import make_image_grid, load_image + +pipe = AutoPipelineForImage2Image.from_pretrained( + "Lykon/dreamshaper-7", + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") + +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") + +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png") +prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k" + +generator = torch.manual_seed(0) +image = pipe( + prompt, + image=init_image, + num_inference_steps=4, + guidance_scale=1, + strength=0.6, + generator=generator +).images[0] +image +``` + +
+<!-- figure: initial image (left) and generated image (right) -->
+ +## Inpainting + +To use LCM-LoRAs for inpainting, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt, initial image, and mask image to generate an image in just 4 steps. + +```py +import torch +from diffusers import AutoPipelineForInpainting, LCMScheduler +from diffusers.utils import load_image, make_image_grid + +pipe = AutoPipelineForInpainting.from_pretrained( + "runwayml/stable-diffusion-inpainting", + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") + +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") + +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +generator = torch.manual_seed(0) +image = pipe( + prompt=prompt, + image=init_image, + mask_image=mask_image, + generator=generator, + num_inference_steps=4, + guidance_scale=4, +).images[0] +image +``` + +
+<!-- figure: initial image (left) and generated image (right) -->
+ +## Adapters + +LCMs are compatible with adapters like LoRA, ControlNet, T2I-Adapter, and AnimateDiff. You can bring the speed of LCMs to these adapters to generate images in a certain style or condition the model on another input like a canny image. + +### LoRA + +[LoRA](../using-diffusers/loading_adapters#lora) adapters can be rapidly finetuned to learn a new style from just a few images and plugged into a pretrained model to generate images in that style. + + + + +Load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LoRA weights into the LCM and generate a styled image in a few steps. + +```python +from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler +import torch + +unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16", +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) +pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut") + +prompt = "papercut, a cute fox" +generator = torch.manual_seed(0) +image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 +).images[0] +image +``` + +
+ + +Replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights and the style LoRA you want to use. Combine both LoRA adapters with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method and generate a styled image in a few steps. + +```py +import torch +from diffusers import DiffusionPipeline, LCMScheduler + +pipe = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + variant="fp16", + torch_dtype=torch.float16 +).to("cuda") + +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm") +pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut") + +pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8]) + +prompt = "papercut, a cute fox" +generator = torch.manual_seed(0) +image = pipe(prompt, num_inference_steps=4, guidance_scale=1, generator=generator).images[0] +image +``` + +
+ +### ControlNet + +[ControlNet](./controlnet) are adapters that can be trained on a variety of inputs like canny edge, pose estimation, or depth. The ControlNet can be inserted into the pipeline to provide additional conditioning and control to the model for more accurate generation. + +You can find additional ControlNet models trained on other inputs in [lllyasviel's](https://hf.co/lllyasviel) repository. + + + + +Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`]. Then you can load a LCM model into [`StableDiffusionControlNetPipeline`] and replace the scheduler with the [`LCMScheduler`]. Now pass the canny image to the pipeline and generate an image. + +> [!TIP] +> Experiment with different values for `num_inference_steps`, `controlnet_conditioning_scale`, `cross_attention_kwargs`, and `guidance_scale` to get the best results. + +```python +import torch +import cv2 +import numpy as np +from PIL import Image + +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler +from diffusers.utils import load_image, make_image_grid + +image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" +).resize((512, 512)) + +image = np.array(image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) + +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) +pipe = StableDiffusionControlNetPipeline.from_pretrained( + "SimianLuo/LCM_Dreamshaper_v7", + controlnet=controlnet, + torch_dtype=torch.float16, + safety_checker=None, +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +generator = torch.manual_seed(0) +image = pipe( + "the mona lisa", + image=canny_image, + num_inference_steps=4, + generator=generator, +).images[0] +make_image_grid([canny_image, image], rows=1, cols=2) +``` + +
+ + +Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`]. Then you can load a Stable Diffusion v1.5 model into [`StableDiffusionControlNetPipeline`] and replace the scheduler with the [`LCMScheduler`]. Use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights, and pass the canny image to the pipeline and generate an image. + +> [!TIP] +> Experiment with different values for `num_inference_steps`, `controlnet_conditioning_scale`, `cross_attention_kwargs`, and `guidance_scale` to get the best results. + +```py +import torch +import cv2 +import numpy as np +from PIL import Image + +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler +from diffusers.utils import load_image + +image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" +).resize((512, 512)) + +image = np.array(image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) + +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) +pipe = StableDiffusionControlNetPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + controlnet=controlnet, + torch_dtype=torch.float16, + safety_checker=None, + variant="fp16" +).to("cuda") + +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") + +generator = torch.manual_seed(0) +image = pipe( + "the mona lisa", + image=canny_image, + num_inference_steps=4, + guidance_scale=1.5, + controlnet_conditioning_scale=0.8, + cross_attention_kwargs={"scale": 1}, + generator=generator, +).images[0] +image +``` + +
+ +### T2I-Adapter + +[T2I-Adapter](./t2i_adapter) is an even more lightweight adapter than ControlNet, that provides an additional input to condition a pretrained model with. It is faster than ControlNet but the results may be slightly worse. + +You can find additional T2I-Adapter checkpoints trained on other inputs in [TencentArc's](https://hf.co/TencentARC) repository. + + + + +Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`]. Then load a LCM checkpoint into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Now pass the canny image to the pipeline and generate an image. + +```python +import torch +import cv2 +import numpy as np +from PIL import Image + +from diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler +from diffusers.utils import load_image, make_image_grid + +# detect the canny map in low resolution to avoid high-frequency details +image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" +).resize((384, 384)) + +image = np.array(image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image).resize((1024, 1216)) + +adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda") + +unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLAdapterPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + unet=unet, + adapter=adapter, + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") + +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +prompt = "the mona lisa, 4k picture, high quality" +negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" + +generator = torch.manual_seed(0) +image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + image=canny_image, + num_inference_steps=4, + guidance_scale=5, + adapter_conditioning_scale=0.8, + adapter_conditioning_factor=1, + generator=generator, +).images[0] +``` + +
+ + +Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`]. Replace the scheduler with the [`LCMScheduler`], and use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights. Pass the canny image to the pipeline and generate an image. + +```py +import torch +import cv2 +import numpy as np +from PIL import Image + +from diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler +from diffusers.utils import load_image, make_image_grid + +# detect the canny map in low resolution to avoid high-frequency details +image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" +).resize((384, 384)) + +image = np.array(image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image).resize((1024, 1024)) + +adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda") + +pipe = StableDiffusionXLAdapterPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + adapter=adapter, + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") + +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl") + +prompt = "the mona lisa, 4k picture, high quality" +negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" + +generator = torch.manual_seed(0) +image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + image=canny_image, + num_inference_steps=4, + guidance_scale=1.5, + adapter_conditioning_scale=0.8, + adapter_conditioning_factor=1, + generator=generator, +).images[0] +``` + +
+ +
+ +
+
+ +### AnimateDiff + +[AnimateDiff](../api/pipelines/animatediff) is an adapter that adds motion to an image. It can be used with most Stable Diffusion models, effectively turning them into "video generation" models. Generating good results with a video model usually requires generating multiple frames (16-24), which can be very slow with a regular Stable Diffusion model. LCM-LoRA can speed up this process by only taking 4-8 steps for each frame. + +Load a [`AnimateDiffPipeline`] and pass a [`MotionAdapter`] to it. Then replace the scheduler with the [`LCMScheduler`], and combine both LoRA adapters with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method. Now you can pass a prompt to the pipeline and generate an animated image. + +```py +import torch +from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler, LCMScheduler +from diffusers.utils import export_to_gif + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5") +pipe = AnimateDiffPipeline.from_pretrained( + "frankjoshua/toonyou_beta6", + motion_adapter=adapter, +).to("cuda") + +# set scheduler +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +# load LCM-LoRA +pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm") +pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora") + +pipe.set_adapters(["lcm", "motion-lora"], adapter_weights=[0.55, 1.2]) + +prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress" +generator = torch.manual_seed(0) +frames = pipe( + prompt=prompt, + num_inference_steps=5, + guidance_scale=1.25, + cross_attention_kwargs={"scale": 1}, + num_frames=24, + generator=generator +).frames[0] +export_to_gif(frames, "animation.gif") +``` + +
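+Generating 24 frames in one call can be memory-hungry. If the pipeline above does not fit on your GPU, the standard Diffusers memory helpers are a reasonable first thing to try; this is an optional sketch, not part of the original example.
+
+```py
+# Optional memory savings for the AnimateDiff + LCM-LoRA example above
+pipe.enable_vae_slicing()        # decode the generated frames in slices instead of all at once
+pipe.enable_model_cpu_offload()  # keep submodules on the CPU until they are actually needed
+```
+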
+ +
+ ## LatentConsistencyModelPipeline diff --git a/docs/source/en/api/pipelines/marigold.md b/docs/source/en/api/pipelines/marigold.md index e9ca0df067ba..3b1658fe1a9f 100644 --- a/docs/source/en/api/pipelines/marigold.md +++ b/docs/source/en/api/pipelines/marigold.md @@ -104,6 +104,554 @@ This ensures high-quality predictions when invoking the pipeline with only the ` See also Marigold [usage examples](../../using-diffusers/marigold_usage). +## Depth Prediction + +To get a depth prediction, load the `prs-eth/marigold-depth-v1-1` checkpoint into [`MarigoldDepthPipeline`], +put the image through the pipeline, and save the predictions: + +```python +import diffusers +import torch + +pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 +).to("cuda") + +image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +depth = pipe(image) + +vis = pipe.image_processor.visualize_depth(depth.prediction) +vis[0].save("einstein_depth.png") + +depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction) +depth_16bit[0].save("einstein_depth_16bit.png") +``` + +The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth`] function applies one of +[matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]` +depth range into an RGB image. +With the `Spectral` colormap, pixels with near depth are painted red, and far pixels are blue. +The 16-bit PNG file stores the single channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`. +Below are the raw and the visualized predictions. The darker and closer areas (mustache) are easier to distinguish in +the visualization. + +
+
+ +
+ Predicted depth (16-bit PNG) +
+
+
+ +
+ Predicted depth visualization (Spectral) +
+
+
+ +## Surface Normals Estimation + +Load the `prs-eth/marigold-normals-v1-1` checkpoint into [`MarigoldNormalsPipeline`], put the image through the +pipeline, and save the predictions: + +```python +import diffusers +import torch + +pipe = diffusers.MarigoldNormalsPipeline.from_pretrained( + "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16 +).to("cuda") + +image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +normals = pipe(image) + +vis = pipe.image_processor.visualize_normals(normals.prediction) +vis[0].save("einstein_normals.png") +``` + +The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals`] maps the three-dimensional +prediction with pixel values in the range `[-1, 1]` into an RGB image. +The visualization function supports flipping surface normals axes to make the visualization compatible with other +choices of the frame of reference. +Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where `X` axis +points right, `Y` axis points up, and `Z` axis points at the viewer. +Below is the visualized prediction: + +
+
+ +
+ Predicted surface normals visualization +
+
+
+ +In this example, the nose tip almost certainly has a point on the surface, in which the surface normal vector points +straight at the viewer, meaning that its coordinates are `[0, 0, 1]`. +This vector maps to the RGB `[128, 128, 255]`, which corresponds to the violet-blue color. +Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the +red hue. +Points on the shoulders pointing up with a large `Y` promote green color. + +## Intrinsic Image Decomposition + +Marigold provides two models for Intrinsic Image Decomposition (IID): "Appearance" and "Lighting". +Each model produces Albedo maps, derived from InteriorVerse and Hypersim annotations, respectively. + +- The "Appearance" model also estimates Material properties: Roughness and Metallicity. +- The "Lighting" model generates Diffuse Shading and Non-diffuse Residual. + +Here is the sample code saving predictions made by the "Appearance" model: + +```python +import diffusers +import torch + +pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( + "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16 +).to("cuda") + +image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +intrinsics = pipe(image) + +vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) +vis[0]["albedo"].save("einstein_albedo.png") +vis[0]["roughness"].save("einstein_roughness.png") +vis[0]["metallicity"].save("einstein_metallicity.png") +``` + +Another example demonstrating the predictions made by the "Lighting" model: + +```python +import diffusers +import torch + +pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( + "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16 +).to("cuda") + +image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +intrinsics = pipe(image) + +vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) +vis[0]["albedo"].save("einstein_albedo.png") +vis[0]["shading"].save("einstein_shading.png") +vis[0]["residual"].save("einstein_residual.png") +``` + +Both models share the same pipeline while supporting different decomposition types. +The exact decomposition parameterization (e.g., sRGB vs. linear space) is stored in the +`pipe.target_properties` dictionary, which is passed into the +[`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics`] function. + +Below are some examples showcasing the predicted decomposition outputs. +All modalities can be inspected in the +[Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) Space. + +
+
+ +
+ Predicted albedo ("Appearance" model) +
+
+
+ +
+ Predicted diffuse shading ("Lighting" model) +
+
+
+ +## Speeding up inference + +The above quick start snippets are already optimized for quality and speed, loading the checkpoint, utilizing the +`fp16` variant of weights and computation, and performing the default number (4) of denoising diffusion steps. +The first step to accelerate inference, at the expense of prediction quality, is to reduce the denoising diffusion +steps to the minimum: + +```diff + import diffusers + import torch + + pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 + ).to("cuda") + + image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +- depth = pipe(image) ++ depth = pipe(image, num_inference_steps=1) +``` + +With this change, the `pipe` call completes in 280ms on RTX 3090 GPU. +Internally, the input image is first encoded using the Stable Diffusion VAE encoder, followed by a single denoising +step performed by the U-Net. +Finally, the prediction latent is decoded with the VAE decoder into pixel space. +In this setup, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM. +Since Marigold's latent space is compatible with Stable Diffusion 2.0, inference can be accelerated by more than 3x, +reducing the call time to 85ms on an RTX 3090, by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny). +Note that using a lightweight VAE may slightly reduce the visual quality of the predictions. + +```diff + import diffusers + import torch + + pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 + ).to("cuda") + ++ pipe.vae = diffusers.AutoencoderTiny.from_pretrained( ++ "madebyollin/taesd", torch_dtype=torch.float16 ++ ).cuda() + + image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + + depth = pipe(image, num_inference_steps=1) +``` + +So far, we have optimized the number of diffusion steps and model components. Self-attention operations account for a +significant portion of computations. +Speeding them up can be achieved by using a more efficient attention processor: + +```diff + import diffusers + import torch ++ from diffusers.models.attention_processor import AttnProcessor2_0 + + pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 + ).to("cuda") + ++ pipe.vae.set_attn_processor(AttnProcessor2_0()) ++ pipe.unet.set_attn_processor(AttnProcessor2_0()) + + image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + + depth = pipe(image, num_inference_steps=1) +``` + +Finally, as suggested in [Optimizations](../optimization/fp16#torchcompile), enabling `torch.compile` can further enhance performance depending on +the target hardware. +However, compilation incurs a significant overhead during the first pipeline invocation, making it beneficial only when +the same pipeline instance is called repeatedly, such as within a loop. 
+ +```diff + import diffusers + import torch + from diffusers.models.attention_processor import AttnProcessor2_0 + + pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 + ).to("cuda") + + pipe.vae.set_attn_processor(AttnProcessor2_0()) + pipe.unet.set_attn_processor(AttnProcessor2_0()) + ++ pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True) ++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) + + image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + + depth = pipe(image, num_inference_steps=1) +``` + +## Maximizing Precision and Ensembling + +Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents. +This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion. +The ensembling path is activated automatically when the `ensemble_size` argument is set greater or equal than `3`. +When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`. +The recommended values vary across checkpoints but primarily depend on the scheduler type. +The effect of ensembling is particularly well-seen with surface normals: + +```diff + import diffusers + + pipe = diffusers.MarigoldNormalsPipeline.from_pretrained("prs-eth/marigold-normals-v1-1").to("cuda") + + image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +- depth = pipe(image) ++ depth = pipe(image, num_inference_steps=10, ensemble_size=5) + + vis = pipe.image_processor.visualize_normals(depth.prediction) + vis[0].save("einstein_normals.png") +``` + +
+
+ +
+ Surface normals, no ensembling +
+
+
+ +
+ Surface normals, with ensembling +
+
+
+
+
+As can be seen, all areas with fine-grained structures, such as hair, receive more conservative and, on average, more
+correct predictions.
+Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.
+
+## Frame-by-frame Video Processing with Temporal Consistency
+
+Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent
+initialization.
+This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the
+following videos:
+
+
+ +
Input video
+
+
+ +
Marigold Depth applied to input video frames independently
+
+
+ +To address this issue, it is possible to pass `latents` argument to the pipelines, which defines the starting point of +diffusion. +Empirically, we found that a convex combination of the very same starting point noise latent and the latent +corresponding to the previous frame prediction give sufficiently smooth results, as implemented in the snippet below: + +```python +import imageio +import diffusers +import torch +from diffusers.models.attention_processor import AttnProcessor2_0 +from PIL import Image +from tqdm import tqdm + +device = "cuda" +path_in = "https://huggingface.co/spaces/prs-eth/marigold-lcm/resolve/c7adb5427947d2680944f898cd91d386bf0d4924/files/video/obama.mp4" +path_out = "obama_depth.gif" + +pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 +).to(device) +pipe.vae = diffusers.AutoencoderTiny.from_pretrained( + "madebyollin/taesd", torch_dtype=torch.float16 +).to(device) +pipe.unet.set_attn_processor(AttnProcessor2_0()) +pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True) +pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) +pipe.set_progress_bar_config(disable=True) + +with imageio.get_reader(path_in) as reader: + size = reader.get_meta_data()['size'] + last_frame_latent = None + latent_common = torch.randn( + (1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size))) + ).to(device=device, dtype=torch.float16) + + out = [] + for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"): + frame = Image.fromarray(frame) + latents = latent_common + if last_frame_latent is not None: + latents = 0.9 * latents + 0.1 * last_frame_latent + + depth = pipe( + frame, + num_inference_steps=1, + match_input_resolution=False, + latents=latents, + output_latent=True, + ) + last_frame_latent = depth.latent + out.append(pipe.image_processor.visualize_depth(depth.prediction)[0]) + + diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()['fps']) +``` + +Here, the diffusion process starts from the given computed latent. +The pipeline sets `output_latent=True` to access `out.latent` and computes its contribution to the next frame's latent +initialization. +The result is much more stable now: + +
+
+ +
Marigold Depth applied to input video frames independently
+
+
+ +
Marigold Depth with forced latents initialization
+
+
+ +## Marigold for ControlNet + +A very common application for depth prediction with diffusion models comes in conjunction with ControlNet. +Depth crispness plays a crucial role in obtaining high-quality results from ControlNet. +As seen in comparisons with other methods above, Marigold excels at that task. +The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format: + +```python +import torch +import diffusers + +device = "cuda" +generator = torch.Generator(device=device).manual_seed(2024) +image = diffusers.utils.load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png" +) + +pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", torch_dtype=torch.float16, variant="fp16" +).to(device) + +depth_image = pipe(image, generator=generator).prediction +depth_image = pipe.image_processor.visualize_depth(depth_image, color_map="binary") +depth_image[0].save("motorcycle_controlnet_depth.png") + +controlnet = diffusers.ControlNetModel.from_pretrained( + "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16, variant="fp16" +).to(device) +pipe = diffusers.StableDiffusionXLControlNetPipeline.from_pretrained( + "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnet +).to(device) +pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True) + +controlnet_out = pipe( + prompt="high quality photo of a sports bike, city", + negative_prompt="", + guidance_scale=6.5, + num_inference_steps=25, + image=depth_image, + controlnet_conditioning_scale=0.7, + control_guidance_end=0.7, + generator=generator, +).images +controlnet_out[0].save("motorcycle_controlnet_out.png") +``` + +
+
+ +
+ Input image +
+
+
+ +
+ Depth in the format compatible with ControlNet +
+
+
+ +
+ ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city" +
+
+
+ +## Quantitative Evaluation + +To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), +follow the evaluation protocol outlined in the paper: load the full precision fp32 model and use appropriate values +for `num_inference_steps` and `ensemble_size`. +Optionally seed randomness to ensure reproducibility. +Maximizing `batch_size` will deliver maximum device utilization. + +```python +import diffusers +import torch + +device = "cuda" +seed = 2024 + +generator = torch.Generator(device=device).manual_seed(seed) +pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device) + +image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +depth = pipe( + image, + num_inference_steps=4, # set according to the evaluation protocol from the paper + ensemble_size=10, # set according to the evaluation protocol from the paper + generator=generator, +) + +# evaluate metrics +``` + +## Using Predictive Uncertainty + +The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random +latents. +As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify `ensemble_size` greater +or equal than 3 and set `output_uncertainty=True`. +The resulting uncertainty will be available in the `uncertainty` field of the output. +It can be visualized as follows: + +```python +import diffusers +import torch + +pipe = diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 +).to("cuda") + +image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +depth = pipe( + image, + ensemble_size=10, # any number >= 3 + output_uncertainty=True, +) + +uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty) +uncertainty[0].save("einstein_depth_uncertainty.png") +``` + +
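+The same arguments also work for the surface normals pipeline. As a sketch (assuming the `prs-eth/marigold-normals-v1-1` checkpoint introduced earlier on this page):
+
+```python
+import diffusers
+import torch
+
+pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
+    "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16
+).to("cuda")
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+normals = pipe(
+    image,
+    ensemble_size=10,  # any number >= 3
+    output_uncertainty=True,
+)
+
+uncertainty = pipe.image_processor.visualize_uncertainty(normals.uncertainty)
+uncertainty[0].save("einstein_normals_uncertainty.png")
+```
+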
+
+ +
+ Depth uncertainty +
+
+
+ +
+ Surface normals uncertainty +
+
+
+ +
+ Albedo uncertainty +
+
+
+ +The interpretation of uncertainty is easy: higher values (white) correspond to pixels, where the model struggles to +make consistent predictions. +- The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly. +- The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the +collar area. +- Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel, +unlike depth and surface normals. It is also higher in shaded regions and at discontinuities. + +## Conclusion + +We hope Marigold proves valuable for your downstream tasks, whether as part of a broader generative workflow or for +perception-based applications like 3D reconstruction. + ## Marigold Depth Prediction API [[autodoc]] MarigoldDepthPipeline diff --git a/docs/source/en/api/pipelines/omnigen.md b/docs/source/en/api/pipelines/omnigen.md index 074e7b8f0115..f228e8347303 100644 --- a/docs/source/en/api/pipelines/omnigen.md +++ b/docs/source/en/api/pipelines/omnigen.md @@ -73,6 +73,305 @@ image = pipe( image.save("output.png") ``` +## Load model checkpoints + +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method. + +```python +import torch +from diffusers import OmniGenPipeline + +pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16) +``` + +## Text-to-image + +For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image. +You can try setting the `height` and `width` parameters to generate images with different size. + +```python +import torch +from diffusers import OmniGenPipeline + +pipe = OmniGenPipeline.from_pretrained( + "Shitao/OmniGen-v1-diffusers", + torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD." +image = pipe( + prompt=prompt, + height=1024, + width=1024, + guidance_scale=3, + generator=torch.Generator(device="cpu").manual_seed(111), +).images[0] +image.save("output.png") +``` + +
+ generated image +
+ +## Image edit + +OmniGen supports multimodal inputs. +When the input includes an image, you need to add a placeholder `<|image_1|>` in the text prompt to represent the image. +It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image. + +```python +import torch +from diffusers import OmniGenPipeline +from diffusers.utils import load_image + +pipe = OmniGenPipeline.from_pretrained( + "Shitao/OmniGen-v1-diffusers", + torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +prompt="<|image_1|> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola." +input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")] +image = pipe( + prompt=prompt, + input_images=input_images, + guidance_scale=2, + img_guidance_scale=1.6, + use_input_image_size_as_output=True, + generator=torch.Generator(device="cpu").manual_seed(222) +).images[0] +image.save("output.png") +``` + +
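+Editing with input images costs noticeably more memory than plain text-to-image. If you run out of memory, the knobs below are worth trying first; they are discussed in more detail in the optimization section at the end of this page. The snippet is a sketch that reuses `prompt` and `input_images` from the example above, and the specific values are only illustrative.
+
+```python
+# Optional memory savings when passing input images
+pipe.enable_model_cpu_offload()  # or pipe.enable_sequential_cpu_offload() for even lower memory
+
+image = pipe(
+    prompt=prompt,
+    input_images=input_images,
+    max_input_image_size=512,  # downscale image inputs; see the memory table later on this page
+    guidance_scale=2,
+    img_guidance_scale=1.6,
+    use_input_image_size_as_output=True,
+    generator=torch.Generator(device="cpu").manual_seed(222),
+).images[0]
+```
+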
+
+ +
original image
+
+
+ +
edited image
+
+
+ +OmniGen has some interesting features, such as visual reasoning, as shown in the example below. + +```python +prompt="If the woman is thirsty, what should she take? Find it in the image and highlight it in blue. <|image_1|>" +input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")] +image = pipe( + prompt=prompt, + input_images=input_images, + guidance_scale=2, + img_guidance_scale=1.6, + use_input_image_size_as_output=True, + generator=torch.Generator(device="cpu").manual_seed(0) +).images[0] +image.save("output.png") +``` + +
+ generated image +
+ +## Controllable generation + +OmniGen can handle several classic computer vision tasks. As shown below, OmniGen can detect human skeletons in input images, which can be used as control conditions to generate new images. + +```python +import torch +from diffusers import OmniGenPipeline +from diffusers.utils import load_image + +pipe = OmniGenPipeline.from_pretrained( + "Shitao/OmniGen-v1-diffusers", + torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +prompt="Detect the skeleton of human in this image: <|image_1|>" +input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")] +image1 = pipe( + prompt=prompt, + input_images=input_images, + guidance_scale=2, + img_guidance_scale=1.6, + use_input_image_size_as_output=True, + generator=torch.Generator(device="cpu").manual_seed(333) +).images[0] +image1.save("image1.png") + +prompt="Generate a new photo using the following picture and text as conditions: <|image_1|>\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him." +input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png")] +image2 = pipe( + prompt=prompt, + input_images=input_images, + guidance_scale=2, + img_guidance_scale=1.6, + use_input_image_size_as_output=True, + generator=torch.Generator(device="cpu").manual_seed(333) +).images[0] +image2.save("image2.png") +``` + +
+
+ +
original image
+
+
+ +
detected skeleton
+
+
+ +
skeleton to image
+
+
+ + +OmniGen can also directly use relevant information from input images to generate new images. + +```python +import torch +from diffusers import OmniGenPipeline +from diffusers.utils import load_image + +pipe = OmniGenPipeline.from_pretrained( + "Shitao/OmniGen-v1-diffusers", + torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +prompt="Following the pose of this image <|image_1|>, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him." +input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")] +image = pipe( + prompt=prompt, + input_images=input_images, + guidance_scale=2, + img_guidance_scale=1.6, + use_input_image_size_as_output=True, + generator=torch.Generator(device="cpu").manual_seed(0) +).images[0] +image.save("output.png") +``` + +
+
+ +
generated image
+
+
+ +## ID and object preserving + +OmniGen can generate multiple images based on the people and objects in the input image and supports inputting multiple images simultaneously. +Additionally, OmniGen can extract desired objects from an image containing multiple objects based on instructions. + +```python +import torch +from diffusers import OmniGenPipeline +from diffusers.utils import load_image + +pipe = OmniGenPipeline.from_pretrained( + "Shitao/OmniGen-v1-diffusers", + torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +prompt="A man and a woman are sitting at a classroom desk. The man is the man with yellow hair in <|image_1|>. The woman is the woman on the left of <|image_2|>" +input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png") +input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.png") +input_images=[input_image_1, input_image_2] +image = pipe( + prompt=prompt, + input_images=input_images, + height=1024, + width=1024, + guidance_scale=2.5, + img_guidance_scale=1.6, + generator=torch.Generator(device="cpu").manual_seed(666) +).images[0] +image.save("output.png") +``` + +
+
+ +
input_image_1
+
+
+ +
input_image_2
+
+
+ +
generated image
+
+
+ +```py +import torch +from diffusers import OmniGenPipeline +from diffusers.utils import load_image + +pipe = OmniGenPipeline.from_pretrained( + "Shitao/OmniGen-v1-diffusers", + torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +prompt="A woman is walking down the street, wearing a white long-sleeve blouse with lace details on the sleeves, paired with a blue pleated skirt. The woman is <|image_1|>. The long-sleeve blouse and a pleated skirt are <|image_2|>." +input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg") +input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg") +input_images=[input_image_1, input_image_2] +image = pipe( + prompt=prompt, + input_images=input_images, + height=1024, + width=1024, + guidance_scale=2.5, + img_guidance_scale=1.6, + generator=torch.Generator(device="cpu").manual_seed(666) +).images[0] +image.save("output.png") +``` + +
+
+ +
person image
+
+
+ +
clothe image
+
+
+ +
generated image
+
+
+ +## Optimization when using multiple images + +For text-to-image task, OmniGen requires minimal memory and time costs (9GB memory and 31s for a 1024x1024 image on A800 GPU). +However, when using input images, the computational cost increases. + +Here are some guidelines to help you reduce computational costs when using multiple images. The experiments are conducted on an A800 GPU with two input images. + +Like other pipelines, you can reduce memory usage by offloading the model: `pipe.enable_model_cpu_offload()` or `pipe.enable_sequential_cpu_offload() `. +In OmniGen, you can also decrease computational overhead by reducing the `max_input_image_size`. +The memory consumption for different image sizes is shown in the table below: + +| Method | Memory Usage | +|---------------------------|--------------| +| max_input_image_size=1024 | 40GB | +| max_input_image_size=512 | 17GB | +| max_input_image_size=256 | 14GB | + + + ## OmniGenPipeline [[autodoc]] OmniGenPipeline diff --git a/docs/source/en/api/pipelines/pag.md b/docs/source/en/api/pipelines/pag.md index 7b87e58a87e2..863172d815a6 100644 --- a/docs/source/en/api/pipelines/pag.md +++ b/docs/source/en/api/pipelines/pag.md @@ -37,6 +37,340 @@ Since RegEx is supported as a way for matching layer identifiers, it is crucial +## General tasks + +You can apply PAG to the [`StableDiffusionXLPipeline`] for tasks such as text-to-image, image-to-image, and inpainting. To enable PAG for a specific task, load the pipeline using the [AutoPipeline](../api/pipelines/auto_pipeline) API with the `enable_pag=True` flag and the `pag_applied_layers` argument. + +> [!TIP] +> 🤗 Diffusers currently only supports using PAG with selected SDXL pipelines and [`PixArtSigmaPAGPipeline`]. But feel free to open a [feature request](https://github.com/huggingface/diffusers/issues/new/choose) if you want to add PAG support to a new pipeline! + + + + +```py +from diffusers import AutoPipelineForText2Image +from diffusers.utils import load_image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + enable_pag=True, + pag_applied_layers=["mid"], + torch_dtype=torch.float16 +) +pipeline.enable_model_cpu_offload() +``` + +> [!TIP] +> The `pag_applied_layers` argument allows you to specify which layers PAG is applied to. Additionally, you can use `set_pag_applied_layers` method to update these layers after the pipeline has been created. Check out the [pag_applied_layers](#pag_applied_layers) section to learn more about applying PAG to other layers. + +If you already have a pipeline created and loaded, you can enable PAG on it using the `from_pipe` API with the `enable_pag` flag. Internally, a PAG pipeline is created based on the pipeline and task you specified. In the example below, since we used `AutoPipelineForText2Image` and passed a `StableDiffusionXLPipeline`, a `StableDiffusionXLPAGPipeline` is created accordingly. Note that this does not require additional memory, and you will have both `StableDiffusionXLPipeline` and `StableDiffusionXLPAGPipeline` loaded and ready to use. You can read more about the `from_pipe` API and how to reuse pipelines in diffuser [here](https://huggingface.co/docs/diffusers/using-diffusers/loading#reuse-a-pipeline). 
+ +```py +pipeline_sdxl = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16) +pipeline = AutoPipelineForText2Image.from_pipe(pipeline_sdxl, enable_pag=True) +``` + +To generate an image, you will also need to pass a `pag_scale`. When `pag_scale` increases, images gain more semantically coherent structures and exhibit fewer artifacts. However overly large guidance scale can lead to smoother textures and slight saturation in the images, similarly to CFG. `pag_scale=3.0` is used in the official demo and works well in most of the use cases, but feel free to experiment and select the appropriate value according to your needs! PAG is disabled when `pag_scale=0`. + +```py +prompt = "an insect robot preparing a delicious meal, anime style" + +for pag_scale in [0.0, 3.0]: + generator = torch.Generator(device="cpu").manual_seed(0) + images = pipeline( + prompt=prompt, + num_inference_steps=25, + guidance_scale=7.0, + generator=generator, + pag_scale=pag_scale, + ).images +``` + +
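+Note that the loop above overwrites `images` on every iteration. A small variation that keeps both outputs so they can be compared side by side (a sketch using the same settings) could look like this:
+
+```py
+from diffusers.utils import make_image_grid
+
+results = []
+for pag_scale in [0.0, 3.0]:
+    generator = torch.Generator(device="cpu").manual_seed(0)
+    image = pipeline(
+        prompt=prompt,
+        num_inference_steps=25,
+        guidance_scale=7.0,
+        generator=generator,
+        pag_scale=pag_scale,
+    ).images[0]
+    results.append(image)
+
+make_image_grid(results, rows=1, cols=2)
+```
+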
+
+ +
generated image without PAG
+
+
+ +
generated image with PAG
+
+
+ +
+
+
+
+You can use PAG with image-to-image pipelines.
+
+```py
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    enable_pag=True,
+    pag_applied_layers=["mid"],
+    torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+```
+
+If you already have an image-to-image pipeline and would like to enable PAG on it, you can run this:
+
+```py
+pipeline_i2i = AutoPipelineForImage2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_i2i, enable_pag=True)
+```
+
+It is also easy to directly switch from a text-to-image pipeline to a PAG-enabled image-to-image pipeline:
+
+```py
+pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_t2i, enable_pag=True)
+```
+
+If you have a PAG-enabled text-to-image pipeline, you can directly switch to an image-to-image pipeline with PAG still enabled:
+
+```py
+pipeline_pag = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", enable_pag=True, torch_dtype=torch.float16)
+pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_pag)
+```
+
+Now let's generate an image!
+
+```py
+pag_scale = 4.0
+guidance_scale = 7.0
+
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
+init_image = load_image(url)
+prompt = "a dog catching a frisbee in the jungle"
+
+generator = torch.Generator(device="cpu").manual_seed(0)
+image = pipeline(
+    prompt,
+    image=init_image,
+    strength=0.8,
+    guidance_scale=guidance_scale,
+    pag_scale=pag_scale,
+    generator=generator).images[0]
+```
+
+
+
+You can use PAG with inpainting pipelines too.
+
+```py
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    enable_pag=True,
+    torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+```
+
+You can enable PAG on an existing inpainting pipeline like this:
+
+```py
+pipeline_inpaint = AutoPipelineForInpainting.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+pipeline = AutoPipelineForInpainting.from_pipe(pipeline_inpaint, enable_pag=True)
+```
+
+This still works when your pipeline has a different task:
+
+```py
+pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+pipeline = AutoPipelineForInpainting.from_pipe(pipeline_t2i, enable_pag=True)
+```
+
+Let's generate an image!
+
+
+```py
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+init_image = load_image(img_url).convert("RGB")
+mask_image = load_image(mask_url).convert("RGB")
+
+prompt = "A majestic tiger sitting on a bench"
+
+pag_scale = 3.0
+guidance_scale = 7.5
+
+generator = torch.Generator(device="cpu").manual_seed(1)
+images = pipeline(
+    prompt=prompt,
+    image=init_image,
+    mask_image=mask_image,
+    strength=0.8,
+    num_inference_steps=50,
+    guidance_scale=guidance_scale,
+    generator=generator,
+    pag_scale=pag_scale,
+).images
+images[0]
+```
+
+
+
+## PAG with ControlNet
+
+To use PAG with ControlNet, first create a `controlnet`. Then, pass the `controlnet` and other PAG arguments to the `from_pretrained` method of the AutoPipeline for the specified task.
+
+```py
+from diffusers import AutoPipelineForText2Image, ControlNetModel
+import torch
+
+controlnet = ControlNetModel.from_pretrained(
+    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
+)
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    controlnet=controlnet,
+    enable_pag=True,
+    pag_applied_layers="mid",
+    torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+```
+
+If you already have a ControlNet pipeline and want to enable PAG, you can use the `from_pipe` API: `AutoPipelineForText2Image.from_pipe(pipeline_controlnet, enable_pag=True)`
+
+You can use the pipeline in the same way you normally use ControlNet pipelines, with the added option to specify a `pag_scale` parameter. Note that PAG works well for unconditional generation. In this example, we will generate an image without a prompt.
+
+```py
+from diffusers.utils import load_image
+
+canny_image = load_image(
+    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/pag_control_input.png"
+)
+
+controlnet_conditioning_scale = 0.5  # conditioning strength; adjust to taste
+
+for pag_scale in [0.0, 3.0]:
+    generator = torch.Generator(device="cpu").manual_seed(1)
+    images = pipeline(
+        prompt="",
+        controlnet_conditioning_scale=controlnet_conditioning_scale,
+        image=canny_image,
+        num_inference_steps=50,
+        guidance_scale=0,
+        generator=generator,
+        pag_scale=pag_scale,
+    ).images
+    images[0]
+```
+
+
+ +
generated image without PAG
+
+
+ +
generated image with PAG
+
+
+
+
+## PAG with IP-Adapter
+
+[IP-Adapter](https://hf.co/papers/2308.06721) is a popular model that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. You can enable PAG on a pipeline with IP-Adapter loaded.
+
+```py
+from diffusers import AutoPipelineForText2Image
+from diffusers.utils import load_image
+from transformers import CLIPVisionModelWithProjection
+import torch
+
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+    "h94/IP-Adapter",
+    subfolder="models/image_encoder",
+    torch_dtype=torch.float16
+)
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    image_encoder=image_encoder,
+    enable_pag=True,
+    torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")
+
+pag_scale = 5.0
+ip_adapter_scale = 0.8
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
+
+pipeline.set_ip_adapter_scale(ip_adapter_scale)
+generator = torch.Generator(device="cpu").manual_seed(0)
+images = pipeline(
+    prompt="a polar bear sitting in a chair drinking a milkshake",
+    ip_adapter_image=image,
+    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
+    num_inference_steps=25,
+    guidance_scale=3.0,
+    generator=generator,
+    pag_scale=pag_scale,
+).images
+images[0]
+```
+
+PAG reduces artifacts and improves the overall composition.
+
+
+ +
generated image without PAG
+
+
+ +
generated image with PAG
+
+
+
+
+
+## Configure parameters
+
+### pag_applied_layers
+
+The `pag_applied_layers` argument allows you to specify which layers PAG is applied to. By default, it applies only to the mid blocks. Changing this setting will significantly impact the output. You can use the `set_pag_applied_layers` method to adjust the PAG layers after the pipeline is created, helping you find the optimal layers for your model.
+
+As an example, here are the images generated with `pag_layers = ["down.block_2"]` and `pag_layers = ["down.block_2", "up.block_1.attentions_0"]`.
+
+```py
+pag_layers = ["down.block_2", "up.block_1.attentions_0"]  # or ["down.block_2"]
+prompt = "an insect robot preparing a delicious meal, anime style"
+pipeline.set_pag_applied_layers(pag_layers)
+generator = torch.Generator(device="cpu").manual_seed(0)
+images = pipeline(
+    prompt=prompt,
+    num_inference_steps=25,
+    guidance_scale=7.0,
+    generator=generator,
+    pag_scale=3.0,
+).images
+images[0]
+```
+
+
+ +
down.block_2 + up.block1.attentions_0
+
+
+ +
down.block_2
+
+
+
+
 ## AnimateDiffPAGPipeline
 
 [[autodoc]] AnimateDiffPAGPipeline
 	- all
diff --git a/docs/source/en/api/pipelines/shap_e.md b/docs/source/en/api/pipelines/shap_e.md
index 5e5af0656a63..1f5500083456 100644
--- a/docs/source/en/api/pipelines/shap_e.md
+++ b/docs/source/en/api/pipelines/shap_e.md
@@ -23,6 +23,170 @@ See the [reuse components across pipelines](../../using-diffusers/loading#reuse-
 
 
 
+## Text-to-3D
+
+To generate a gif of a 3D object, pass a text prompt to the [`ShapEPipeline`]. The pipeline generates a list of image frames which are used to create the 3D object.
+
+```py
+import torch
+from diffusers import ShapEPipeline
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
+pipe = pipe.to(device)
+
+guidance_scale = 15.0
+prompt = ["A firecracker", "A birthday cupcake"]
+
+images = pipe(
+    prompt,
+    guidance_scale=guidance_scale,
+    num_inference_steps=64,
+    frame_size=256,
+).images
+```
+
+Next, use the [`~utils.export_to_gif`] function to turn the list of image frames into a gif of the 3D object.
+
+```py
+from diffusers.utils import export_to_gif
+
+export_to_gif(images[0], "firecracker_3d.gif")
+export_to_gif(images[1], "cake_3d.gif")
+```
+
+
+ +
prompt = "A firecracker"
+
+
+ +
prompt = "A birthday cupcake"
+
+
+ +## Image-to-3D + +To generate a 3D object from another image, use the [`ShapEImg2ImgPipeline`]. You can use an existing image or generate an entirely new one. Let's use the [Kandinsky 2.1](../api/pipelines/kandinsky) model to generate a new image. + +```py +from diffusers import DiffusionPipeline +import torch + +prior_pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda") + +prompt = "A cheeseburger, white background" + +image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple() +image = pipeline( + prompt, + image_embeds=image_embeds, + negative_image_embeds=negative_image_embeds, +).images[0] + +image.save("burger.png") +``` + +Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D representation of it. + +```py +from PIL import Image +from diffusers import ShapEImg2ImgPipeline +from diffusers.utils import export_to_gif + +pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda") + +guidance_scale = 3.0 +image = Image.open("burger.png").resize((256, 256)) + +images = pipe( + image, + guidance_scale=guidance_scale, + num_inference_steps=64, + frame_size=256, +).images + +gif_path = export_to_gif(images[0], "burger_3d.gif") +``` + +
+
+ +
cheeseburger
+
+
+ +
3D cheeseburger
+
+
+ +## Generate mesh + +Shap-E is a flexible model that can also generate textured mesh outputs to be rendered for downstream applications. In this example, you'll convert the output into a `glb` file because the 🤗 Datasets library supports mesh visualization of `glb` files which can be rendered by the [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer#dataset-preview). + +You can generate mesh outputs for both the [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`] by specifying the `output_type` parameter as `"mesh"`: + +```py +import torch +from diffusers import ShapEPipeline + +device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + +pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16") +pipe = pipe.to(device) + +guidance_scale = 15.0 +prompt = "A birthday cupcake" + +images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images +``` + +Use the [`~utils.export_to_ply`] function to save the mesh output as a `ply` file: + + + +You can optionally save the mesh output as an `obj` file with the [`~utils.export_to_obj`] function. The ability to save the mesh output in a variety of formats makes it more flexible for downstream usage! + + + +```py +from diffusers.utils import export_to_ply + +ply_path = export_to_ply(images[0], "3d_cake.ply") +print(f"Saved to folder: {ply_path}") +``` + +Then you can convert the `ply` file to a `glb` file with the trimesh library: + +```py +import trimesh + +mesh = trimesh.load("3d_cake.ply") +mesh_export = mesh.export("3d_cake.glb", file_type="glb") +``` + +By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform: + +```py +import trimesh +import numpy as np + +mesh = trimesh.load("3d_cake.ply") +rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0]) +mesh = mesh.apply_transform(rot) +mesh_export = mesh.export("3d_cake.glb", file_type="glb") +``` + +Upload the mesh file to your dataset repository to visualize it with the Dataset viewer! + +
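+The upload itself can be done with the `huggingface_hub` client. In the sketch below, `repo_id` is a placeholder that you would replace with your own dataset repository:
+
+```py
+from huggingface_hub import upload_file
+
+upload_file(
+    path_or_fileobj="3d_cake.glb",
+    path_in_repo="3d_cake.glb",
+    repo_id="your-username/your-3d-dataset",  # placeholder, replace with your own dataset repo
+    repo_type="dataset",
+)
+```
+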
+ +
+ + ## ShapEPipeline [[autodoc]] ShapEPipeline - all diff --git a/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md b/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md index aac3a3d870c8..f0e822a39db1 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md +++ b/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md @@ -18,6 +18,97 @@ The abstract from the paper is: *We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs,Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.* +## Load model checkpoints + +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16") +pipeline = pipeline.to("cuda") +``` + +You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally. For this loading method, you need to set `timestep_spacing="trailing"` (feel free to experiment with the other scheduler config values to get better results): + +```py +from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler +import torch + +pipeline = StableDiffusionXLPipeline.from_single_file( + "https://huggingface.co/stabilityai/sdxl-turbo/blob/main/sd_xl_turbo_1.0_fp16.safetensors", + torch_dtype=torch.float16, variant="fp16") +pipeline = pipeline.to("cuda") +pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config, timestep_spacing="trailing") +``` + +## Text-to-image + +For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. You can try setting the `height` and `width` parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so. + +Make sure to set `guidance_scale` to 0.0 to disable, as the model was trained without it. A single inference step is enough to generate high quality images. +Increasing the number of steps to 2, 3 or 4 should improve image quality. + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16") +pipeline_text2image = pipeline_text2image.to("cuda") + +prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe." + +image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0] +image +``` + +
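+Because SDXL Turbo is all about single-step generation, it can be instructive to time the call. This is only a rough sketch; the measured time depends entirely on your hardware.
+
+```py
+import time
+
+start = time.perf_counter()
+image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
+print(f"Generated in {time.perf_counter() - start:.2f}s")
+image.save("sdxl_turbo_racoon.png")
+```
+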
+ generated image of a racoon in a robe +
+ +## Image-to-image + +For image-to-image generation, make sure that `num_inference_steps * strength` is larger or equal to 1. +The image-to-image pipeline will run for `int(num_inference_steps * strength)` steps, e.g. `0.5 * 2.0 = 1` step in +our example below. + +```py +from diffusers import AutoPipelineForImage2Image +from diffusers.utils import load_image, make_image_grid + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline_image2image = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") + +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png") +init_image = init_image.resize((512, 512)) + +prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k" + +image = pipeline_image2image(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2).images[0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ Image-to-image generation sample using SDXL Turbo +
+ +## Speed-up SDXL Turbo even more + +- Compile the UNet if you are using PyTorch version 2.0 or higher. The first inference run will be very slow, but subsequent ones will be much faster. + +```py +pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) +``` + +- When using the default VAE, keep it in `float32` to avoid costly `dtype` conversions before and after each generation. You only need to do this one before your first generation: + +```py +pipe.upcast_vae() +``` + +As an alternative, you can also use a [16-bit VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) created by community member [`@madebyollin`](https://huggingface.co/madebyollin) that does not need to be upcasted to `float32`. + ## Tips - SDXL Turbo uses the exact same architecture as [SDXL](./stable_diffusion_xl), which means it also has the same API. Please refer to the [SDXL](./stable_diffusion_xl) API reference for more details. diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index 30e43790663d..1ecd18bbbcce 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -23,6 +23,420 @@ The abstract from the paper is: *We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. 
We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* +## Load model checkpoints + +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: + +```py +from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" +).to("cuda") +``` + +You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: + +```py +from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", + torch_dtype=torch.float16 +).to("cuda") + +refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16 +).to("cuda") +``` + +## Text-to-image + +For text-to-image, pass a text prompt. By default, SDXL generates a 1024x1024 image for the best results. You can try setting the `height` and `width` parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work. + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline_text2image = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline_text2image(prompt=prompt).images[0] +image +``` + +
+ generated image of an astronaut in a jungle +
+ +## Image-to-image + +For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: + +```py +from diffusers import AutoPipelineForImage2Image +from diffusers.utils import load_image, make_image_grid + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") + +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +init_image = load_image(url) +prompt = "a dog catching a frisbee in the jungle" +image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ generated image of a dog catching a frisbee in a jungle +
+ +## Inpainting + +For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. + +```py +from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image, make_image_grid + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") + +img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" + +init_image = load_image(img_url) +mask_image = load_image(mask_url) + +prompt = "A deep sea diver floating" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +
+ generated image of a deep sea diver in a jungle +
+ +## Refine image quality + +SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: + +1. use the base and refiner models together to produce a refined image +2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained) + +### Base + refiner model + +When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. + +As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: + +```py +from diffusers import DiffusionPipeline +import torch + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") +``` + +To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter. + + + +The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff. + + + +Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image. + +```py +prompt = "A majestic lion jumping from a big stone at night" + +image = base( + prompt=prompt, + num_inference_steps=40, + denoising_end=0.8, + output_type="latent", +).images +image = refiner( + prompt=prompt, + num_inference_steps=40, + denoising_start=0.8, + image=image, +).images[0] +image +``` + +
+
+ generated image of a lion on a rock at night +
default base model
+
+
+ generated image of a lion on a rock at night in higher quality +
ensemble of expert denoisers
+
+
+ +The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]: + +```py +from diffusers import StableDiffusionXLInpaintPipeline +from diffusers.utils import load_image, make_image_grid +import torch + +base = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url) +mask_image = load_image(mask_url) + +prompt = "A majestic tiger sitting on a bench" +num_inference_steps = 75 +high_noise_frac = 0.7 + +image = base( + prompt=prompt, + image=init_image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_end=high_noise_frac, + output_type="latent", +).images +image = refiner( + prompt=prompt, + image=image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_start=high_noise_frac, +).images[0] +make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3) +``` + +This ensemble of expert denoisers method works well for all available schedulers! + +### Base to refiner model + +SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting. + +Load the base and refiner models: + +```py +from diffusers import DiffusionPipeline +import torch + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") +``` + + + +You can use SDXL refiner with a different base model. For example, you can use the [Hunyuan-DiT](../../api/pipelines/hunyuandit) or [PixArt-Sigma](../../api/pipelines/pixart_sigma) pipelines to generate images with better prompt adherence. Once you have generated an image, you can pass it to the SDXL refiner model to enhance final generation quality. + + + +Generate an image from the base model, and set the model output to **latent** space: + +```py +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +image = base(prompt=prompt, output_type="latent").images[0] +``` + +Pass the generated image to the refiner model: + +```py +image = refiner(prompt=prompt, image=image[None, :]).images[0] +``` + +
+
+ generated image of an astronaut riding a green horse on Mars +
base model
+
+
+ higher quality generated image of an astronaut riding a green horse on Mars +
base model + refiner model
+
+
+ +For inpainting, load the base and the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. + +## Micro-conditioning + +SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images. + + + +You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. + + + +### Size conditioning + +There are two types of size conditioning: + +- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset. + +- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options! + +🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_target_size=(1024, 1024), +).images[0] +``` + +
+ +
Images negatively conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).
+
+ +### Crop conditioning + +Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🤗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions! + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0] +image +``` + +
+ generated image of an astronaut in a jungle, slightly cropped +
+ +You can also specify negative cropping coordinates to steer generation away from certain cropping parameters: + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_crops_coords_top_left=(0, 0), + negative_target_size=(1024, 1024), +).images[0] +image +``` + +## Use a different prompt for each text-encoder + +SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts): + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +# prompt is passed to OAI CLIP-ViT/L-14 +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +# prompt_2 is passed to OpenCLIP-ViT/bigG-14 +prompt_2 = "Van Gogh painting" +image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] +image +``` + +
+ generated image of an astronaut in a jungle in the style of a van gogh painting +
+ +The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section. + +## Optimizations + +SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference. + +1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors: + +```diff +- base.to("cuda") +- refiner.to("cuda") ++ base.enable_model_cpu_offload() ++ refiner.enable_model_cpu_offload() +``` + +2. Use `torch.compile` for ~20% speed-up (you need `torch>=2.0`): + +```diff ++ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) ++ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) +``` + +3. Enable [xFormers](../optimization/xformers) to run SDXL if `torch<2.0`: + +```diff ++ base.enable_xformers_memory_efficient_attention() ++ refiner.enable_xformers_memory_efficient_attention() +``` + ## Tips - Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers: diff --git a/docs/source/en/api/pipelines/stable_diffusion/svd.md b/docs/source/en/api/pipelines/stable_diffusion/svd.md index ab51f9b66398..3e50c338c024 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/svd.md +++ b/docs/source/en/api/pipelines/stable_diffusion/svd.md @@ -28,6 +28,102 @@ Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organizatio +The are two variants of this model, [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further finetuned to generate 25 frames. + +You'll use the SVD-XT checkpoint for this guide. + +```python +import torch + +from diffusers import StableVideoDiffusionPipeline +from diffusers.utils import load_image, export_to_video + +pipe = StableVideoDiffusionPipeline.from_pretrained( + "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16" +) +pipe.enable_model_cpu_offload() + +# Load the conditioning image +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png") +image = image.resize((1024, 576)) + +generator = torch.manual_seed(42) +frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0] + +export_to_video(frames, "generated.mp4", fps=7) +``` + +
+
+ +
"source image of a rocket"
+
+
+ +
"generated video from source image"
+
+
+ +## torch.compile + +You can gain a 20-25% speedup at the expense of slightly increased memory by [compiling](../optimization/fp16#torchcompile) the UNet. + +```diff +- pipe.enable_model_cpu_offload() ++ pipe.to("cuda") ++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) +``` + +## Reduce memory usage + +Video generation is very memory intensive because you're essentially generating `num_frames` all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade-off inference speed for lower memory requirement: + +- enable model offloading: each component of the pipeline is offloaded to the CPU once it's not needed anymore. +- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size. +- reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory (we recommend adjusting this value based on your GPU memory) but the video might have some flickering. + +```diff +- pipe.enable_model_cpu_offload() +- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0] ++ pipe.enable_model_cpu_offload() ++ pipe.unet.enable_forward_chunking() ++ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0] +``` + +Using all these tricks together should lower the memory requirement to less than 8GB VRAM. + +## Micro-conditioning + +Stable Diffusion Video also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video: + +- `fps`: the frames per second of the generated video. +- `motion_bucket_id`: the motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id increases the motion of the generated video. +- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the values the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video. + +For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters: + +```python +import torch + +from diffusers import StableVideoDiffusionPipeline +from diffusers.utils import load_image, export_to_video + +pipe = StableVideoDiffusionPipeline.from_pretrained( + "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16" +) +pipe.enable_model_cpu_offload() + +# Load the conditioning image +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png") +image = image.resize((1024, 576)) + +generator = torch.manual_seed(42) +frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0] +export_to_video(frames, "generated.mp4", fps=7) +``` + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket_with_conditions.gif) + ## Tips Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient. 
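Putting the model offloading, feed-forward chunking, and reduced `decode_chunk_size` options from the sections above together, a combined low-memory run could look like the sketch below. The exact savings depend on your GPU, and `decode_chunk_size=1` may introduce some flickering.

```python
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
# 1. offload pipeline components to the CPU when they aren't in use
pipe.enable_model_cpu_offload()
# 2. run the feed-forward layers in a loop instead of one large batch
pipe.unet.enable_forward_chunking()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
# 3. decode one frame at a time to keep VAE memory low
frames = pipe(image, decode_chunk_size=1, generator=generator, num_frames=25).frames[0]
export_to_video(frames, "generated_low_memory.mp4", fps=7)
```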
diff --git a/docs/source/en/api/schedulers/tcd.md b/docs/source/en/api/schedulers/tcd.md index 951bc2705592..f25da6d68996 100644 --- a/docs/source/en/api/schedulers/tcd.md +++ b/docs/source/en/api/schedulers/tcd.md @@ -20,6 +20,397 @@ The abstract from the paper is: The original codebase can be found at [jabir-zheng/TCD](https://github.com/jabir-zheng/TCD). +## General tasks + +In this guide, let's use the [`StableDiffusionXLPipeline`] and the [`TCDScheduler`]. Use the [`~StableDiffusionPipeline.load_lora_weights`] method to load the SDXL-compatible TCD-LoRA weights. + +A few tips to keep in mind for TCD-LoRA inference are to: + +- Keep the `num_inference_steps` between 4 and 50 +- Set `eta` (used to control stochasticity at each step) between 0 and 1. You should use a higher `eta` when increasing the number of inference steps, but the downside is that a larger `eta` in [`TCDScheduler`] leads to blurrier images. A value of 0.3 is recommended to produce good results. + + + + +```python +import torch +from diffusers import StableDiffusionXLPipeline, TCDScheduler + +device = "cuda" +base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" + +pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id) +pipe.fuse_lora() + +prompt = "Painting of the orange cat Otto von Garfield, Count of Bismarck-Schönhausen, Duke of Lauenburg, Minister-President of Prussia. Depicted wearing a Prussian Pickelhaube and eating his favorite meal - lasagna." + +image = pipe( + prompt=prompt, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + generator=torch.Generator(device=device).manual_seed(0), +).images[0] +``` + +![](https://github.com/jabir-zheng/TCD/raw/main/assets/demo_image.png) + + + + + +```python +import torch +from diffusers import AutoPipelineForInpainting, TCDScheduler +from diffusers.utils import load_image, make_image_grid + +device = "cuda" +base_model_id = "diffusers/stable-diffusion-xl-1.0-inpainting-0.1" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" + +pipe = AutoPipelineForInpainting.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id) +pipe.fuse_lora() + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url).resize((1024, 1024)) +mask_image = load_image(mask_url).resize((1024, 1024)) + +prompt = "a tiger sitting on a park bench" + +image = pipe( + prompt=prompt, + image=init_image, + mask_image=mask_image, + num_inference_steps=8, + guidance_scale=0, + eta=0.3, + strength=0.99, # make sure to use `strength` below 1.0 + generator=torch.Generator(device=device).manual_seed(0), +).images[0] + +grid_image = make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +![](https://github.com/jabir-zheng/TCD/raw/main/assets/inpainting_tcd.png) + + + + + +## Community models + +TCD-LoRA also works with many community finetuned models and plugins. 
For example, load the [animagine-xl-3.0](https://huggingface.co/cagliostrolab/animagine-xl-3.0) checkpoint which is a community finetuned version of SDXL for generating anime images. + +```python +import torch +from diffusers import StableDiffusionXLPipeline, TCDScheduler + +device = "cuda" +base_model_id = "cagliostrolab/animagine-xl-3.0" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" + +pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id) +pipe.fuse_lora() + +prompt = "A man, clad in a meticulously tailored military uniform, stands with unwavering resolve. The uniform boasts intricate details, and his eyes gleam with determination. Strands of vibrant, windswept hair peek out from beneath the brim of his cap." + +image = pipe( + prompt=prompt, + num_inference_steps=8, + guidance_scale=0, + eta=0.3, + generator=torch.Generator(device=device).manual_seed(0), +).images[0] +``` + +![](https://github.com/jabir-zheng/TCD/raw/main/assets/animagine_xl.png) + +TCD-LoRA also supports other LoRAs trained on different styles. For example, let's load the [TheLastBen/Papercut_SDXL](https://huggingface.co/TheLastBen/Papercut_SDXL) LoRA and fuse it with the TCD-LoRA with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method. + +> [!TIP] +> Check out the [Merge LoRAs](merge_loras) guide to learn more about efficient merging methods. + +```python +import torch +from diffusers import StableDiffusionXLPipeline +from scheduling_tcd import TCDScheduler + +device = "cuda" +base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" +styled_lora_id = "TheLastBen/Papercut_SDXL" + +pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id, adapter_name="tcd") +pipe.load_lora_weights(styled_lora_id, adapter_name="style") +pipe.set_adapters(["tcd", "style"], adapter_weights=[1.0, 1.0]) + +prompt = "papercut of a winter mountain, snow" + +image = pipe( + prompt=prompt, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + generator=torch.Generator(device=device).manual_seed(0), +).images[0] +``` + +![](https://github.com/jabir-zheng/TCD/raw/main/assets/styled_lora.png) + + +## Adapters + +TCD-LoRA is very versatile, and it can be combined with other adapter types like ControlNets, IP-Adapter, and AnimateDiff. 
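Note that some snippets in this section import `TCDScheduler` from a local `scheduling_tcd` module, following the TCD reference repository. The scheduler is also available directly from Diffusers, so the setup can usually be written as in the sketch below (shown with the SDXL base checkpoint used earlier):

```python
import torch
from diffusers import StableDiffusionXLPipeline, TCDScheduler  # built-in scheduler, no local file needed

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights("h1t/TCD-SDXL-LoRA")
pipe.fuse_lora()
```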
+ + + + +### Depth ControlNet + +```python +import torch +import numpy as np +from PIL import Image +from transformers import DPTImageProcessor, DPTForDepthEstimation +from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline +from diffusers.utils import load_image, make_image_grid +from scheduling_tcd import TCDScheduler + +device = "cuda" +depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to(device) +feature_extractor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas") + +def get_depth_map(image): + image = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device) + with torch.no_grad(), torch.autocast(device): + depth_map = depth_estimator(image).predicted_depth + + depth_map = torch.nn.functional.interpolate( + depth_map.unsqueeze(1), + size=(1024, 1024), + mode="bicubic", + align_corners=False, + ) + depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True) + depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True) + depth_map = (depth_map - depth_min) / (depth_max - depth_min) + image = torch.cat([depth_map] * 3, dim=1) + + image = image.permute(0, 2, 3, 1).cpu().numpy()[0] + image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8)) + return image + +base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" +controlnet_id = "diffusers/controlnet-depth-sdxl-1.0" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" + +controlnet = ControlNetModel.from_pretrained( + controlnet_id, + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLControlNetPipeline.from_pretrained( + base_model_id, + controlnet=controlnet, + torch_dtype=torch.float16, + variant="fp16", +) +pipe.enable_model_cpu_offload() + +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id) +pipe.fuse_lora() + +prompt = "stormtrooper lecture, photorealistic" + +image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-depth/resolve/main/images/stormtrooper.png") +depth_image = get_depth_map(image) + +controlnet_conditioning_scale = 0.5 # recommended for good generalization + +image = pipe( + prompt, + image=depth_image, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + controlnet_conditioning_scale=controlnet_conditioning_scale, + generator=torch.Generator(device=device).manual_seed(0), +).images[0] + +grid_image = make_image_grid([depth_image, image], rows=1, cols=2) +``` + +![](https://github.com/jabir-zheng/TCD/raw/main/assets/controlnet_depth_tcd.png) + +### Canny ControlNet +```python +import torch +from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline +from diffusers.utils import load_image, make_image_grid +from scheduling_tcd import TCDScheduler + +device = "cuda" +base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" +controlnet_id = "diffusers/controlnet-canny-sdxl-1.0" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" + +controlnet = ControlNetModel.from_pretrained( + controlnet_id, + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLControlNetPipeline.from_pretrained( + base_model_id, + controlnet=controlnet, + torch_dtype=torch.float16, + variant="fp16", +) +pipe.enable_model_cpu_offload() + +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id) +pipe.fuse_lora() + +prompt = "ultrarealistic shot of a furry blue bird" + +canny_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png") + 
+controlnet_conditioning_scale = 0.5 # recommended for good generalization + +image = pipe( + prompt, + image=canny_image, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + controlnet_conditioning_scale=controlnet_conditioning_scale, + generator=torch.Generator(device=device).manual_seed(0), +).images[0] + +grid_image = make_image_grid([canny_image, image], rows=1, cols=2) +``` +![](https://github.com/jabir-zheng/TCD/raw/main/assets/controlnet_canny_tcd.png) + + +The inference parameters in this example might not work for all examples, so we recommend you to try different values for `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale` and `cross_attention_kwargs` parameters and choose the best one. + + + + + +This example shows how to use the TCD-LoRA with the [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter/tree/main) and SDXL. + +```python +import torch +from diffusers import StableDiffusionXLPipeline +from diffusers.utils import load_image, make_image_grid + +from ip_adapter import IPAdapterXL +from scheduling_tcd import TCDScheduler + +device = "cuda" +base_model_path = "stabilityai/stable-diffusion-xl-base-1.0" +image_encoder_path = "sdxl_models/image_encoder" +ip_ckpt = "sdxl_models/ip-adapter_sdxl.bin" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" + +pipe = StableDiffusionXLPipeline.from_pretrained( + base_model_path, + torch_dtype=torch.float16, + variant="fp16" +) +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id) +pipe.fuse_lora() + +ip_model = IPAdapterXL(pipe, image_encoder_path, ip_ckpt, device) + +ref_image = load_image("https://raw.githubusercontent.com/tencent-ailab/IP-Adapter/main/assets/images/woman.png").resize((512, 512)) + +prompt = "best quality, high quality, wearing sunglasses" + +image = ip_model.generate( + pil_image=ref_image, + prompt=prompt, + scale=0.5, + num_samples=1, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + seed=0, +)[0] + +grid_image = make_image_grid([ref_image, image], rows=1, cols=2) +``` + +![](https://github.com/jabir-zheng/TCD/raw/main/assets/ip_adapter.png) + + + + + + +[`AnimateDiff`] allows animating images using Stable Diffusion models. TCD-LoRA can substantially accelerate the process without degrading image quality. The quality of animation with TCD-LoRA and AnimateDiff has a more lucid outcome. 
+ +```python +import torch +from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler +from scheduling_tcd import TCDScheduler +from diffusers.utils import export_to_gif + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5") +pipe = AnimateDiffPipeline.from_pretrained( + "frankjoshua/toonyou_beta6", + motion_adapter=adapter, +).to("cuda") + +# set TCDScheduler +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +# load TCD LoRA +pipe.load_lora_weights("h1t/TCD-SD15-LoRA", adapter_name="tcd") +pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora") + +pipe.set_adapters(["tcd", "motion-lora"], adapter_weights=[1.0, 1.2]) + +prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress" +generator = torch.manual_seed(0) +frames = pipe( + prompt=prompt, + num_inference_steps=5, + guidance_scale=0, + cross_attention_kwargs={"scale": 1}, + num_frames=24, + eta=0.3, + generator=generator +).frames[0] +export_to_gif(frames, "animation.gif") +``` + +![](https://github.com/jabir-zheng/TCD/raw/main/assets/animation_example.gif) + + + + ## TCDScheduler [[autodoc]] TCDScheduler diff --git a/docs/source/en/using-diffusers/consisid.md b/docs/source/en/using-diffusers/consisid.md deleted file mode 100644 index b6b04ddaf57e..000000000000 --- a/docs/source/en/using-diffusers/consisid.md +++ /dev/null @@ -1,96 +0,0 @@ - -# ConsisID - -[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video by frequency decomposition. The main features of ConsisID are: - -- Frequency decomposition: The characteristics of the DiT architecture are analyzed from the frequency domain perspective, and based on these characteristics, a reasonable control information injection method is designed. -- Consistency training strategy: A coarse-to-fine training strategy, dynamic masking loss, and dynamic cross-face loss further enhance the model's generalization ability and identity preservation performance. -- Inference without finetuning: Previous methods required case-by-case finetuning of the input ID before inference, leading to significant time and computational costs. In contrast, ConsisID is tuning-free. - -This guide will walk you through using ConsisID for use cases. - -## Load Model Checkpoints - -Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method. 
- -```python -# !pip install consisid_eva_clip insightface facexlib -import torch -from diffusers import ConsisIDPipeline -from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer -from huggingface_hub import snapshot_download - -# Download ckpts -snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview") - -# Load face helper model to preprocess input face image -face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16) - -# Load consisid base model -pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16) -pipe.to("cuda") -``` - -## Identity-Preserving Text-to-Video - -For identity-preserving text-to-video, pass a text prompt and an image contain clear face (e.g., preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results. - -```python -from diffusers.utils import export_to_video - -prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel." -image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true" - -id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True) - -video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42)) -export_to_video(video.frames[0], "output.mp4", fps=8) -``` - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Face ImageVideoDescription
The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress, which moves gracefully with the rhythm of the horse, the fabric flowing fluidly in the animated motion. The dappled sunlight filters through the trees, casting soft, painterly patterns on the forest floor. Her posture is poised, showing both control and elegance as she guides the horse with ease. The animation's gentle, fluid style adds a dreamlike quality to the scene, with the woman’s calm demeanor and the peaceful surroundings evoking a sense of freedom and harmony.
The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air, which is rendered with soft, ethereal strokes. A faint smile plays on her lips as she summons a wisp of ice magic, watching with focus as the surrounding trees and ground begin to shimmer and freeze, covered in delicate ice crystals. The animation’s fluid motion brings the magic to life, with the frost spreading outward in intricate, sparkling patterns. The environment is painted with soft, watercolor-like hues, enhancing the magical, dreamlike atmosphere. The overall mood is serene yet powerful, with the quiet winter air amplifying the delicate beauty of the frozen scene.
The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes squinting slightly against the sun, while a few wisps of hair flutter in the wind. He is dressed in a light, pastel-colored shirt, and the balloons around him sway with the wind, adding a sense of playfulness to the scene. The background blurs softly, with hints of a vibrant market or park, enhancing the light-hearted, yet tender mood of the moment.
The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel.
The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge. The setting appears playful, with colorful toys scattered around and a soft rug underfoot, while sunlight streams through a nearby window, highlighting the fluttering cape and adding to the impression of heroism. The overall atmosphere is lighthearted and fun, with the baby's expressions capturing a mix of innocence and an adorable attempt at bravery, as if truly ready to save the day.
- -## Resources - -Learn more about ConsisID with the following resources. -- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features. -- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440) for more details. diff --git a/docs/source/en/using-diffusers/diffedit.md b/docs/source/en/using-diffusers/diffedit.md deleted file mode 100644 index bb1c234dd62d..000000000000 --- a/docs/source/en/using-diffusers/diffedit.md +++ /dev/null @@ -1,285 +0,0 @@ - - -# DiffEdit - -[[open-in-colab]] - -Image editing typically requires providing a mask of the area to be edited. DiffEdit automatically generates the mask for you based on a text query, making it easier overall to create a mask without image editing software. The DiffEdit algorithm works in three steps: - -1. the diffusion model denoises an image conditioned on some query text and reference text which produces different noise estimates for different areas of the image; the difference is used to infer a mask to identify which area of the image needs to be changed to match the query text -2. the input image is encoded into latent space with DDIM -3. the latents are decoded with the diffusion model conditioned on the text query, using the mask as a guide such that pixels outside the mask remain the same as in the input image - -This guide will show you how to use DiffEdit to edit images without manually creating a mask. - -Before you begin, make sure you have the following libraries installed: - -```py -# uncomment to install the necessary libraries in Colab -#!pip install -q diffusers transformers accelerate -``` - -The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated from the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, and includes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then: - -```py -source_prompt = "a bowl of fruits" -target_prompt = "a bowl of pears" -``` - -The partially inverted latents are generated from the [`~StableDiffusionDiffEditPipeline.invert`] function, and it is generally a good idea to include a `prompt` or *caption* describing the image to help guide the inverse latent sampling process. The caption can often be your `source_prompt`, but feel free to experiment with other text descriptions! - -Let's load the pipeline, scheduler, inverse scheduler, and enable some optimizations to reduce memory usage: - -```py -import torch -from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline - -pipeline = StableDiffusionDiffEditPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-1", - torch_dtype=torch.float16, - safety_checker=None, - use_safetensors=True, -) -pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) -pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config) -pipeline.enable_model_cpu_offload() -pipeline.enable_vae_slicing() -``` - -Load the image to edit: - -```py -from diffusers.utils import load_image, make_image_grid - -img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" -raw_image = load_image(img_url).resize((768, 768)) -raw_image -``` - -Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. 
You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image: - -```py -from PIL import Image - -source_prompt = "a bowl of fruits" -target_prompt = "a basket of pears" -mask_image = pipeline.generate_mask( - image=raw_image, - source_prompt=source_prompt, - target_prompt=target_prompt, -) -Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) -``` - -Next, create the inverted latents and pass it a caption describing the image: - -```py -inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents -``` - -Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`: - -```py -output_image = pipeline( - prompt=target_prompt, - mask_image=mask_image, - image_latents=inv_latents, - negative_prompt=source_prompt, -).images[0] -mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) -make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3) -``` - -
-
- -
original image
-
-
- -
edited image
-
-
- -## Generate source and target embeddings - -The source and target embeddings can be automatically generated with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model instead of creating them manually. - -Load the Flan-T5 model and tokenizer from the 🤗 Transformers library: - -```py -import torch -from transformers import AutoTokenizer, T5ForConditionalGeneration - -tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large") -model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16) -``` - -Provide some initial text to prompt the model to generate the source and target prompts. - -```py -source_concept = "bowl" -target_concept = "basket" - -source_text = f"Provide a caption for images containing a {source_concept}. " -"The captions should be in English and should be no longer than 150 characters." - -target_text = f"Provide a caption for images containing a {target_concept}. " -"The captions should be in English and should be no longer than 150 characters." -``` - -Next, create a utility function to generate the prompts: - -```py -@torch.no_grad() -def generate_prompts(input_prompt): - input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda") - - outputs = model.generate( - input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10 - ) - return tokenizer.batch_decode(outputs, skip_special_tokens=True) - -source_prompts = generate_prompts(source_text) -target_prompts = generate_prompts(target_text) -print(source_prompts) -print(target_prompts) -``` - - - -Check out the [generation strategy](https://huggingface.co/docs/transformers/main/en/generation_strategies) guide if you're interested in learning more about strategies for generating different quality text. - - - -Load the text encoder model used by the [`StableDiffusionDiffEditPipeline`] to encode the text. 
You'll use the text encoder to compute the text embeddings: - -```py -import torch -from diffusers import StableDiffusionDiffEditPipeline - -pipeline = StableDiffusionDiffEditPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True -) -pipeline.enable_model_cpu_offload() -pipeline.enable_vae_slicing() - -@torch.no_grad() -def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"): - embeddings = [] - for sent in sentences: - text_inputs = tokenizer( - sent, - padding="max_length", - max_length=tokenizer.model_max_length, - truncation=True, - return_tensors="pt", - ) - text_input_ids = text_inputs.input_ids - prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0] - embeddings.append(prompt_embeds) - return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0) - -source_embeds = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder) -target_embeds = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder) -``` - -Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_mask`] and [`~StableDiffusionDiffEditPipeline.invert`] functions, and pipeline to generate the image: - -```diff - from diffusers import DDIMInverseScheduler, DDIMScheduler - from diffusers.utils import load_image, make_image_grid - from PIL import Image - - pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) - pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config) - - img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" - raw_image = load_image(img_url).resize((768, 768)) - - mask_image = pipeline.generate_mask( - image=raw_image, -- source_prompt=source_prompt, -- target_prompt=target_prompt, -+ source_prompt_embeds=source_embeds, -+ target_prompt_embeds=target_embeds, - ) - - inv_latents = pipeline.invert( -- prompt=source_prompt, -+ prompt_embeds=source_embeds, - image=raw_image, - ).latents - - output_image = pipeline( - mask_image=mask_image, - image_latents=inv_latents, -- prompt=target_prompt, -- negative_prompt=source_prompt, -+ prompt_embeds=target_embeds, -+ negative_prompt_embeds=source_embeds, - ).images[0] - mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L") - make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3) -``` - -## Generate a caption for inversion - -While you can use the `source_prompt` as a caption to help generate the partially inverted latents, you can also use the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model to automatically generate a caption. 
- -Load the BLIP model and processor from the 🤗 Transformers library: - -```py -import torch -from transformers import BlipForConditionalGeneration, BlipProcessor - -processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base") -model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16, low_cpu_mem_usage=True) -``` - -Create a utility function to generate a caption from the input image: - -```py -@torch.no_grad() -def generate_caption(images, caption_generator, caption_processor): - text = "a photograph of" - - inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype) - caption_generator.to("cuda") - outputs = caption_generator.generate(**inputs, max_new_tokens=128) - - # offload caption generator - caption_generator.to("cpu") - - caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0] - return caption -``` - -Load an input image and generate a caption for it using the `generate_caption` function: - -```py -from diffusers.utils import load_image - -img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" -raw_image = load_image(img_url).resize((768, 768)) -caption = generate_caption(raw_image, model, processor) -``` - -
-
- -
generated caption: "a photograph of a bowl of fruit on a table"
-
-
- -Now you can drop the caption into the [`~StableDiffusionDiffEditPipeline.invert`] function to generate the partially inverted latents! diff --git a/docs/source/en/using-diffusers/inference_with_lcm.md b/docs/source/en/using-diffusers/inference_with_lcm.md deleted file mode 100644 index d0a47449ad88..000000000000 --- a/docs/source/en/using-diffusers/inference_with_lcm.md +++ /dev/null @@ -1,631 +0,0 @@ - - -# Latent Consistency Model - -[[open-in-colab]] - -[Latent Consistency Models (LCMs)](https://hf.co/papers/2310.04378) enable fast high-quality image generation by directly predicting the reverse diffusion process in the latent rather than pixel space. In other words, LCMs try to predict the noiseless image from the noisy image in contrast to typical diffusion models that iteratively remove noise from the noisy image. By avoiding the iterative sampling process, LCMs are able to generate high-quality images in 2-4 steps instead of 20-30 steps. - -LCMs are distilled from pretrained models which requires ~32 hours of A100 compute. To speed this up, [LCM-LoRAs](https://hf.co/papers/2311.05556) train a [LoRA adapter](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) which have much fewer parameters to train compared to the full model. The LCM-LoRA can be plugged into a diffusion model once it has been trained. - -This guide will show you how to use LCMs and LCM-LoRAs for fast inference on tasks and how to use them with other adapters like ControlNet or T2I-Adapter. - -> [!TIP] -> LCMs and LCM-LoRAs are available for Stable Diffusion v1.5, Stable Diffusion XL, and the SSD-1B model. You can find their checkpoints on the [Latent Consistency](https://hf.co/collections/latent-consistency/latent-consistency-models-weights-654ce61a95edd6dffccef6a8) Collections. - -## Text-to-image - - - - -To use LCMs, you need to load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps. - -A couple of notes to keep in mind when using LCMs are: - -* Typically, batch size is doubled inside the pipeline for classifier-free guidance. But LCM applies guidance with guidance embeddings and doesn't need to double the batch size, which leads to faster inference. The downside is that negative prompts don't work with LCM because they don't have any effect on the denoising process. -* The ideal range for `guidance_scale` is [3., 13.] because that is what the UNet was trained with. However, disabling `guidance_scale` with a value of 1.0 is also effective in most cases. - -```python -from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler -import torch - -unet = UNet2DConditionModel.from_pretrained( - "latent-consistency/lcm-sdxl", - torch_dtype=torch.float16, - variant="fp16", -) -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16", -).to("cuda") -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" -generator = torch.manual_seed(0) -image = pipe( - prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 -).images[0] -image -``` - -
- -
- -
- - -To use LCM-LoRAs, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps. - -A couple of notes to keep in mind when using LCM-LoRAs are: - -* Typically, batch size is doubled inside the pipeline for classifier-free guidance. But LCM applies guidance with guidance embeddings and doesn't need to double the batch size, which leads to faster inference. The downside is that negative prompts don't work with LCM because they don't have any effect on the denoising process. -* You could use guidance with LCM-LoRAs, but it is very sensitive to high `guidance_scale` values and can lead to artifacts in the generated image. The best values we've found are between [1.0, 2.0]. -* Replace [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0) with any finetuned model. For example, try using the [animagine-xl](https://huggingface.co/Linaqruf/animagine-xl) checkpoint to generate anime images with SDXL. - -```py -import torch -from diffusers import DiffusionPipeline, LCMScheduler - -pipe = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - variant="fp16", - torch_dtype=torch.float16 -).to("cuda") -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) -pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl") - -prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" -generator = torch.manual_seed(42) -image = pipe( - prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0 -).images[0] -image -``` - -
- -
- -
-
- -## Image-to-image - - - - -To use LCMs for image-to-image, you need to load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps. - -> [!TIP] -> Experiment with different values for `num_inference_steps`, `strength`, and `guidance_scale` to get the best results. - -```python -import torch -from diffusers import AutoPipelineForImage2Image, UNet2DConditionModel, LCMScheduler -from diffusers.utils import load_image - -unet = UNet2DConditionModel.from_pretrained( - "SimianLuo/LCM_Dreamshaper_v7", - subfolder="unet", - torch_dtype=torch.float16, -) - -pipe = AutoPipelineForImage2Image.from_pretrained( - "Lykon/dreamshaper-7", - unet=unet, - torch_dtype=torch.float16, - variant="fp16", -).to("cuda") -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png") -prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k" -generator = torch.manual_seed(0) -image = pipe( - prompt, - image=init_image, - num_inference_steps=4, - guidance_scale=7.5, - strength=0.5, - generator=generator -).images[0] -image -``` - -
-<!-- images: initial image | generated image -->
- - -To use LCM-LoRAs for image-to-image, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps. - -> [!TIP] -> Experiment with different values for `num_inference_steps`, `strength`, and `guidance_scale` to get the best results. - -```py -import torch -from diffusers import AutoPipelineForImage2Image, LCMScheduler -from diffusers.utils import make_image_grid, load_image - -pipe = AutoPipelineForImage2Image.from_pretrained( - "Lykon/dreamshaper-7", - torch_dtype=torch.float16, - variant="fp16", -).to("cuda") - -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") - -init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png") -prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k" - -generator = torch.manual_seed(0) -image = pipe( - prompt, - image=init_image, - num_inference_steps=4, - guidance_scale=1, - strength=0.6, - generator=generator -).images[0] -image -``` - -
-<!-- images: initial image | generated image -->
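One way to act on the tip above is to render the same seed at several `strength` values and compare the results side by side. A sketch reusing the Dreamshaper + LCM-LoRA setup from the previous example (the chosen strengths are just a starting grid):

```py
import torch
from diffusers import AutoPipelineForImage2Image, LCMScheduler
from diffusers.utils import load_image, make_image_grid

pipe = AutoPipelineForImage2Image.from_pretrained(
    "Lykon/dreamshaper-7", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png")
prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k"

# higher strength lets the model deviate further from the initial image
images = []
for strength in [0.4, 0.5, 0.6, 0.7]:
    generator = torch.manual_seed(0)
    images.append(
        pipe(
            prompt,
            image=init_image,
            num_inference_steps=4,
            guidance_scale=1,
            strength=strength,
            generator=generator,
        ).images[0]
    )
make_image_grid([init_image] + images, rows=1, cols=5)
```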
- -## Inpainting - -To use LCM-LoRAs for inpainting, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt, initial image, and mask image to generate an image in just 4 steps. - -```py -import torch -from diffusers import AutoPipelineForInpainting, LCMScheduler -from diffusers.utils import load_image, make_image_grid - -pipe = AutoPipelineForInpainting.from_pretrained( - "runwayml/stable-diffusion-inpainting", - torch_dtype=torch.float16, - variant="fp16", -).to("cuda") - -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") - -init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") -mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") - -prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" -generator = torch.manual_seed(0) -image = pipe( - prompt=prompt, - image=init_image, - mask_image=mask_image, - generator=generator, - num_inference_steps=4, - guidance_scale=4, -).images[0] -image -``` - -
-<!-- images: initial image | generated image -->
- -## Adapters - -LCMs are compatible with adapters like LoRA, ControlNet, T2I-Adapter, and AnimateDiff. You can bring the speed of LCMs to these adapters to generate images in a certain style or condition the model on another input like a canny image. - -### LoRA - -[LoRA](../using-diffusers/loading_adapters#lora) adapters can be rapidly finetuned to learn a new style from just a few images and plugged into a pretrained model to generate images in that style. - - - - -Load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LoRA weights into the LCM and generate a styled image in a few steps. - -```python -from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler -import torch - -unet = UNet2DConditionModel.from_pretrained( - "latent-consistency/lcm-sdxl", - torch_dtype=torch.float16, - variant="fp16", -) -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16", -).to("cuda") -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) -pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut") - -prompt = "papercut, a cute fox" -generator = torch.manual_seed(0) -image = pipe( - prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 -).images[0] -image -``` - -
- -
- -
- - -Replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights and the style LoRA you want to use. Combine both LoRA adapters with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method and generate a styled image in a few steps. - -```py -import torch -from diffusers import DiffusionPipeline, LCMScheduler - -pipe = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - variant="fp16", - torch_dtype=torch.float16 -).to("cuda") - -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm") -pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut") - -pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8]) - -prompt = "papercut, a cute fox" -generator = torch.manual_seed(0) -image = pipe(prompt, num_inference_steps=4, guidance_scale=1, generator=generator).images[0] -image -``` - -
- -
- -
-
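Once you are happy with the adapter weights, you could also fuse the combined LoRAs into the model weights to remove the per-step LoRA overhead; `pipe.unfuse_lora()` reverses it. A sketch building on the same setup (exact fusing behavior with multiple adapters can depend on your diffusers version):

```py
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm")
pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut")
pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8])

# bake the active adapters into the model weights; call pipe.unfuse_lora() to undo
pipe.fuse_lora()

prompt = "papercut, a cute fox"
generator = torch.manual_seed(0)
image = pipe(prompt, num_inference_steps=4, guidance_scale=1, generator=generator).images[0]
image
```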
- -### ControlNet - -[ControlNet](./controlnet) are adapters that can be trained on a variety of inputs like canny edge, pose estimation, or depth. The ControlNet can be inserted into the pipeline to provide additional conditioning and control to the model for more accurate generation. - -You can find additional ControlNet models trained on other inputs in [lllyasviel's](https://hf.co/lllyasviel) repository. - - - - -Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`]. Then you can load a LCM model into [`StableDiffusionControlNetPipeline`] and replace the scheduler with the [`LCMScheduler`]. Now pass the canny image to the pipeline and generate an image. - -> [!TIP] -> Experiment with different values for `num_inference_steps`, `controlnet_conditioning_scale`, `cross_attention_kwargs`, and `guidance_scale` to get the best results. - -```python -import torch -import cv2 -import numpy as np -from PIL import Image - -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler -from diffusers.utils import load_image, make_image_grid - -image = load_image( - "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" -).resize((512, 512)) - -image = np.array(image) - -low_threshold = 100 -high_threshold = 200 - -image = cv2.Canny(image, low_threshold, high_threshold) -image = image[:, :, None] -image = np.concatenate([image, image, image], axis=2) -canny_image = Image.fromarray(image) - -controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) -pipe = StableDiffusionControlNetPipeline.from_pretrained( - "SimianLuo/LCM_Dreamshaper_v7", - controlnet=controlnet, - torch_dtype=torch.float16, - safety_checker=None, -).to("cuda") -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -generator = torch.manual_seed(0) -image = pipe( - "the mona lisa", - image=canny_image, - num_inference_steps=4, - generator=generator, -).images[0] -make_image_grid([canny_image, image], rows=1, cols=2) -``` - -
- -
- -
- - -Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`]. Then you can load a Stable Diffusion v1.5 model into [`StableDiffusionControlNetPipeline`] and replace the scheduler with the [`LCMScheduler`]. Use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights, and pass the canny image to the pipeline and generate an image. - -> [!TIP] -> Experiment with different values for `num_inference_steps`, `controlnet_conditioning_scale`, `cross_attention_kwargs`, and `guidance_scale` to get the best results. - -```py -import torch -import cv2 -import numpy as np -from PIL import Image - -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler -from diffusers.utils import load_image - -image = load_image( - "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" -).resize((512, 512)) - -image = np.array(image) - -low_threshold = 100 -high_threshold = 200 - -image = cv2.Canny(image, low_threshold, high_threshold) -image = image[:, :, None] -image = np.concatenate([image, image, image], axis=2) -canny_image = Image.fromarray(image) - -controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) -pipe = StableDiffusionControlNetPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - controlnet=controlnet, - torch_dtype=torch.float16, - safety_checker=None, - variant="fp16" -).to("cuda") - -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") - -generator = torch.manual_seed(0) -image = pipe( - "the mona lisa", - image=canny_image, - num_inference_steps=4, - guidance_scale=1.5, - controlnet_conditioning_scale=0.8, - cross_attention_kwargs={"scale": 1}, - generator=generator, -).images[0] -image -``` - -
- -
- -
-
- -### T2I-Adapter - -[T2I-Adapter](./t2i_adapter) is an even more lightweight adapter than ControlNet, that provides an additional input to condition a pretrained model with. It is faster than ControlNet but the results may be slightly worse. - -You can find additional T2I-Adapter checkpoints trained on other inputs in [TencentArc's](https://hf.co/TencentARC) repository. - - - - -Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`]. Then load a LCM checkpoint into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Now pass the canny image to the pipeline and generate an image. - -```python -import torch -import cv2 -import numpy as np -from PIL import Image - -from diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler -from diffusers.utils import load_image, make_image_grid - -# detect the canny map in low resolution to avoid high-frequency details -image = load_image( - "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" -).resize((384, 384)) - -image = np.array(image) - -low_threshold = 100 -high_threshold = 200 - -image = cv2.Canny(image, low_threshold, high_threshold) -image = image[:, :, None] -image = np.concatenate([image, image, image], axis=2) -canny_image = Image.fromarray(image).resize((1024, 1216)) - -adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda") - -unet = UNet2DConditionModel.from_pretrained( - "latent-consistency/lcm-sdxl", - torch_dtype=torch.float16, - variant="fp16", -) -pipe = StableDiffusionXLAdapterPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - unet=unet, - adapter=adapter, - torch_dtype=torch.float16, - variant="fp16", -).to("cuda") - -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -prompt = "the mona lisa, 4k picture, high quality" -negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" - -generator = torch.manual_seed(0) -image = pipe( - prompt=prompt, - negative_prompt=negative_prompt, - image=canny_image, - num_inference_steps=4, - guidance_scale=5, - adapter_conditioning_scale=0.8, - adapter_conditioning_factor=1, - generator=generator, -).images[0] -``` - -
- -
- -
- - -Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`]. Replace the scheduler with the [`LCMScheduler`], and use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights. Pass the canny image to the pipeline and generate an image. - -```py -import torch -import cv2 -import numpy as np -from PIL import Image - -from diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler -from diffusers.utils import load_image, make_image_grid - -# detect the canny map in low resolution to avoid high-frequency details -image = load_image( - "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" -).resize((384, 384)) - -image = np.array(image) - -low_threshold = 100 -high_threshold = 200 - -image = cv2.Canny(image, low_threshold, high_threshold) -image = image[:, :, None] -image = np.concatenate([image, image, image], axis=2) -canny_image = Image.fromarray(image).resize((1024, 1024)) - -adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda") - -pipe = StableDiffusionXLAdapterPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - adapter=adapter, - torch_dtype=torch.float16, - variant="fp16", -).to("cuda") - -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl") - -prompt = "the mona lisa, 4k picture, high quality" -negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" - -generator = torch.manual_seed(0) -image = pipe( - prompt=prompt, - negative_prompt=negative_prompt, - image=canny_image, - num_inference_steps=4, - guidance_scale=1.5, - adapter_conditioning_scale=0.8, - adapter_conditioning_factor=1, - generator=generator, -).images[0] -``` - -
- -
- -
-
- -### AnimateDiff - -[AnimateDiff](../api/pipelines/animatediff) is an adapter that adds motion to an image. It can be used with most Stable Diffusion models, effectively turning them into "video generation" models. Generating good results with a video model usually requires generating multiple frames (16-24), which can be very slow with a regular Stable Diffusion model. LCM-LoRA can speed up this process by only taking 4-8 steps for each frame. - -Load a [`AnimateDiffPipeline`] and pass a [`MotionAdapter`] to it. Then replace the scheduler with the [`LCMScheduler`], and combine both LoRA adapters with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method. Now you can pass a prompt to the pipeline and generate an animated image. - -```py -import torch -from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler, LCMScheduler -from diffusers.utils import export_to_gif - -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5") -pipe = AnimateDiffPipeline.from_pretrained( - "frankjoshua/toonyou_beta6", - motion_adapter=adapter, -).to("cuda") - -# set scheduler -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -# load LCM-LoRA -pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm") -pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora") - -pipe.set_adapters(["lcm", "motion-lora"], adapter_weights=[0.55, 1.2]) - -prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress" -generator = torch.manual_seed(0) -frames = pipe( - prompt=prompt, - num_inference_steps=5, - guidance_scale=1.25, - cross_attention_kwargs={"scale": 1}, - num_frames=24, - generator=generator -).frames[0] -export_to_gif(frames, "animation.gif") -``` - -
- -
diff --git a/docs/source/en/using-diffusers/inference_with_tcd_lora.md b/docs/source/en/using-diffusers/inference_with_tcd_lora.md deleted file mode 100644 index 88dd4733b5c3..000000000000 --- a/docs/source/en/using-diffusers/inference_with_tcd_lora.md +++ /dev/null @@ -1,438 +0,0 @@ - - -[[open-in-colab]] - -# Trajectory Consistency Distillation-LoRA - -Trajectory Consistency Distillation (TCD) enables a model to generate higher quality and more detailed images with fewer steps. Moreover, owing to the effective error mitigation during the distillation process, TCD demonstrates superior performance even under conditions of large inference steps. - -The major advantages of TCD are: - -- Better than Teacher: TCD demonstrates superior generative quality at both small and large inference steps and exceeds the performance of [DPM-Solver++(2S)](../../api/schedulers/multistep_dpm_solver) with Stable Diffusion XL (SDXL). There is no additional discriminator or LPIPS supervision included during TCD training. - -- Flexible Inference Steps: The inference steps for TCD sampling can be freely adjusted without adversely affecting the image quality. - -- Freely change detail level: During inference, the level of detail in the image can be adjusted with a single hyperparameter, *gamma*. - -> [!TIP] -> For more technical details of TCD, please refer to the [paper](https://huggingface.co/papers/2402.19159) or official [project page](https://mhh0318.github.io/tcd/). - -For large models like SDXL, TCD is trained with [LoRA](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) to reduce memory usage. This is also useful because you can reuse LoRAs between different finetuned models, as long as they share the same base model, without further training. - - - -This guide will show you how to perform inference with TCD-LoRAs for a variety of tasks like text-to-image and inpainting, as well as how you can easily combine TCD-LoRAs with other adapters. Choose one of the supported base model and it's corresponding TCD-LoRA checkpoint from the table below to get started. - -| Base model | TCD-LoRA checkpoint | -|-------------------------------------------------------------------------------------------------|----------------------------------------------------------------| -| [stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) | [TCD-SD15](https://huggingface.co/h1t/TCD-SD15-LoRA) | -| [stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) | [TCD-SD21-base](https://huggingface.co/h1t/TCD-SD21-base-LoRA) | -| [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) | [TCD-SDXL](https://huggingface.co/h1t/TCD-SDXL-LoRA) | - - -Make sure you have [PEFT](https://github.com/huggingface/peft) installed for better LoRA support. - -```bash -pip install -U peft -``` - -## General tasks - -In this guide, let's use the [`StableDiffusionXLPipeline`] and the [`TCDScheduler`]. Use the [`~StableDiffusionPipeline.load_lora_weights`] method to load the SDXL-compatible TCD-LoRA weights. - -A few tips to keep in mind for TCD-LoRA inference are to: - -- Keep the `num_inference_steps` between 4 and 50 -- Set `eta` (used to control stochasticity at each step) between 0 and 1. You should use a higher `eta` when increasing the number of inference steps, but the downside is that a larger `eta` in [`TCDScheduler`] leads to blurrier images. A value of 0.3 is recommended to produce good results. 
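Because a good `eta` depends on how many steps you run, a practical way to pick values is to render the same prompt at a few (steps, eta) pairs and compare. A rough sketch assuming the SDXL base model and TCD-SDXL-LoRA from the table above (the prompt and the grid of values are arbitrary, not a tuned recipe):

```python
import torch
from diffusers import StableDiffusionXLPipeline, TCDScheduler
from diffusers.utils import make_image_grid

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("h1t/TCD-SDXL-LoRA")
pipe.fuse_lora()

prompt = "a photo of a lighthouse on a rocky cliff at sunset, highly detailed"

# fewer steps tolerate a smaller eta; more steps usually want a larger one
images = []
for steps, eta in [(4, 0.3), (8, 0.3), (16, 0.5)]:
    generator = torch.Generator(device="cuda").manual_seed(0)
    images.append(
        pipe(prompt=prompt, num_inference_steps=steps, guidance_scale=0, eta=eta, generator=generator).images[0]
    )
make_image_grid(images, rows=1, cols=3)
```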
- - - - -```python -import torch -from diffusers import StableDiffusionXLPipeline, TCDScheduler - -device = "cuda" -base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" -tcd_lora_id = "h1t/TCD-SDXL-LoRA" - -pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights(tcd_lora_id) -pipe.fuse_lora() - -prompt = "Painting of the orange cat Otto von Garfield, Count of Bismarck-Schönhausen, Duke of Lauenburg, Minister-President of Prussia. Depicted wearing a Prussian Pickelhaube and eating his favorite meal - lasagna." - -image = pipe( - prompt=prompt, - num_inference_steps=4, - guidance_scale=0, - eta=0.3, - generator=torch.Generator(device=device).manual_seed(0), -).images[0] -``` - -![](https://github.com/jabir-zheng/TCD/raw/main/assets/demo_image.png) - - - - - -```python -import torch -from diffusers import AutoPipelineForInpainting, TCDScheduler -from diffusers.utils import load_image, make_image_grid - -device = "cuda" -base_model_id = "diffusers/stable-diffusion-xl-1.0-inpainting-0.1" -tcd_lora_id = "h1t/TCD-SDXL-LoRA" - -pipe = AutoPipelineForInpainting.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights(tcd_lora_id) -pipe.fuse_lora() - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).resize((1024, 1024)) -mask_image = load_image(mask_url).resize((1024, 1024)) - -prompt = "a tiger sitting on a park bench" - -image = pipe( - prompt=prompt, - image=init_image, - mask_image=mask_image, - num_inference_steps=8, - guidance_scale=0, - eta=0.3, - strength=0.99, # make sure to use `strength` below 1.0 - generator=torch.Generator(device=device).manual_seed(0), -).images[0] - -grid_image = make_image_grid([init_image, mask_image, image], rows=1, cols=3) -``` - -![](https://github.com/jabir-zheng/TCD/raw/main/assets/inpainting_tcd.png) - - - - - -## Community models - -TCD-LoRA also works with many community finetuned models and plugins. For example, load the [animagine-xl-3.0](https://huggingface.co/cagliostrolab/animagine-xl-3.0) checkpoint which is a community finetuned version of SDXL for generating anime images. - -```python -import torch -from diffusers import StableDiffusionXLPipeline, TCDScheduler - -device = "cuda" -base_model_id = "cagliostrolab/animagine-xl-3.0" -tcd_lora_id = "h1t/TCD-SDXL-LoRA" - -pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights(tcd_lora_id) -pipe.fuse_lora() - -prompt = "A man, clad in a meticulously tailored military uniform, stands with unwavering resolve. The uniform boasts intricate details, and his eyes gleam with determination. Strands of vibrant, windswept hair peek out from beneath the brim of his cap." 
- -image = pipe( - prompt=prompt, - num_inference_steps=8, - guidance_scale=0, - eta=0.3, - generator=torch.Generator(device=device).manual_seed(0), -).images[0] -``` - -![](https://github.com/jabir-zheng/TCD/raw/main/assets/animagine_xl.png) - -TCD-LoRA also supports other LoRAs trained on different styles. For example, let's load the [TheLastBen/Papercut_SDXL](https://huggingface.co/TheLastBen/Papercut_SDXL) LoRA and fuse it with the TCD-LoRA with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method. - -> [!TIP] -> Check out the [Merge LoRAs](merge_loras) guide to learn more about efficient merging methods. - -```python -import torch -from diffusers import StableDiffusionXLPipeline -from scheduling_tcd import TCDScheduler - -device = "cuda" -base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" -tcd_lora_id = "h1t/TCD-SDXL-LoRA" -styled_lora_id = "TheLastBen/Papercut_SDXL" - -pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device) -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights(tcd_lora_id, adapter_name="tcd") -pipe.load_lora_weights(styled_lora_id, adapter_name="style") -pipe.set_adapters(["tcd", "style"], adapter_weights=[1.0, 1.0]) - -prompt = "papercut of a winter mountain, snow" - -image = pipe( - prompt=prompt, - num_inference_steps=4, - guidance_scale=0, - eta=0.3, - generator=torch.Generator(device=device).manual_seed(0), -).images[0] -``` - -![](https://github.com/jabir-zheng/TCD/raw/main/assets/styled_lora.png) - - -## Adapters - -TCD-LoRA is very versatile, and it can be combined with other adapter types like ControlNets, IP-Adapter, and AnimateDiff. - - - - -### Depth ControlNet - -```python -import torch -import numpy as np -from PIL import Image -from transformers import DPTImageProcessor, DPTForDepthEstimation -from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline -from diffusers.utils import load_image, make_image_grid -from scheduling_tcd import TCDScheduler - -device = "cuda" -depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to(device) -feature_extractor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas") - -def get_depth_map(image): - image = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device) - with torch.no_grad(), torch.autocast(device): - depth_map = depth_estimator(image).predicted_depth - - depth_map = torch.nn.functional.interpolate( - depth_map.unsqueeze(1), - size=(1024, 1024), - mode="bicubic", - align_corners=False, - ) - depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True) - depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True) - depth_map = (depth_map - depth_min) / (depth_max - depth_min) - image = torch.cat([depth_map] * 3, dim=1) - - image = image.permute(0, 2, 3, 1).cpu().numpy()[0] - image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8)) - return image - -base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" -controlnet_id = "diffusers/controlnet-depth-sdxl-1.0" -tcd_lora_id = "h1t/TCD-SDXL-LoRA" - -controlnet = ControlNetModel.from_pretrained( - controlnet_id, - torch_dtype=torch.float16, - variant="fp16", -) -pipe = StableDiffusionXLControlNetPipeline.from_pretrained( - base_model_id, - controlnet=controlnet, - torch_dtype=torch.float16, - variant="fp16", -) -pipe.enable_model_cpu_offload() - -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights(tcd_lora_id) 
-pipe.fuse_lora() - -prompt = "stormtrooper lecture, photorealistic" - -image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-depth/resolve/main/images/stormtrooper.png") -depth_image = get_depth_map(image) - -controlnet_conditioning_scale = 0.5 # recommended for good generalization - -image = pipe( - prompt, - image=depth_image, - num_inference_steps=4, - guidance_scale=0, - eta=0.3, - controlnet_conditioning_scale=controlnet_conditioning_scale, - generator=torch.Generator(device=device).manual_seed(0), -).images[0] - -grid_image = make_image_grid([depth_image, image], rows=1, cols=2) -``` - -![](https://github.com/jabir-zheng/TCD/raw/main/assets/controlnet_depth_tcd.png) - -### Canny ControlNet -```python -import torch -from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline -from diffusers.utils import load_image, make_image_grid -from scheduling_tcd import TCDScheduler - -device = "cuda" -base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" -controlnet_id = "diffusers/controlnet-canny-sdxl-1.0" -tcd_lora_id = "h1t/TCD-SDXL-LoRA" - -controlnet = ControlNetModel.from_pretrained( - controlnet_id, - torch_dtype=torch.float16, - variant="fp16", -) -pipe = StableDiffusionXLControlNetPipeline.from_pretrained( - base_model_id, - controlnet=controlnet, - torch_dtype=torch.float16, - variant="fp16", -) -pipe.enable_model_cpu_offload() - -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights(tcd_lora_id) -pipe.fuse_lora() - -prompt = "ultrarealistic shot of a furry blue bird" - -canny_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png") - -controlnet_conditioning_scale = 0.5 # recommended for good generalization - -image = pipe( - prompt, - image=canny_image, - num_inference_steps=4, - guidance_scale=0, - eta=0.3, - controlnet_conditioning_scale=controlnet_conditioning_scale, - generator=torch.Generator(device=device).manual_seed(0), -).images[0] - -grid_image = make_image_grid([canny_image, image], rows=1, cols=2) -``` -![](https://github.com/jabir-zheng/TCD/raw/main/assets/controlnet_canny_tcd.png) - - -The inference parameters in this example might not work for all examples, so we recommend you to try different values for `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale` and `cross_attention_kwargs` parameters and choose the best one. - - - - - -This example shows how to use the TCD-LoRA with the [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter/tree/main) and SDXL. 
- -```python -import torch -from diffusers import StableDiffusionXLPipeline -from diffusers.utils import load_image, make_image_grid - -from ip_adapter import IPAdapterXL -from scheduling_tcd import TCDScheduler - -device = "cuda" -base_model_path = "stabilityai/stable-diffusion-xl-base-1.0" -image_encoder_path = "sdxl_models/image_encoder" -ip_ckpt = "sdxl_models/ip-adapter_sdxl.bin" -tcd_lora_id = "h1t/TCD-SDXL-LoRA" - -pipe = StableDiffusionXLPipeline.from_pretrained( - base_model_path, - torch_dtype=torch.float16, - variant="fp16" -) -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -pipe.load_lora_weights(tcd_lora_id) -pipe.fuse_lora() - -ip_model = IPAdapterXL(pipe, image_encoder_path, ip_ckpt, device) - -ref_image = load_image("https://raw.githubusercontent.com/tencent-ailab/IP-Adapter/main/assets/images/woman.png").resize((512, 512)) - -prompt = "best quality, high quality, wearing sunglasses" - -image = ip_model.generate( - pil_image=ref_image, - prompt=prompt, - scale=0.5, - num_samples=1, - num_inference_steps=4, - guidance_scale=0, - eta=0.3, - seed=0, -)[0] - -grid_image = make_image_grid([ref_image, image], rows=1, cols=2) -``` - -![](https://github.com/jabir-zheng/TCD/raw/main/assets/ip_adapter.png) - - - - - - -[`AnimateDiff`] allows animating images using Stable Diffusion models. TCD-LoRA can substantially accelerate the process without degrading image quality. The quality of animation with TCD-LoRA and AnimateDiff has a more lucid outcome. - -```python -import torch -from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler -from scheduling_tcd import TCDScheduler -from diffusers.utils import export_to_gif - -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5") -pipe = AnimateDiffPipeline.from_pretrained( - "frankjoshua/toonyou_beta6", - motion_adapter=adapter, -).to("cuda") - -# set TCDScheduler -pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) - -# load TCD LoRA -pipe.load_lora_weights("h1t/TCD-SD15-LoRA", adapter_name="tcd") -pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora") - -pipe.set_adapters(["tcd", "motion-lora"], adapter_weights=[1.0, 1.2]) - -prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress" -generator = torch.manual_seed(0) -frames = pipe( - prompt=prompt, - num_inference_steps=5, - guidance_scale=0, - cross_attention_kwargs={"scale": 1}, - num_frames=24, - eta=0.3, - generator=generator -).frames[0] -export_to_gif(frames, "animation.gif") -``` - -![](https://github.com/jabir-zheng/TCD/raw/main/assets/animation_example.gif) - - - \ No newline at end of file diff --git a/docs/source/en/using-diffusers/kandinsky.md b/docs/source/en/using-diffusers/kandinsky.md deleted file mode 100644 index a482380524e7..000000000000 --- a/docs/source/en/using-diffusers/kandinsky.md +++ /dev/null @@ -1,768 +0,0 @@ - - -# Kandinsky - -[[open-in-colab]] - -The Kandinsky models are a series of multilingual text-to-image generation models. The Kandinsky 2.0 model uses two multilingual text encoders and concatenates those results for the UNet. - -[Kandinsky 2.1](../api/pipelines/kandinsky) changes the architecture to include an image prior model ([`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip)) to generate a mapping between text and image embeddings. 
The mapping provides better text-image alignment and it is used with the text embeddings during training, leading to higher quality results. Finally, Kandinsky 2.1 uses a [Modulating Quantized Vectors (MoVQ)](https://huggingface.co/papers/2209.09002) decoder - which adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images. - -[Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes. - -[Kandinsky 3](../api/pipelines/kandinsky3) simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses [Flan-UL2](https://huggingface.co/google/flan-ul2) to encode text, a UNet with [BigGan-deep](https://hf.co/papers/1809.11096) blocks, and [Sber-MoVQGAN](https://github.com/ai-forever/MoVQGAN) to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet. - -This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more. - -Before you begin, make sure you have the following libraries installed: - -```py -# uncomment to install the necessary libraries in Colab -#!pip install -q diffusers transformers accelerate -``` - - - -Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding. - -
- -Kandinsky 3 has a more concise architecture and it doesn't require a prior model. This means its usage is identical to other diffusion models like [Stable Diffusion XL](sdxl). - -
- -## Text-to-image - -To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt `""`. For better results, you can pass an actual `negative_prompt` to the prior pipeline, but this'll increase the effective batch size of the prior pipeline by 2x. - - - - -```py -from diffusers import KandinskyPriorPipeline, KandinskyPipeline -import torch - -prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda") -pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda") - -prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" -negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better -image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple() -``` - -Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate an image: - -```py -image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0] -image -``` - -
- -
- -
- - -```py -from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline -import torch - -prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda") -pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda") - -prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" -negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better -image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple() -``` - -Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipeline`] to generate an image: - -```py -image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0] -image -``` - -
- -
- -
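The Kandinsky 2.2 snippet above defines a `negative_prompt` but only encodes the positive prompt with the prior. If you want the prior itself to use the negative prompt (at the cost of the larger effective batch mentioned earlier), a small sketch:

```py
import torch
from diffusers import KandinskyV22PriorPipeline

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

# passing the negative prompt here doubles the prior's effective batch size
image_embeds, negative_image_embeds = prior_pipeline(
    prompt, negative_prompt=negative_prompt, guidance_scale=1.0
).to_tuple()
```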
- - -Kandinsky 3 doesn't require a prior model so you can directly load the [`Kandinsky3Pipeline`] and pass a prompt to generate an image: - -```py -from diffusers import Kandinsky3Pipeline -import torch - -pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16) -pipeline.enable_model_cpu_offload() - -prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" -image = pipeline(prompt).images[0] -image -``` - - -
- -🤗 Diffusers also provides an end-to-end API with the [`KandinskyCombinedPipeline`] and [`KandinskyV22CombinedPipeline`], meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want. - -Use the [`AutoPipelineForText2Image`] to automatically call the combined pipelines under the hood: - - - - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) -pipeline.enable_model_cpu_offload() - -prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" -negative_prompt = "low quality, bad quality" - -image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0] -image -``` - - - - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16) -pipeline.enable_model_cpu_offload() - -prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" -negative_prompt = "low quality, bad quality" - -image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0] -image -``` - - - - -## Image-to-image - -For image-to-image, pass the initial image and text prompt to condition the image to the pipeline. Start by loading the prior pipeline: - - - - -```py -import torch -from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline - -prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` - - - - -```py -import torch -from diffusers import KandinskyV22Img2ImgPipeline, KandinskyPriorPipeline - -prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` - - - - -Kandinsky 3 doesn't require a prior model so you can directly load the image-to-image pipeline: - -```py -from diffusers import Kandinsky3Img2ImgPipeline -from diffusers.utils import load_image -import torch - -pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16) -pipeline.enable_model_cpu_offload() -``` - - - - -Download an image to condition on: - -```py -from diffusers.utils import load_image - -# download image -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" -original_image = load_image(url) -original_image = original_image.resize((768, 512)) -``` - -
- -
- -Generate the `image_embeds` and `negative_image_embeds` with the prior pipeline: - -```py -prompt = "A fantasy landscape, Cinematic lighting" -negative_prompt = "low quality, bad quality" - -image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple() -``` - -Now pass the original image, and all the prompts and embeddings to the pipeline to generate an image: - - - - -```py -from diffusers.utils import make_image_grid - -image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] -make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) -``` - -
- -
- -
- - -```py -from diffusers.utils import make_image_grid - -image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] -make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) -``` - -
- -
- -
- - -```py -image = pipeline(prompt, negative_prompt=negative_prompt, image=image, strength=0.75, num_inference_steps=25).images[0] -image -``` - - -
- -🤗 Diffusers also provides an end-to-end API with the [`KandinskyImg2ImgCombinedPipeline`] and [`KandinskyV22Img2ImgCombinedPipeline`], meaning you don't have to separately load the prior and image-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want. - -Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipelines under the hood: - - - - -```py -from diffusers import AutoPipelineForImage2Image -from diffusers.utils import make_image_grid, load_image -import torch - -pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True) -pipeline.enable_model_cpu_offload() - -prompt = "A fantasy landscape, Cinematic lighting" -negative_prompt = "low quality, bad quality" - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" -original_image = load_image(url) - -original_image.thumbnail((768, 768)) - -image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0] -make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) -``` - - - - -```py -from diffusers import AutoPipelineForImage2Image -from diffusers.utils import make_image_grid, load_image -import torch - -pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16) -pipeline.enable_model_cpu_offload() - -prompt = "A fantasy landscape, Cinematic lighting" -negative_prompt = "low quality, bad quality" - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" -original_image = load_image(url) - -original_image.thumbnail((768, 768)) - -image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0] -make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) -``` - - - - -## Inpainting - - - -⚠️ The Kandinsky models use ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels: - -```py -# For PIL input -import PIL.ImageOps -mask = PIL.ImageOps.invert(mask) - -# For PyTorch and NumPy input -mask = 1 - mask -``` - - - -For inpainting, you'll need the original image, a mask of the area to replace in the original image, and a text prompt of what to inpaint. 
Load the prior pipeline: - - - - -```py -from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline -from diffusers.utils import load_image, make_image_grid -import torch -import numpy as np -from PIL import Image - -prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` - - - - -```py -from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline -from diffusers.utils import load_image, make_image_grid -import torch -import numpy as np -from PIL import Image - -prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` - - - - -Load an initial image and create a mask: - -```py -init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") -mask = np.zeros((768, 768), dtype=np.float32) -# mask area above cat's head -mask[:250, 250:-250] = 1 -``` - -Generate the embeddings with the prior pipeline: - -```py -prompt = "a hat" -prior_output = prior_pipeline(prompt) -``` - -Now pass the initial image, mask, and prompt and embeddings to the pipeline to generate an image: - - - - -```py -output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] -mask = Image.fromarray((mask*255).astype('uint8'), 'L') -make_image_grid([init_image, mask, output_image], rows=1, cols=3) -``` - -
- -
- -
- - -```py -output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] -mask = Image.fromarray((mask*255).astype('uint8'), 'L') -make_image_grid([init_image, mask, output_image], rows=1, cols=3) -``` - -
- -
- -
-
- -You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`KandinskyV22InpaintCombinedPipeline`] to call the prior and decoder pipelines together under the hood. Use the [`AutoPipelineForInpainting`] for this: - - - - -```py -import torch -import numpy as np -from PIL import Image -from diffusers import AutoPipelineForInpainting -from diffusers.utils import load_image, make_image_grid - -pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16) -pipe.enable_model_cpu_offload() - -init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") -mask = np.zeros((768, 768), dtype=np.float32) -# mask area above cat's head -mask[:250, 250:-250] = 1 -prompt = "a hat" - -output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0] -mask = Image.fromarray((mask*255).astype('uint8'), 'L') -make_image_grid([init_image, mask, output_image], rows=1, cols=3) -``` - - - - -```py -import torch -import numpy as np -from PIL import Image -from diffusers import AutoPipelineForInpainting -from diffusers.utils import load_image, make_image_grid - -pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16) -pipe.enable_model_cpu_offload() - -init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") -mask = np.zeros((768, 768), dtype=np.float32) -# mask area above cat's head -mask[:250, 250:-250] = 1 -prompt = "a hat" - -output_image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0] -mask = Image.fromarray((mask*255).astype('uint8'), 'L') -make_image_grid([init_image, mask, output_image], rows=1, cols=3) -``` - - - - -## Interpolation - -Interpolation allows you to explore the latent space between the image and text embeddings which is a cool way to see some of the prior model's intermediate outputs. Load the prior pipeline and two images you'd like to interpolate: - - - - -```py -from diffusers import KandinskyPriorPipeline, KandinskyPipeline -from diffusers.utils import load_image, make_image_grid -import torch - -prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") -img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") -make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) -``` - - - - -```py -from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline -from diffusers.utils import load_image, make_image_grid -import torch - -prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") -img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") -make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) -``` - - - - -
-<!-- source images: a cat | Van Gogh's Starry Night painting -->
- -Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation! - -```py -images_texts = ["a cat", img_1, img_2] -weights = [0.3, 0.3, 0.4] -``` - -Call the `interpolate` function to generate the embeddings, and then pass them to the pipeline to generate the image: - - - - -```py -# prompt can be left empty -prompt = "" -prior_out = prior_pipeline.interpolate(images_texts, weights) - -pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda") - -image = pipeline(prompt, **prior_out, height=768, width=768).images[0] -image -``` - -
- -
- -
- - -```py -# prompt can be left empty -prompt = "" -prior_out = prior_pipeline.interpolate(images_texts, weights) - -pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda") - -image = pipeline(prompt, **prior_out, height=768, width=768).images[0] -image -``` - -
- -
- -
-
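To see how the weights shift the result, you could rerun the interpolation with the balance tilted toward one of the inputs. A sketch using the Kandinsky 2.2 prior and decoder with the same two images (the new weights are just an example):

```py
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
from diffusers.utils import load_image

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
pipeline = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")

# weight the painting more heavily than the cat photo and the text prompt
images_texts = ["a cat", img_1, img_2]
weights = [0.2, 0.2, 0.6]

prior_out = prior_pipeline.interpolate(images_texts, weights)
image = pipeline(**prior_out, height=768, width=768).images[0]
image
```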
- -## ControlNet - - - -⚠️ ControlNet is only supported for Kandinsky 2.2! - - - -ControlNet enables conditioning large pretrained diffusion models with additional inputs such as a depth map or edge detection. For example, you can condition Kandinsky 2.2 with a depth map so the model understands and preserves the structure of the depth image. - -Let's load an image and extract its depth map: - -```py -from diffusers.utils import load_image - -img = load_image( - "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" -).resize((768, 768)) -img -``` - 
- -
- -Then you can use the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to process the image and retrieve the depth map: - -```py -import torch -import numpy as np - -from transformers import pipeline - -def make_hint(image, depth_estimator): - image = depth_estimator(image)["depth"] - image = np.array(image) - image = image[:, :, None] - image = np.concatenate([image, image, image], axis=2) - detected_map = torch.from_numpy(image).float() / 255.0 - hint = detected_map.permute(2, 0, 1) - return hint - -depth_estimator = pipeline("depth-estimation") -hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") -``` - -### Text-to-image [[controlnet-text-to-image]] - -Load the prior pipeline and the [`KandinskyV22ControlnetPipeline`]: - -```py -from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline - -prior_pipeline = KandinskyV22PriorPipeline.from_pretrained( - "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True -).to("cuda") - -pipeline = KandinskyV22ControlnetPipeline.from_pretrained( - "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16 -).to("cuda") -``` - -Generate the image embeddings from a prompt and negative prompt: - -```py -prompt = "A robot, 4k photo" -negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" - -generator = torch.Generator(device="cuda").manual_seed(43) - -image_emb, zero_image_emb = prior_pipeline( - prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator -).to_tuple() -``` - -Finally, pass the image embeddings and the depth image to the [`KandinskyV22ControlnetPipeline`] to generate an image: - -```py -image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0] -image -``` - -
- -
- -### Image-to-image [[controlnet-image-to-image]] - -For image-to-image with ControlNet, you'll need to use the: - -- [`KandinskyV22PriorEmb2EmbPipeline`] to generate the image embeddings from a text prompt and an image -- [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings - -Process and extract a depth map of an initial image of a cat with the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers: - -```py -import torch -import numpy as np - -from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline -from diffusers.utils import load_image -from transformers import pipeline - -img = load_image( - "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" -).resize((768, 768)) - -def make_hint(image, depth_estimator): - image = depth_estimator(image)["depth"] - image = np.array(image) - image = image[:, :, None] - image = np.concatenate([image, image, image], axis=2) - detected_map = torch.from_numpy(image).float() / 255.0 - hint = detected_map.permute(2, 0, 1) - return hint - -depth_estimator = pipeline("depth-estimation") -hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") -``` - -Load the prior pipeline and the [`KandinskyV22ControlnetImg2ImgPipeline`]: - -```py -prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained( - "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True -).to("cuda") - -pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained( - "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16 -).to("cuda") -``` - -Pass a text prompt and the initial image to the prior pipeline to generate the image embeddings: - -```py -prompt = "A robot, 4k photo" -negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" - -generator = torch.Generator(device="cuda").manual_seed(43) - -img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator) -negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator) -``` - -Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings: - -```py -image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0] -make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) -``` - -
- -
- -## Optimizations - -Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference. - -1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0: - -```diff - from diffusers import DiffusionPipeline - import torch - - pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) -+ pipe.enable_xformers_memory_efficient_attention() -``` - -2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA): - -```diff - pipe.unet.to(memory_format=torch.channels_last) -+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -``` - -This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]: - -```py -from diffusers.models.attention_processor import AttnAddedKVProcessor2_0 - -pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0()) -``` - -3. Offload the model to the CPU with [`~KandinskyPriorPipeline.enable_model_cpu_offload`] to avoid out-of-memory errors: - -```diff - from diffusers import DiffusionPipeline - import torch - - pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) -+ pipe.enable_model_cpu_offload() -``` - -4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality: - -```py -from diffusers import DDPMScheduler -from diffusers import DiffusionPipeline - -scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler") -pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` diff --git a/docs/source/en/using-diffusers/marigold_usage.md b/docs/source/en/using-diffusers/marigold_usage.md deleted file mode 100644 index f66e47bada09..000000000000 --- a/docs/source/en/using-diffusers/marigold_usage.md +++ /dev/null @@ -1,605 +0,0 @@ - - -# Marigold Computer Vision - -**Marigold** is a diffusion-based [method](https://huggingface.co/papers/2312.02145) and a collection of [pipelines](../api/pipelines/marigold) designed for -dense computer vision tasks, including **monocular depth prediction**, **surface normals estimation**, and **intrinsic -image decomposition**. - -This guide will walk you through using Marigold to generate fast and high-quality predictions for images and videos. - -Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a -corresponding prediction. 
-Currently, the following computer vision tasks are implemented: - -| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities | -|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | -| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | -| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1),
[prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) | - -All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face. -They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train -new model checkpoints. -The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps. - -| Checkpoint | Modality | Comment | -|-----------------------------------------------------------------------------------------------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. | -| [prs-eth/marigold-normals-v0-1](https://huggingface.co/prs-eth/marigold-normals-v0-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen space camera, with values in the range from -1 to 1. | -| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition is comprised of Albedo and two BRDF material properties: Roughness and Metallicity. | -| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | HyperSim decomposition of an image \\(I\\) is comprised of Albedo \\(A\\), Diffuse shading \\(S\\), and Non-diffuse residual \\(R\\): \\(I = A*S+R\\). | - -The examples below are mostly given for depth prediction, but they can be universally applied to other supported -modalities. -We showcase the predictions using the same input image of Albert Einstein generated by Midjourney. -This makes it easier to compare visualizations of the predictions across various modalities and checkpoints. - -
-*Figure: Example input image for all Marigold pipelines*
- -## Depth Prediction - -To get a depth prediction, load the `prs-eth/marigold-depth-v1-1` checkpoint into [`MarigoldDepthPipeline`], -put the image through the pipeline, and save the predictions: - -```python -import diffusers -import torch - -pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 -).to("cuda") - -image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -depth = pipe(image) - -vis = pipe.image_processor.visualize_depth(depth.prediction) -vis[0].save("einstein_depth.png") - -depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction) -depth_16bit[0].save("einstein_depth_16bit.png") -``` - -The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth`] function applies one of -[matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]` -depth range into an RGB image. -With the `Spectral` colormap, pixels with near depth are painted red, and far pixels are blue. -The 16-bit PNG file stores the single channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`. -Below are the raw and the visualized predictions. The darker and closer areas (mustache) are easier to distinguish in -the visualization. - -
-*Figures: Predicted depth (16-bit PNG) | Predicted depth visualization (Spectral)*
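-
-Since the 16-bit PNG export described above is just a linear mapping from `[0, 1]` to `[0, 65535]`, it can be inverted with a few lines of NumPy if you need the raw values back later. This is only a sketch; the file name is the one used in the export example above.
-
-```python
-import numpy as np
-from PIL import Image
-
-# Invert the linear [0, 1] -> [0, 65535] mapping used when exporting the 16-bit PNG
-depth_16bit = np.asarray(Image.open("einstein_depth_16bit.png"), dtype=np.uint16)
-depth = depth_16bit.astype(np.float32) / 65535.0
-print(depth.min(), depth.max())  # values back in the [0, 1] range
-```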
- -## Surface Normals Estimation - -Load the `prs-eth/marigold-normals-v1-1` checkpoint into [`MarigoldNormalsPipeline`], put the image through the -pipeline, and save the predictions: - -```python -import diffusers -import torch - -pipe = diffusers.MarigoldNormalsPipeline.from_pretrained( - "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16 -).to("cuda") - -image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -normals = pipe(image) - -vis = pipe.image_processor.visualize_normals(normals.prediction) -vis[0].save("einstein_normals.png") -``` - -The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals`] maps the three-dimensional -prediction with pixel values in the range `[-1, 1]` into an RGB image. -The visualization function supports flipping surface normals axes to make the visualization compatible with other -choices of the frame of reference. -Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where `X` axis -points right, `Y` axis points up, and `Z` axis points at the viewer. -Below is the visualized prediction: - -
-*Figure: Predicted surface normals visualization*
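-
-The color coding used by `visualize_normals` can be illustrated with a short sketch. This is an assumption-level illustration of the linear `[-1, 1]` to `[0, 255]` mapping described above, not the exact implementation used by the image processor.
-
-```python
-import numpy as np
-
-# A unit-length surface normal pointing straight at the viewer
-normal = np.array([0.0, 0.0, 1.0])
-
-# Linear mapping from [-1, 1] into [0, 255]
-rgb = np.round((normal + 1.0) / 2.0 * 255.0).astype(np.uint8)
-print(rgb)  # approximately [128 128 255]
-```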
- -In this example, the nose tip almost certainly has a point on the surface, in which the surface normal vector points -straight at the viewer, meaning that its coordinates are `[0, 0, 1]`. -This vector maps to the RGB `[128, 128, 255]`, which corresponds to the violet-blue color. -Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the -red hue. -Points on the shoulders pointing up with a large `Y` promote green color. - -## Intrinsic Image Decomposition - -Marigold provides two models for Intrinsic Image Decomposition (IID): "Appearance" and "Lighting". -Each model produces Albedo maps, derived from InteriorVerse and Hypersim annotations, respectively. - -- The "Appearance" model also estimates Material properties: Roughness and Metallicity. -- The "Lighting" model generates Diffuse Shading and Non-diffuse Residual. - -Here is the sample code saving predictions made by the "Appearance" model: - -```python -import diffusers -import torch - -pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( - "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16 -).to("cuda") - -image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -intrinsics = pipe(image) - -vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) -vis[0]["albedo"].save("einstein_albedo.png") -vis[0]["roughness"].save("einstein_roughness.png") -vis[0]["metallicity"].save("einstein_metallicity.png") -``` - -Another example demonstrating the predictions made by the "Lighting" model: - -```python -import diffusers -import torch - -pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( - "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16 -).to("cuda") - -image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -intrinsics = pipe(image) - -vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) -vis[0]["albedo"].save("einstein_albedo.png") -vis[0]["shading"].save("einstein_shading.png") -vis[0]["residual"].save("einstein_residual.png") -``` - -Both models share the same pipeline while supporting different decomposition types. -The exact decomposition parameterization (e.g., sRGB vs. linear space) is stored in the -`pipe.target_properties` dictionary, which is passed into the -[`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics`] function. - -Below are some examples showcasing the predicted decomposition outputs. -All modalities can be inspected in the -[Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) Space. - -
-*Figures: Predicted albedo ("Appearance" model) | Predicted diffuse shading ("Lighting" model)*
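-
-The HyperSim parameterization \\(I = A*S+R\\) listed in the checkpoint table above can be sanity-checked by recombining the predicted components. The sketch below uses placeholder arrays and ignores the sRGB/linear conversion details stored in `pipe.target_properties`, so treat the shapes and the color space as assumptions.
-
-```python
-import numpy as np
-
-# Placeholder albedo, diffuse shading, and non-diffuse residual maps (H x W x 3)
-albedo = np.random.rand(480, 640, 3).astype(np.float32)
-shading = np.random.rand(480, 640, 3).astype(np.float32)
-residual = np.random.rand(480, 640, 3).astype(np.float32)
-
-# HyperSim decomposition: I = A * S + R
-reconstruction = albedo * shading + residual
-print(reconstruction.shape)
-```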
- -## Speeding up inference - -The above quick start snippets are already optimized for quality and speed, loading the checkpoint, utilizing the -`fp16` variant of weights and computation, and performing the default number (4) of denoising diffusion steps. -The first step to accelerate inference, at the expense of prediction quality, is to reduce the denoising diffusion -steps to the minimum: - -```diff - import diffusers - import torch - - pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 - ).to("cuda") - - image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -- depth = pipe(image) -+ depth = pipe(image, num_inference_steps=1) -``` - -With this change, the `pipe` call completes in 280ms on RTX 3090 GPU. -Internally, the input image is first encoded using the Stable Diffusion VAE encoder, followed by a single denoising -step performed by the U-Net. -Finally, the prediction latent is decoded with the VAE decoder into pixel space. -In this setup, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM. -Since Marigold's latent space is compatible with Stable Diffusion 2.0, inference can be accelerated by more than 3x, -reducing the call time to 85ms on an RTX 3090, by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny). -Note that using a lightweight VAE may slightly reduce the visual quality of the predictions. - -```diff - import diffusers - import torch - - pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 - ).to("cuda") - -+ pipe.vae = diffusers.AutoencoderTiny.from_pretrained( -+ "madebyollin/taesd", torch_dtype=torch.float16 -+ ).cuda() - - image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - - depth = pipe(image, num_inference_steps=1) -``` - -So far, we have optimized the number of diffusion steps and model components. Self-attention operations account for a -significant portion of computations. -Speeding them up can be achieved by using a more efficient attention processor: - -```diff - import diffusers - import torch -+ from diffusers.models.attention_processor import AttnProcessor2_0 - - pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 - ).to("cuda") - -+ pipe.vae.set_attn_processor(AttnProcessor2_0()) -+ pipe.unet.set_attn_processor(AttnProcessor2_0()) - - image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - - depth = pipe(image, num_inference_steps=1) -``` - -Finally, as suggested in [Optimizations](../optimization/fp16#torchcompile), enabling `torch.compile` can further enhance performance depending on -the target hardware. -However, compilation incurs a significant overhead during the first pipeline invocation, making it beneficial only when -the same pipeline instance is called repeatedly, such as within a loop. 
- -```diff - import diffusers - import torch - from diffusers.models.attention_processor import AttnProcessor2_0 - - pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 - ).to("cuda") - - pipe.vae.set_attn_processor(AttnProcessor2_0()) - pipe.unet.set_attn_processor(AttnProcessor2_0()) - -+ pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True) -+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - - image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - - depth = pipe(image, num_inference_steps=1) -``` - -## Maximizing Precision and Ensembling - -Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents. -This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion. -The ensembling path is activated automatically when the `ensemble_size` argument is set greater or equal than `3`. -When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`. -The recommended values vary across checkpoints but primarily depend on the scheduler type. -The effect of ensembling is particularly well-seen with surface normals: - -```diff - import diffusers - - pipe = diffusers.MarigoldNormalsPipeline.from_pretrained("prs-eth/marigold-normals-v1-1").to("cuda") - - image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -- depth = pipe(image) -+ depth = pipe(image, num_inference_steps=10, ensemble_size=5) - - vis = pipe.image_processor.visualize_normals(depth.prediction) - vis[0].save("einstein_normals.png") -``` - -
-*Figures: Surface normals, no ensembling | Surface normals, with ensembling*
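-
-One way to quantify the effect shown above is to measure the per-pixel angle between the two normals predictions. The snippet below is a hypothetical sketch: it assumes both `prediction` arrays have shape `(1, H, W, 3)` and contain unit-length vectors, and it uses random placeholders that you would replace with the real outputs.
-
-```python
-import numpy as np
-
-# Placeholders for the predictions with and without ensembling (replace with the real arrays)
-normals_plain = np.random.randn(1, 64, 64, 3).astype(np.float32)
-normals_plain /= np.linalg.norm(normals_plain, axis=-1, keepdims=True)
-normals_ensembled = np.random.randn(1, 64, 64, 3).astype(np.float32)
-normals_ensembled /= np.linalg.norm(normals_ensembled, axis=-1, keepdims=True)
-
-# Per-pixel angular difference in degrees
-cos = np.clip(np.sum(normals_plain * normals_ensembled, axis=-1), -1.0, 1.0)
-angle_deg = np.degrees(np.arccos(cos))
-print(f"mean angular difference: {angle_deg.mean():.2f} degrees")
-```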
-
-As can be seen, all areas with fine-grained structures, such as hair, receive more conservative and, on average, more
-correct predictions.
-Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.
-
-## Frame-by-frame Video Processing with Temporal Consistency
-
-Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent
-initialization.
-This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the
-following videos:
-
-*Videos: Input video | Marigold Depth applied to input video frames independently*
- -To address this issue, it is possible to pass `latents` argument to the pipelines, which defines the starting point of -diffusion. -Empirically, we found that a convex combination of the very same starting point noise latent and the latent -corresponding to the previous frame prediction give sufficiently smooth results, as implemented in the snippet below: - -```python -import imageio -import diffusers -import torch -from diffusers.models.attention_processor import AttnProcessor2_0 -from PIL import Image -from tqdm import tqdm - -device = "cuda" -path_in = "https://huggingface.co/spaces/prs-eth/marigold-lcm/resolve/c7adb5427947d2680944f898cd91d386bf0d4924/files/video/obama.mp4" -path_out = "obama_depth.gif" - -pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 -).to(device) -pipe.vae = diffusers.AutoencoderTiny.from_pretrained( - "madebyollin/taesd", torch_dtype=torch.float16 -).to(device) -pipe.unet.set_attn_processor(AttnProcessor2_0()) -pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True) -pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -pipe.set_progress_bar_config(disable=True) - -with imageio.get_reader(path_in) as reader: - size = reader.get_meta_data()['size'] - last_frame_latent = None - latent_common = torch.randn( - (1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size))) - ).to(device=device, dtype=torch.float16) - - out = [] - for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"): - frame = Image.fromarray(frame) - latents = latent_common - if last_frame_latent is not None: - latents = 0.9 * latents + 0.1 * last_frame_latent - - depth = pipe( - frame, - num_inference_steps=1, - match_input_resolution=False, - latents=latents, - output_latent=True, - ) - last_frame_latent = depth.latent - out.append(pipe.image_processor.visualize_depth(depth.prediction)[0]) - - diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()['fps']) -``` - -Here, the diffusion process starts from the given computed latent. -The pipeline sets `output_latent=True` to access `out.latent` and computes its contribution to the next frame's latent -initialization. -The result is much more stable now: - -
-*Videos: Marigold Depth applied to input video frames independently | Marigold Depth with forced latents initialization*
- -## Marigold for ControlNet - -A very common application for depth prediction with diffusion models comes in conjunction with ControlNet. -Depth crispness plays a crucial role in obtaining high-quality results from ControlNet. -As seen in comparisons with other methods above, Marigold excels at that task. -The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format: - -```python -import torch -import diffusers - -device = "cuda" -generator = torch.Generator(device=device).manual_seed(2024) -image = diffusers.utils.load_image( - "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png" -) - -pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", torch_dtype=torch.float16, variant="fp16" -).to(device) - -depth_image = pipe(image, generator=generator).prediction -depth_image = pipe.image_processor.visualize_depth(depth_image, color_map="binary") -depth_image[0].save("motorcycle_controlnet_depth.png") - -controlnet = diffusers.ControlNetModel.from_pretrained( - "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16, variant="fp16" -).to(device) -pipe = diffusers.StableDiffusionXLControlNetPipeline.from_pretrained( - "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnet -).to(device) -pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True) - -controlnet_out = pipe( - prompt="high quality photo of a sports bike, city", - negative_prompt="", - guidance_scale=6.5, - num_inference_steps=25, - image=depth_image, - controlnet_conditioning_scale=0.7, - control_guidance_end=0.7, - generator=generator, -).images -controlnet_out[0].save("motorcycle_controlnet_out.png") -``` - -
-*Figures: Input image | Depth in the format compatible with ControlNet | ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city"*
- -## Quantitative Evaluation - -To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), -follow the evaluation protocol outlined in the paper: load the full precision fp32 model and use appropriate values -for `num_inference_steps` and `ensemble_size`. -Optionally seed randomness to ensure reproducibility. -Maximizing `batch_size` will deliver maximum device utilization. - -```python -import diffusers -import torch - -device = "cuda" -seed = 2024 - -generator = torch.Generator(device=device).manual_seed(seed) -pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device) - -image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -depth = pipe( - image, - num_inference_steps=4, # set according to the evaluation protocol from the paper - ensemble_size=10, # set according to the evaluation protocol from the paper - generator=generator, -) - -# evaluate metrics -``` - -## Using Predictive Uncertainty - -The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random -latents. -As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify `ensemble_size` greater -or equal than 3 and set `output_uncertainty=True`. -The resulting uncertainty will be available in the `uncertainty` field of the output. -It can be visualized as follows: - -```python -import diffusers -import torch - -pipe = diffusers.MarigoldDepthPipeline.from_pretrained( - "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 -).to("cuda") - -image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") - -depth = pipe( - image, - ensemble_size=10, # any number >= 3 - output_uncertainty=True, -) - -uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty) -uncertainty[0].save("einstein_depth_uncertainty.png") -``` - -
-*Figures: Depth uncertainty | Surface normals uncertainty | Albedo uncertainty*
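-
-A common downstream use of these maps is to mask out the least reliable pixels. The sketch below assumes the `uncertainty` field is a NumPy array (its exact shape is squeezed away) and uses a relative percentile threshold rather than an absolute value, since the scale of the uncertainty is not guaranteed.
-
-```python
-import numpy as np
-
-# depth.uncertainty comes from the pipeline call above with output_uncertainty=True
-uncertainty = np.squeeze(np.asarray(depth.uncertainty))
-
-# Treat the top 10% most uncertain pixels as unreliable (relative threshold)
-threshold = np.percentile(uncertainty, 90)
-reliable_mask = uncertainty <= threshold
-print(f"reliable pixels: {reliable_mask.mean() * 100:.1f}%")
-```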
- -The interpretation of uncertainty is easy: higher values (white) correspond to pixels, where the model struggles to -make consistent predictions. -- The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly. -- The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the -collar area. -- Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel, -unlike depth and surface normals. It is also higher in shaded regions and at discontinuities. - -## Conclusion - -We hope Marigold proves valuable for your downstream tasks, whether as part of a broader generative workflow or for -perception-based applications like 3D reconstruction. \ No newline at end of file diff --git a/docs/source/en/using-diffusers/omnigen.md b/docs/source/en/using-diffusers/omnigen.md deleted file mode 100644 index 2880fedb3392..000000000000 --- a/docs/source/en/using-diffusers/omnigen.md +++ /dev/null @@ -1,317 +0,0 @@ - -# OmniGen - -OmniGen is an image generation model. Unlike existing text-to-image models, OmniGen is a single model designed to handle a variety of tasks (e.g., text-to-image, image editing, controllable generation). It has the following features: -- Minimalist model architecture, consisting of only a VAE and a transformer module, for joint modeling of text and images. -- Support for multimodal inputs. It can process any text-image mixed data as instructions for image generation, rather than relying solely on text. - -For more information, please refer to the [paper](https://huggingface.co/papers/2409.11340). -This guide will walk you through using OmniGen for various tasks and use cases. - -## Load model checkpoints - -Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method. - -```python -import torch -from diffusers import OmniGenPipeline - -pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16) -``` - -## Text-to-image - -For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image. -You can try setting the `height` and `width` parameters to generate images with different size. - -```python -import torch -from diffusers import OmniGenPipeline - -pipe = OmniGenPipeline.from_pretrained( - "Shitao/OmniGen-v1-diffusers", - torch_dtype=torch.bfloat16 -) -pipe.to("cuda") - -prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD." -image = pipe( - prompt=prompt, - height=1024, - width=1024, - guidance_scale=3, - generator=torch.Generator(device="cpu").manual_seed(111), -).images[0] -image.save("output.png") -``` - -
-*Figure: generated image*
- -## Image edit - -OmniGen supports multimodal inputs. -When the input includes an image, you need to add a placeholder `<|image_1|>` in the text prompt to represent the image. -It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image. - -```python -import torch -from diffusers import OmniGenPipeline -from diffusers.utils import load_image - -pipe = OmniGenPipeline.from_pretrained( - "Shitao/OmniGen-v1-diffusers", - torch_dtype=torch.bfloat16 -) -pipe.to("cuda") - -prompt="<|image_1|> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola." -input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")] -image = pipe( - prompt=prompt, - input_images=input_images, - guidance_scale=2, - img_guidance_scale=1.6, - use_input_image_size_as_output=True, - generator=torch.Generator(device="cpu").manual_seed(222) -).images[0] -image.save("output.png") -``` - -
-*Figures: original image | edited image*
- -OmniGen has some interesting features, such as visual reasoning, as shown in the example below. - -```python -prompt="If the woman is thirsty, what should she take? Find it in the image and highlight it in blue. <|image_1|>" -input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")] -image = pipe( - prompt=prompt, - input_images=input_images, - guidance_scale=2, - img_guidance_scale=1.6, - use_input_image_size_as_output=True, - generator=torch.Generator(device="cpu").manual_seed(0) -).images[0] -image.save("output.png") -``` - -
-*Figure: generated image*
- -## Controllable generation - -OmniGen can handle several classic computer vision tasks. As shown below, OmniGen can detect human skeletons in input images, which can be used as control conditions to generate new images. - -```python -import torch -from diffusers import OmniGenPipeline -from diffusers.utils import load_image - -pipe = OmniGenPipeline.from_pretrained( - "Shitao/OmniGen-v1-diffusers", - torch_dtype=torch.bfloat16 -) -pipe.to("cuda") - -prompt="Detect the skeleton of human in this image: <|image_1|>" -input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")] -image1 = pipe( - prompt=prompt, - input_images=input_images, - guidance_scale=2, - img_guidance_scale=1.6, - use_input_image_size_as_output=True, - generator=torch.Generator(device="cpu").manual_seed(333) -).images[0] -image1.save("image1.png") - -prompt="Generate a new photo using the following picture and text as conditions: <|image_1|>\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him." -input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png")] -image2 = pipe( - prompt=prompt, - input_images=input_images, - guidance_scale=2, - img_guidance_scale=1.6, - use_input_image_size_as_output=True, - generator=torch.Generator(device="cpu").manual_seed(333) -).images[0] -image2.save("image2.png") -``` - -
-*Figures: original image | detected skeleton | skeleton to image*
- - -OmniGen can also directly use relevant information from input images to generate new images. - -```python -import torch -from diffusers import OmniGenPipeline -from diffusers.utils import load_image - -pipe = OmniGenPipeline.from_pretrained( - "Shitao/OmniGen-v1-diffusers", - torch_dtype=torch.bfloat16 -) -pipe.to("cuda") - -prompt="Following the pose of this image <|image_1|>, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him." -input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")] -image = pipe( - prompt=prompt, - input_images=input_images, - guidance_scale=2, - img_guidance_scale=1.6, - use_input_image_size_as_output=True, - generator=torch.Generator(device="cpu").manual_seed(0) -).images[0] -image.save("output.png") -``` - -
-*Figure: generated image*
- -## ID and object preserving - -OmniGen can generate multiple images based on the people and objects in the input image and supports inputting multiple images simultaneously. -Additionally, OmniGen can extract desired objects from an image containing multiple objects based on instructions. - -```python -import torch -from diffusers import OmniGenPipeline -from diffusers.utils import load_image - -pipe = OmniGenPipeline.from_pretrained( - "Shitao/OmniGen-v1-diffusers", - torch_dtype=torch.bfloat16 -) -pipe.to("cuda") - -prompt="A man and a woman are sitting at a classroom desk. The man is the man with yellow hair in <|image_1|>. The woman is the woman on the left of <|image_2|>" -input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png") -input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.png") -input_images=[input_image_1, input_image_2] -image = pipe( - prompt=prompt, - input_images=input_images, - height=1024, - width=1024, - guidance_scale=2.5, - img_guidance_scale=1.6, - generator=torch.Generator(device="cpu").manual_seed(666) -).images[0] -image.save("output.png") -``` - -
-*Figures: input_image_1 | input_image_2 | generated image*
- -```py -import torch -from diffusers import OmniGenPipeline -from diffusers.utils import load_image - -pipe = OmniGenPipeline.from_pretrained( - "Shitao/OmniGen-v1-diffusers", - torch_dtype=torch.bfloat16 -) -pipe.to("cuda") - -prompt="A woman is walking down the street, wearing a white long-sleeve blouse with lace details on the sleeves, paired with a blue pleated skirt. The woman is <|image_1|>. The long-sleeve blouse and a pleated skirt are <|image_2|>." -input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg") -input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg") -input_images=[input_image_1, input_image_2] -image = pipe( - prompt=prompt, - input_images=input_images, - height=1024, - width=1024, - guidance_scale=2.5, - img_guidance_scale=1.6, - generator=torch.Generator(device="cpu").manual_seed(666) -).images[0] -image.save("output.png") -``` - -
-*Figures: person image | clothing image | generated image*
- -## Optimization when using multiple images - -For text-to-image task, OmniGen requires minimal memory and time costs (9GB memory and 31s for a 1024x1024 image on A800 GPU). -However, when using input images, the computational cost increases. - -Here are some guidelines to help you reduce computational costs when using multiple images. The experiments are conducted on an A800 GPU with two input images. - -Like other pipelines, you can reduce memory usage by offloading the model: `pipe.enable_model_cpu_offload()` or `pipe.enable_sequential_cpu_offload() `. -In OmniGen, you can also decrease computational overhead by reducing the `max_input_image_size`. -The memory consumption for different image sizes is shown in the table below: - -| Method | Memory Usage | -|---------------------------|--------------| -| max_input_image_size=1024 | 40GB | -| max_input_image_size=512 | 17GB | -| max_input_image_size=256 | 14GB | - diff --git a/docs/source/en/using-diffusers/pag.md b/docs/source/en/using-diffusers/pag.md deleted file mode 100644 index 46d716bcf8cc..000000000000 --- a/docs/source/en/using-diffusers/pag.md +++ /dev/null @@ -1,351 +0,0 @@ - - -# Perturbed-Attention Guidance - -[Perturbed-Attention Guidance (PAG)](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules. PAG is designed to progressively enhance the structure of synthesized samples throughout the denoising process by considering the self-attention mechanisms' ability to capture structural information. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, and guiding the denoising process away from these degraded samples. - -This guide will show you how to use PAG for various tasks and use cases. - - -## General tasks - -You can apply PAG to the [`StableDiffusionXLPipeline`] for tasks such as text-to-image, image-to-image, and inpainting. To enable PAG for a specific task, load the pipeline using the [AutoPipeline](../api/pipelines/auto_pipeline) API with the `enable_pag=True` flag and the `pag_applied_layers` argument. - -> [!TIP] -> 🤗 Diffusers currently only supports using PAG with selected SDXL pipelines and [`PixArtSigmaPAGPipeline`]. But feel free to open a [feature request](https://github.com/huggingface/diffusers/issues/new/choose) if you want to add PAG support to a new pipeline! - - - - -```py -from diffusers import AutoPipelineForText2Image -from diffusers.utils import load_image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - enable_pag=True, - pag_applied_layers=["mid"], - torch_dtype=torch.float16 -) -pipeline.enable_model_cpu_offload() -``` - -> [!TIP] -> The `pag_applied_layers` argument allows you to specify which layers PAG is applied to. Additionally, you can use `set_pag_applied_layers` method to update these layers after the pipeline has been created. Check out the [pag_applied_layers](#pag_applied_layers) section to learn more about applying PAG to other layers. - -If you already have a pipeline created and loaded, you can enable PAG on it using the `from_pipe` API with the `enable_pag` flag. Internally, a PAG pipeline is created based on the pipeline and task you specified. 
In the example below, since we used `AutoPipelineForText2Image` and passed a `StableDiffusionXLPipeline`, a `StableDiffusionXLPAGPipeline` is created accordingly. Note that this does not require additional memory, and you will have both `StableDiffusionXLPipeline` and `StableDiffusionXLPAGPipeline` loaded and ready to use. You can read more about the `from_pipe` API and how to reuse pipelines in Diffusers [here](https://huggingface.co/docs/diffusers/using-diffusers/loading#reuse-a-pipeline).
-
-```py
-pipeline_sdxl = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
-pipeline = AutoPipelineForText2Image.from_pipe(pipeline_sdxl, enable_pag=True)
-```
-
-To generate an image, you will also need to pass a `pag_scale`. When `pag_scale` increases, images gain more semantically coherent structures and exhibit fewer artifacts. However, an overly large guidance scale can lead to smoother textures and slight saturation in the images, similar to CFG. `pag_scale=3.0` is used in the official demo and works well in most use cases, but feel free to experiment and select the appropriate value according to your needs! PAG is disabled when `pag_scale=0`.
-
-```py
-prompt = "an insect robot preparing a delicious meal, anime style"
-
-for pag_scale in [0.0, 3.0]:
-    generator = torch.Generator(device="cpu").manual_seed(0)
-    images = pipeline(
-        prompt=prompt,
-        num_inference_steps=25,
-        guidance_scale=7.0,
-        generator=generator,
-        pag_scale=pag_scale,
-    ).images
-```
-
-*Figures: generated image without PAG | generated image with PAG*
-
-You can use PAG with image-to-image pipelines.
-
-```py
-from diffusers import AutoPipelineForImage2Image
-from diffusers.utils import load_image
-import torch
-
-pipeline = AutoPipelineForImage2Image.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0",
-    enable_pag=True,
-    pag_applied_layers=["mid"],
-    torch_dtype=torch.float16
-)
-pipeline.enable_model_cpu_offload()
-```
-
-If you already have an image-to-image pipeline and would like to enable PAG on it, you can run this:
-
-```py
-pipeline_i2i = AutoPipelineForImage2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
-pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_i2i, enable_pag=True)
-```
-
-It is also very easy to directly switch from a text-to-image pipeline to a PAG-enabled image-to-image pipeline:
-
-```py
-pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
-pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_t2i, enable_pag=True)
-```
-
-If you have a PAG-enabled text-to-image pipeline, you can directly switch to an image-to-image pipeline with PAG still enabled:
-
-```py
-pipeline_pag = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", enable_pag=True, torch_dtype=torch.float16)
-pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_pag)
-```
-
-Now let's generate an image!
-
-```py
-pag_scale = 4.0
-guidance_scale = 7.0
-
-url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
-init_image = load_image(url)
-prompt = "a dog catching a frisbee in the jungle"
-
-generator = torch.Generator(device="cpu").manual_seed(0)
-image = pipeline(
-    prompt,
-    image=init_image,
-    strength=0.8,
-    guidance_scale=guidance_scale,
-    pag_scale=pag_scale,
-    generator=generator).images[0]
-```
-
-You can also use PAG with inpainting pipelines.
-
-```py
-from diffusers import AutoPipelineForInpainting
-from diffusers.utils import load_image
-import torch
-
-pipeline = AutoPipelineForInpainting.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0",
-    enable_pag=True,
-    torch_dtype=torch.float16
-)
-pipeline.enable_model_cpu_offload()
-```
-
-You can enable PAG on an existing inpainting pipeline like this:
-
-```py
-pipeline_inpaint = AutoPipelineForInpainting.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
-pipeline = AutoPipelineForInpainting.from_pipe(pipeline_inpaint, enable_pag=True)
-```
-
-This still works when your pipeline has a different task:
-
-```py
-pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
-pipeline = AutoPipelineForInpainting.from_pipe(pipeline_t2i, enable_pag=True)
-```
-
-Let's generate an image!
-
-```py
-img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
-mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
-init_image = load_image(img_url).convert("RGB")
-mask_image = load_image(mask_url).convert("RGB")
-
-prompt = "A majestic tiger sitting on a bench"
-
-pag_scale = 3.0
-guidance_scale = 7.5
-
-generator = torch.Generator(device="cpu").manual_seed(1)
-images = pipeline(
-    prompt=prompt,
-    image=init_image,
-    mask_image=mask_image,
-    strength=0.8,
-    num_inference_steps=50,
-    guidance_scale=guidance_scale,
-    generator=generator,
-    pag_scale=pag_scale,
-).images
-images[0]
-```
-
-
-## PAG with ControlNet
-
-To use PAG with ControlNet, first create a `controlnet`. Then, pass the `controlnet` and other PAG arguments to the `from_pretrained` method of the AutoPipeline for the specified task.
-
-```py
-from diffusers import AutoPipelineForText2Image, ControlNetModel
-import torch
-
-controlnet = ControlNetModel.from_pretrained(
-    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
-)
-
-pipeline = AutoPipelineForText2Image.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0",
-    controlnet=controlnet,
-    enable_pag=True,
-    pag_applied_layers="mid",
-    torch_dtype=torch.float16
-)
-pipeline.enable_model_cpu_offload()
-```
-
-> [!TIP]
-> If you already have a ControlNet pipeline and want to enable PAG, you can use the `from_pipe` API: `AutoPipelineForText2Image.from_pipe(pipeline_controlnet, enable_pag=True)`
-
-You can use the pipeline in the same way you normally use ControlNet pipelines, with the added option to specify a `pag_scale` parameter. Note that PAG works well for unconditional generation. In this example, we will generate an image without a prompt.
-
-```py
-from diffusers.utils import load_image
-
-canny_image = load_image(
-    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/pag_control_input.png"
-)
-
-controlnet_conditioning_scale = 0.5  # example value; controls how strongly the Canny edges constrain the output
-
-for pag_scale in [0.0, 3.0]:
-    generator = torch.Generator(device="cpu").manual_seed(1)
-    images = pipeline(
-        prompt="",
-        controlnet_conditioning_scale=controlnet_conditioning_scale,
-        image=canny_image,
-        num_inference_steps=50,
-        guidance_scale=0,
-        generator=generator,
-        pag_scale=pag_scale,
-    ).images
-    images[0]
-```
-
-*Figures: generated image without PAG | generated image with PAG*
-
-## PAG with IP-Adapter
-
-[IP-Adapter](https://hf.co/papers/2308.06721) is a popular model that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. You can enable PAG on a pipeline with IP-Adapter loaded.
-
-```py
-from diffusers import AutoPipelineForText2Image
-from diffusers.utils import load_image
-from transformers import CLIPVisionModelWithProjection
-import torch
-
-image_encoder = CLIPVisionModelWithProjection.from_pretrained(
-    "h94/IP-Adapter",
-    subfolder="models/image_encoder",
-    torch_dtype=torch.float16
-)
-
-pipeline = AutoPipelineForText2Image.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0",
-    image_encoder=image_encoder,
-    enable_pag=True,
-    torch_dtype=torch.float16
-).to("cuda")
-
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")
-
-pag_scale = 5.0
-ip_adapter_scale = 0.8
-
-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
-
-pipeline.set_ip_adapter_scale(ip_adapter_scale)
-generator = torch.Generator(device="cpu").manual_seed(0)
-images = pipeline(
-    prompt="a polar bear sitting in a chair drinking a milkshake",
-    ip_adapter_image=image,
-    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
-    num_inference_steps=25,
-    guidance_scale=3.0,
-    generator=generator,
-    pag_scale=pag_scale,
-).images
-images[0]
-```
-
-PAG reduces artifacts and improves the overall composition.
-
-*Figures: generated image without PAG | generated image with PAG*
-
-## Configure parameters
-
-### pag_applied_layers
-
-The `pag_applied_layers` argument allows you to specify which layers PAG is applied to. By default, it applies only to the mid blocks. Changing this setting will significantly impact the output. You can use the `set_pag_applied_layers` method to adjust the PAG layers after the pipeline is created, helping you find the optimal layers for your model.
-
-As an example, here are the images generated with `pag_layers = ["down.block_2"]` and `pag_layers = ["down.block_2", "up.block_1.attentions_0"]`:
-
-```py
-pag_layers = ["down.block_2", "up.block_1.attentions_0"]  # or ["down.block_2"]
-pag_scale = 3.0
-guidance_scale = 7.0
-
-prompt = "an insect robot preparing a delicious meal, anime style"
-pipeline.set_pag_applied_layers(pag_layers)
-generator = torch.Generator(device="cpu").manual_seed(0)
-images = pipeline(
-    prompt=prompt,
-    num_inference_steps=25,
-    guidance_scale=guidance_scale,
-    generator=generator,
-    pag_scale=pag_scale,
-).images
-images[0]
-```
-
-*Figures: down.block_2 + up.block_1.attentions_0 | down.block_2*
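-
-To explore other candidate layers, it can help to inspect which self-attention modules the UNet actually contains. The sketch below relies on the UNet's `attn_processors` mapping; note that the short identifiers accepted by `pag_applied_layers` (for example `"down.block_2"`) are shorthand patterns, not these full module names.
-
-```py
-# List the UNet's self-attention modules to get a sense of where PAG can be applied.
-# `attn1` modules are the self-attention layers that PAG perturbs.
-for name in pipeline.unet.attn_processors:
-    if "attn1" in name:
-        print(name)
-```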
diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md deleted file mode 100644 index 106005c33807..000000000000 --- a/docs/source/en/using-diffusers/sdxl.md +++ /dev/null @@ -1,458 +0,0 @@ - - -# Stable Diffusion XL - -[[open-in-colab]] - -[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways: - -1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters -2. introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped -3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details - -This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting. - -Before you begin, make sure you have the following libraries installed: - -```py -# uncomment to install the necessary libraries in Colab -#!pip install -q diffusers transformers accelerate invisible-watermark>=0.2.0 -``` - - - -We recommend installing the [invisible-watermark](https://pypi.org/project/invisible-watermark/) library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. To disable the watermarker: - -```py -pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) -``` - - - -## Load model checkpoints - -Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: - -```py -from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline -import torch - -pipeline = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" -).to("cuda") -``` - -You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: - -```py -from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline -import torch - -pipeline = StableDiffusionXLPipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", - torch_dtype=torch.float16 -).to("cuda") - -refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16 -).to("cuda") -``` - -## Text-to-image - -For text-to-image, pass a text prompt. By default, SDXL generates a 1024x1024 image for the best results. You can try setting the `height` and `width` parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work. 
- -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline_text2image = AutoPipelineForText2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipeline_text2image(prompt=prompt).images[0] -image -``` - -
-*Figure: generated image of an astronaut in a jungle*
- -## Image-to-image - -For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: - -```py -from diffusers import AutoPipelineForImage2Image -from diffusers.utils import load_image, make_image_grid - -# use from_pipe to avoid consuming additional memory when loading a checkpoint -pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") - -url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" -init_image = load_image(url) -prompt = "a dog catching a frisbee in the jungle" -image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0] -make_image_grid([init_image, image], rows=1, cols=2) -``` - -
-*Figure: generated image of a dog catching a frisbee in a jungle*
- -## Inpainting - -For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. - -```py -from diffusers import AutoPipelineForInpainting -from diffusers.utils import load_image, make_image_grid - -# use from_pipe to avoid consuming additional memory when loading a checkpoint -pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") - -img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" -mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" - -init_image = load_image(img_url) -mask_image = load_image(mask_url) - -prompt = "A deep sea diver floating" -image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] -make_image_grid([init_image, mask_image, image], rows=1, cols=3) -``` - -
-*Figure: generated image of a deep sea diver in a jungle*
- -## Refine image quality - -SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: - -1. use the base and refiner models together to produce a refined image -2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained) - -### Base + refiner model - -When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. - -As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: - -```py -from diffusers import DiffusionPipeline -import torch - -base = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=base.text_encoder_2, - vae=base.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -).to("cuda") -``` - -To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter. - - - -The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff. - - - -Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image. - -```py -prompt = "A majestic lion jumping from a big stone at night" - -image = base( - prompt=prompt, - num_inference_steps=40, - denoising_end=0.8, - output_type="latent", -).images -image = refiner( - prompt=prompt, - num_inference_steps=40, - denoising_start=0.8, - image=image, -).images[0] -image -``` - -
-[Figure: generated image of a lion on a rock at night (default base model) next to a higher quality version (ensemble of expert denoisers)]
- -The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]: - -```py -from diffusers import StableDiffusionXLInpaintPipeline -from diffusers.utils import load_image, make_image_grid -import torch - -base = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=base.text_encoder_2, - vae=base.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -).to("cuda") - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url) -mask_image = load_image(mask_url) - -prompt = "A majestic tiger sitting on a bench" -num_inference_steps = 75 -high_noise_frac = 0.7 - -image = base( - prompt=prompt, - image=init_image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_end=high_noise_frac, - output_type="latent", -).images -image = refiner( - prompt=prompt, - image=image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_start=high_noise_frac, -).images[0] -make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3) -``` - -This ensemble of expert denoisers method works well for all available schedulers! - -### Base to refiner model - -SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting. - -Load the base and refiner models: - -```py -from diffusers import DiffusionPipeline -import torch - -base = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=base.text_encoder_2, - vae=base.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -).to("cuda") -``` - - - -You can use SDXL refiner with a different base model. For example, you can use the [Hunyuan-DiT](../../api/pipelines/hunyuandit) or [PixArt-Sigma](../../api/pipelines/pixart_sigma) pipelines to generate images with better prompt adherence. Once you have generated an image, you can pass it to the SDXL refiner model to enhance final generation quality. - - - -Generate an image from the base model, and set the model output to **latent** space: - -```py -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" - -image = base(prompt=prompt, output_type="latent").images[0] -``` - -Pass the generated image to the refiner model: - -```py -image = refiner(prompt=prompt, image=image[None, :]).images[0] -``` - -
-[Figure: generated image of an astronaut riding a green horse on Mars (base model) next to a higher quality version (base model + refiner model)]
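-
-The tip above mentions pairing the SDXL refiner with a different base model such as Hunyuan-DiT or PixArt-Sigma. The snippet below is a minimal sketch of that workflow, not an official recipe: the PixArt-Sigma checkpoint id is an assumption, and the refiner simply treats the other model's output as a regular image-to-image input.
-
-```py
-import torch
-from diffusers import PixArtSigmaPipeline, StableDiffusionXLImg2ImgPipeline
-
-# assumed PixArt-Sigma checkpoint; swap in the base model you actually want to use
-base = PixArtSigmaPipeline.from_pretrained(
-    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
-).to("cuda")
-refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-).to("cuda")
-
-prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
-
-# generate with the non-SDXL base model, then let the SDXL refiner add detail
-# (tune strength to control how much the refiner changes the input image)
-image = base(prompt=prompt).images[0]
-image = refiner(prompt=prompt, image=image, strength=0.3).images[0]
-```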
- -For inpainting, load the base and the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. - -## Micro-conditioning - -SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images. - - - -You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. - - - -### Size conditioning - -There are two types of size conditioning: - -- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset. - -- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options! - -🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe( - prompt=prompt, - negative_original_size=(512, 512), - negative_target_size=(1024, 1024), -).images[0] -``` - -
-[Figure: images negatively conditioned on image resolutions of (128, 128), (256, 256), and (512, 512)]
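-
-The example above only exercises the negative variants. As a minimal sketch of the positive conditioning described earlier, `original_size` and `target_size` can be passed directly to the pipeline call; the values below simply mirror the behavior discussed above, where a low `original_size` makes the 1024x1024 output look like the low-resolution images in the training data.
-
-```py
-from diffusers import StableDiffusionXLPipeline
-import torch
-
-pipeline = StableDiffusionXLPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-).to("cuda")
-
-prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
-# original_size=(256, 256) mimics the look of low-resolution training images,
-# while target_size=(1024, 1024) keeps the composition of square images in the dataset
-image = pipeline(prompt=prompt, original_size=(256, 256), target_size=(1024, 1024)).images[0]
-```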
- -### Crop conditioning - -Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🤗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions! - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipeline = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0] -image -``` - -
-[Figure: generated image of an astronaut in a jungle, slightly cropped]
- -You can also specify negative cropping coordinates to steer generation away from certain cropping parameters: - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe( - prompt=prompt, - negative_original_size=(512, 512), - negative_crops_coords_top_left=(0, 0), - negative_target_size=(1024, 1024), -).images[0] -image -``` - -## Use a different prompt for each text-encoder - -SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts): - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipeline = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -# prompt is passed to OAI CLIP-ViT/L-14 -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -# prompt_2 is passed to OpenCLIP-ViT/bigG-14 -prompt_2 = "Van Gogh painting" -image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] -image -``` - -
-[Figure: generated image of an astronaut in a jungle in the style of a van gogh painting]
- -The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section. - -## Optimizations - -SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference. - -1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors: - -```diff -- base.to("cuda") -- refiner.to("cuda") -+ base.enable_model_cpu_offload() -+ refiner.enable_model_cpu_offload() -``` - -2. Use `torch.compile` for ~20% speed-up (you need `torch>=2.0`): - -```diff -+ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) -+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) -``` - -3. Enable [xFormers](../optimization/xformers) to run SDXL if `torch<2.0`: - -```diff -+ base.enable_xformers_memory_efficient_attention() -+ refiner.enable_xformers_memory_efficient_attention() -``` - -## Other resources - -If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🤗 Diffusers. diff --git a/docs/source/en/using-diffusers/sdxl_turbo.md b/docs/source/en/using-diffusers/sdxl_turbo.md deleted file mode 100644 index 83d591ced304..000000000000 --- a/docs/source/en/using-diffusers/sdxl_turbo.md +++ /dev/null @@ -1,118 +0,0 @@ - - -# Stable Diffusion XL Turbo - -[[open-in-colab]] - -SDXL Turbo is an adversarial time-distilled [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) model capable -of running inference in as little as 1 step. - -This guide will show you how to use SDXL-Turbo for text-to-image and image-to-image. - -Before you begin, make sure you have the following libraries installed: - -```py -# uncomment to install the necessary libraries in Colab -#!pip install -q diffusers transformers accelerate -``` - -## Load model checkpoints - -Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16") -pipeline = pipeline.to("cuda") -``` - -You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally. For this loading method, you need to set `timestep_spacing="trailing"` (feel free to experiment with the other scheduler config values to get better results): - -```py -from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler -import torch - -pipeline = StableDiffusionXLPipeline.from_single_file( - "https://huggingface.co/stabilityai/sdxl-turbo/blob/main/sd_xl_turbo_1.0_fp16.safetensors", - torch_dtype=torch.float16, variant="fp16") -pipeline = pipeline.to("cuda") -pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config, timestep_spacing="trailing") -``` - -## Text-to-image - -For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. 
You can try setting the `height` and `width` parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so.
-
-Make sure to set `guidance_scale` to 0.0 to disable classifier-free guidance, as the model was trained without it. A single inference step is enough to generate high-quality images.
-Increasing the number of steps to 2, 3, or 4 should improve image quality.
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-
-pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
-pipeline_text2image = pipeline_text2image.to("cuda")
-
-prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."
-
-image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
-image
-```
-
-[Figure: generated image of a racoon in a robe]
-
-## Image-to-image
-
-For image-to-image generation, make sure that `num_inference_steps * strength` is larger than or equal to 1.
-The image-to-image pipeline will run for `int(num_inference_steps * strength)` steps, e.g. `0.5 * 2.0 = 1` step in
-our example below.
-
-```py
-from diffusers import AutoPipelineForImage2Image
-from diffusers.utils import load_image, make_image_grid
-
-# use from_pipe to avoid consuming additional memory when loading a checkpoint
-pipeline_image2image = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
-
-init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
-init_image = init_image.resize((512, 512))
-
-prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
-
-image = pipeline_image2image(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2).images[0]
-make_image_grid([init_image, image], rows=1, cols=2)
-```
-
-[Figure: image-to-image generation sample using SDXL Turbo]
-
-## Speed-up SDXL Turbo even more
-
-- Compile the UNet if you are using PyTorch version 2.0 or higher. The first inference run will be very slow, but subsequent ones will be much faster.
-
-```py
-pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-```
-
-- When using the default VAE, keep it in `float32` to avoid costly `dtype` conversions before and after each generation. You only need to do this once before your first generation:
-
-```py
-pipe.upcast_vae()
-```
-
-As an alternative, you can also use a [16-bit VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) created by community member [`@madebyollin`](https://huggingface.co/madebyollin) that does not need to be upcast to `float32`.
diff --git a/docs/source/en/using-diffusers/shap-e.md b/docs/source/en/using-diffusers/shap-e.md
deleted file mode 100644
index 51f0f53b0221..000000000000
--- a/docs/source/en/using-diffusers/shap-e.md
+++ /dev/null
@@ -1,192 +0,0 @@
-
-
-# Shap-E
-
-[[open-in-colab]]
-
-Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets that is post-processed to render more views of each object and to produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps:
-
-1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset
-2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications
-
-This guide will show you how to use Shap-E to start generating your own 3D assets!
-
-Before you begin, make sure you have the following libraries installed:
-
-```py
-# uncomment to install the necessary libraries in Colab
-#!pip install -q diffusers transformers accelerate trimesh
-```
-
-## Text-to-3D
-
-To generate a gif of a 3D object, pass a text prompt to the [`ShapEPipeline`]. The pipeline generates a list of image frames which are used to create the 3D object.
-
-```py
-import torch
-from diffusers import ShapEPipeline
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
-pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
-pipe = pipe.to(device)
-
-guidance_scale = 15.0
-prompt = ["A firecracker", "A birthday cupcake"]
-
-images = pipe(
-    prompt,
-    guidance_scale=guidance_scale,
-    num_inference_steps=64,
-    frame_size=256,
-).images
-```
-
-Now use the [`~utils.export_to_gif`] function to turn the list of image frames into a gif of the 3D object.
-
-```py
-from diffusers.utils import export_to_gif
-
-export_to_gif(images[0], "firecracker_3d.gif")
-export_to_gif(images[1], "cake_3d.gif")
-```
-
-[Figure: generated 3D gifs for prompt = "A firecracker" and prompt = "A birthday cupcake"]
- -## Image-to-3D - -To generate a 3D object from another image, use the [`ShapEImg2ImgPipeline`]. You can use an existing image or generate an entirely new one. Let's use the [Kandinsky 2.1](../api/pipelines/kandinsky) model to generate a new image. - -```py -from diffusers import DiffusionPipeline -import torch - -prior_pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda") - -prompt = "A cheeseburger, white background" - -image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple() -image = pipeline( - prompt, - image_embeds=image_embeds, - negative_image_embeds=negative_image_embeds, -).images[0] - -image.save("burger.png") -``` - -Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D representation of it. - -```py -from PIL import Image -from diffusers import ShapEImg2ImgPipeline -from diffusers.utils import export_to_gif - -pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda") - -guidance_scale = 3.0 -image = Image.open("burger.png").resize((256, 256)) - -images = pipe( - image, - guidance_scale=guidance_scale, - num_inference_steps=64, - frame_size=256, -).images - -gif_path = export_to_gif(images[0], "burger_3d.gif") -``` - -
-[Figure: the generated cheeseburger image and the resulting 3D cheeseburger gif]
- -## Generate mesh - -Shap-E is a flexible model that can also generate textured mesh outputs to be rendered for downstream applications. In this example, you'll convert the output into a `glb` file because the 🤗 Datasets library supports mesh visualization of `glb` files which can be rendered by the [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer#dataset-preview). - -You can generate mesh outputs for both the [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`] by specifying the `output_type` parameter as `"mesh"`: - -```py -import torch -from diffusers import ShapEPipeline - -device = torch.device("cuda" if torch.cuda.is_available() else "cpu") - -pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16") -pipe = pipe.to(device) - -guidance_scale = 15.0 -prompt = "A birthday cupcake" - -images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images -``` - -Use the [`~utils.export_to_ply`] function to save the mesh output as a `ply` file: - - - -You can optionally save the mesh output as an `obj` file with the [`~utils.export_to_obj`] function. The ability to save the mesh output in a variety of formats makes it more flexible for downstream usage! - - - -```py -from diffusers.utils import export_to_ply - -ply_path = export_to_ply(images[0], "3d_cake.ply") -print(f"Saved to folder: {ply_path}") -``` - -Then you can convert the `ply` file to a `glb` file with the trimesh library: - -```py -import trimesh - -mesh = trimesh.load("3d_cake.ply") -mesh_export = mesh.export("3d_cake.glb", file_type="glb") -``` - -By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform: - -```py -import trimesh -import numpy as np - -mesh = trimesh.load("3d_cake.ply") -rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0]) -mesh = mesh.apply_transform(rot) -mesh_export = mesh.export("3d_cake.glb", file_type="glb") -``` - -Upload the mesh file to your dataset repository to visualize it with the Dataset viewer! - -
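-
-One way to get the `glb` file into a dataset repository is with the `huggingface_hub` client. The snippet below is a minimal sketch, where `"your-username/3d-assets"` is a placeholder dataset repo id that you would create beforehand or replace with your own.
-
-```py
-from huggingface_hub import HfApi
-
-api = HfApi()
-# upload the exported mesh to a dataset repo so the Dataset viewer can render it
-api.upload_file(
-    path_or_fileobj="3d_cake.glb",
-    path_in_repo="3d_cake.glb",
-    repo_id="your-username/3d-assets",  # placeholder repo id
-    repo_type="dataset",
-)
-```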
diff --git a/docs/source/en/using-diffusers/svd.md b/docs/source/en/using-diffusers/svd.md
deleted file mode 100644
index bd6d5c332c13..000000000000
--- a/docs/source/en/using-diffusers/svd.md
+++ /dev/null
@@ -1,122 +0,0 @@
-
-
-# Stable Video Diffusion
-
-[[open-in-colab]]
-
-[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high-resolution (576x1024) videos conditioned on an input image.
-
-This guide will show you how to use SVD to generate short videos from images.
-
-Before you begin, make sure you have the following libraries installed:
-
-```py
-# install the necessary libraries in Colab
-!pip install -q -U diffusers transformers accelerate
-```
-
-There are two variants of this model, [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further finetuned to generate 25 frames.
-
-You'll use the SVD-XT checkpoint for this guide.
-
-```python
-import torch
-
-from diffusers import StableVideoDiffusionPipeline
-from diffusers.utils import load_image, export_to_video
-
-pipe = StableVideoDiffusionPipeline.from_pretrained(
-    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
-)
-pipe.enable_model_cpu_offload()
-
-# Load the conditioning image
-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
-image = image.resize((1024, 576))
-
-generator = torch.manual_seed(42)
-frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
-
-export_to_video(frames, "generated.mp4", fps=7)
-```
-
-[Figure: "source image of a rocket" and "generated video from source image"]
-
-## torch.compile
-
-You can gain a 20-25% speedup at the expense of slightly increased memory by [compiling](../optimization/fp16#torchcompile) the UNet.
-
-```diff
-- pipe.enable_model_cpu_offload()
-+ pipe.to("cuda")
-+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-```
-
-## Reduce memory usage
-
-Video generation is very memory intensive because you're essentially generating `num_frames` frames all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade off inference speed for a lower memory requirement:
-
-- enable model offloading: each component of the pipeline is offloaded to the CPU once it's not needed anymore.
-- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size.
-- reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory (we recommend adjusting this value based on your GPU memory) but the video might have some flickering.
-
-```diff
-- pipe.enable_model_cpu_offload()
-- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
-+ pipe.enable_model_cpu_offload()
-+ pipe.unet.enable_forward_chunking()
-+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
-```
-
-Using all these tricks together should lower the memory requirement to less than 8GB VRAM.
-
-## Micro-conditioning
-
-Stable Video Diffusion also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video:
-
-- `fps`: the frames per second of the generated video.
-- `motion_bucket_id`: the motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id increases the motion of the generated video.
-- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.
-
-For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:
-
-```python
-import torch
-
-from diffusers import StableVideoDiffusionPipeline
-from diffusers.utils import load_image, export_to_video
-
-pipe = StableVideoDiffusionPipeline.from_pretrained(
-    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
-)
-pipe.enable_model_cpu_offload()
-
-# Load the conditioning image
-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
-image = image.resize((1024, 576))
-
-generator = torch.manual_seed(42)
-frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
-export_to_video(frames, "generated.mp4", fps=7)
-```
-
-![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket_with_conditions.gif)