|
95 | 95 | "source": [ |
96 | 96 | "from datasets import load_dataset\n", |
97 | 97 | "\n", |
98 | | - "prompts = load_dataset(\"nateraw/parti-prompts\", split=\"train\")\n", |
99 | | - "prompts = prompts.shuffle()\n", |
100 | | - "sample_prompts = [prompts[i][\"Prompt\"] for i in range(5)]\n", |
| 98 | + "# prompts = load_dataset(\"nateraw/parti-prompts\", split=\"train\")\n", |
| 99 | + "# prompts = prompts.shuffle()\n", |
| 100 | + "# sample_prompts = [prompts[i][\"Prompt\"] for i in range(5)]\n", |
101 | 101 | "\n", |
| 102 | + "# Fixing these sample prompts in the interest of reproducibility.\n", |
102 | 103 | "sample_prompts = [\n", |
103 | 104 | " \"a corgi\",\n", |
104 | 105 | " \"a hot air balloon with a yin-yang symbol, with the moon visible in the daytime sky\",\n", |
|
169 | 170 | "\n", |
170 | 171 | "> 💡 **Tip:** It is useful to look at some inference samples while a model is training to measure the \n", |
171 | 172 | "training progress. In our [training scripts](https://github.com/huggingface/diffusers/tree/main/examples/), we support this utility with additional support for\n", |
172 | | - "logging to TensorBoard and Weights and Biases." |
| 173 | + "logging to TensorBoard and Weights & Biases." |
173 | 174 | ], |
174 | 175 | "metadata": { |
175 | 176 | "id": "tBQjJ36RI-gD" |
|
178 | 179 | { |
179 | 180 | "cell_type": "markdown", |
180 | 181 | "source": [ |
181 | | - "## Quantitative\n", |
| 182 | + "## Quantitative Evaluation\n", |
182 | 183 | "\n", |
183 | 184 | "In this section, we will walk you through how to evaluate three different diffusion pipelines using:\n", |
184 | 185 | "\n", |
|
268 | 269 | { |
269 | 270 | "cell_type": "markdown", |
270 | 271 | "source": [ |
271 | | - "In the above example, we generated one image per prompt. If we generated multiple images per prompt, we could uniformly sample just one from the pool of generated images.\n", |
| 272 | + "In the above example, we generated one image per prompt. If we generated multiple images per prompt, we would have to take the average score from the generated images per prompt.\n", |
272 | 273 | "\n", |
273 | 274 | "Now, if we wanted to compare two checkpoints compatible with the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) we should pass a generator while calling the pipeline. First, we generate images with a fixed seed with the [v1-4 Stable Diffusion checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4):\n" |
274 | 275 | ], |
|
660 | 661 | "\n", |
661 | 662 | "We can use these metrics for similar pipelines such as the[`StableDiffusionPix2PixZeroPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/pix2pix_zero#diffusers.StableDiffusionPix2PixZeroPipeline)`.\n", |
662 | 663 | "\n", |
663 | | - "> Both CLIP score and CLIP direction similarity rely on the CLIP model, which can make the evaluations biased.\n", |
| 664 | + "> **Info**: Both CLIP score and CLIP direction similarity rely on the CLIP model, which can make the evaluations biased.\n", |
664 | 665 | "\n", |
665 | 666 | "***Extending metrics like IS, FID (discussed later), or KID can be difficult*** when the model under evaluation was pre-trained on a large image-captioning dataset (such as the [LAION-5B dataset](https://laion.ai/blog/laion-5b/)). This is because underlying these metrics is an InceptionNet (pre-trained on the ImageNet-1k dataset) used for extracting intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with the pre-training dataset of InceptionNet, so it is not a good candidate here for feature extraction.\n", |
666 | 667 | "\n", |
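For reference, the CLIP directional-similarity idea mentioned above can be sketched as follows. The checkpoint choice and the function interface are assumptions for illustration, and the notebook's own implementation may differ:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-large-patch14"  # assumed checkpoint
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

@torch.no_grad()
def directional_similarity(original_image, edited_image, original_caption, edited_caption):
    # Encode images and captions with CLIP, then compare the *direction* of the edit
    # in image space with the direction of the edit in text space.
    img_inputs = processor(images=[original_image, edited_image], return_tensors="pt")
    txt_inputs = processor(text=[original_caption, edited_caption], padding=True, return_tensors="pt")
    img_feats = F.normalize(model.get_image_features(**img_inputs), dim=-1)
    txt_feats = F.normalize(model.get_text_features(**txt_inputs), dim=-1)
    img_dir = img_feats[1] - img_feats[0]
    txt_dir = txt_feats[1] - txt_feats[0]
    return F.cosine_similarity(img_dir, txt_dir, dim=-1).item()
```

A higher value means the change applied to the image moves its CLIP embedding in the same direction as the change described by the captions.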
|
675 | 676 | "source": [ |
676 | 677 | "### Class-conditioned image generation\n", |
677 | 678 | "\n", |
678 | | - "Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID ([Heusel et al.](https://arxiv.org/abs/1706.08500)). We show how to compute it with the [`DiTPipeline`], which uses the [DiT model](https://arxiv.org/abs/2212.09748) under the hood.\n", |
| 679 | + "Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID ([Heusel et al.](https://arxiv.org/abs/1706.08500)). We show how to compute it with the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit), which uses the [DiT model](https://arxiv.org/abs/2212.09748) under the hood.\n", |
679 | 680 | "\n", |
680 | 681 | "FID aims to measure how similar are two datasets of images. As per [this resource](https://mmgeneration.readthedocs.io/en/latest/quick_run.html#fid):\n", |
681 | 682 | "\n", |
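For reference, FID has a standard closed form as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images (this is the general definition, not text taken from the linked resource):

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features computed on the real and generated sets, respectively.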
|
735 | 736 | { |
736 | 737 | "cell_type": "markdown", |
737 | 738 | "source": [ |
738 | | - "These images are from the following Imagenet-1k classes: \"cassette_player\", \"chain_saw\", \"church\", \"gas_pump\", \"parachute\", and \"tench\".\n", |
| 739 | + "These are 10 images from the following Imagenet-1k classes: \"cassette_player\", \"chain_saw\" (x2), \"church\", \"gas_pump\" (x3), \"parachute\" (x2), and \"tench\".\n", |
739 | 740 | "\n", |
740 | 741 | "Now that the images are loaded, let's apply some lightweight pre-processing on them to use them for FID calculation." |
741 | 742 | ], |
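Downstream of that pre-processing, the FID computation itself could look roughly like this `torchmetrics` sketch. The tensor shapes, the `normalize=True` convention (float images in [0, 1]), and the random stand-in tensors are assumptions; the notebook's own cells may do this differently:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Rough sketch: both inputs are float tensors in [0, 1] with shape (N, 3, H, W).
# `normalize=True` tells torchmetrics to expect that range instead of uint8.
fid = FrechetInceptionDistance(normalize=True)

real_images = torch.rand(10, 3, 256, 256)  # stand-in for the preprocessed real images
fake_images = torch.rand(10, 3, 256, 256)  # stand-in for the generated images

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {float(fid.compute()):.2f}")
```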
|