|
95 | 95 | "source": [ |
96 | 96 | "from datasets import load_dataset\n", |
97 | 97 | "\n", |
98 | | - "prompts = load_dataset(\"nateraw/parti-prompts\", split=\"train\")\n", |
99 | | - "prompts = prompts.shuffle()\n", |
100 | | - "sample_prompts = [prompts[i][\"Prompt\"] for i in range(5)]\n", |
| 98 | + "# prompts = load_dataset(\"nateraw/parti-prompts\", split=\"train\")\n", |
| 99 | + "# prompts = prompts.shuffle()\n", |
| 100 | + "# sample_prompts = [prompts[i][\"Prompt\"] for i in range(5)]\n", |
101 | 101 | "\n", |
| 102 | + "# Fixing these sample prompts in the interest of reproducibility.\n", |
102 | 103 | "sample_prompts = [\n", |
103 | 104 | " \"a corgi\",\n", |
104 | 105 | " \"a hot air balloon with a yin-yang symbol, with the moon visible in the daytime sky\",\n", |
|
169 | 170 | "\n", |
170 | 171 | "> 💡 **Tip:** It is useful to look at some inference samples while a model is training to measure the \n", |
171 | 172 | "training progress. In our [training scripts](https://github.com/huggingface/diffusers/tree/main/examples/), we support this utility with additional support for\n", |
172 | | - "logging to TensorBoard and Weights and Biases." |
| 173 | + "logging to TensorBoard and Weights & Biases." |
173 | 174 | ], |
174 | 175 | "metadata": { |
175 | 176 | "id": "tBQjJ36RI-gD" |
|
178 | 179 | { |
179 | 180 | "cell_type": "markdown", |
180 | 181 | "source": [ |
181 | | - "## Quantitative\n", |
| 182 | + "## Quantitative Evaluation\n", |
182 | 183 | "\n", |
183 | 184 | "In this section, we will walk you through how to evaluate three different diffusion pipelines using:\n", |
184 | 185 | "\n", |
|
268 | 269 | { |
269 | 270 | "cell_type": "markdown", |
270 | 271 | "source": [ |
271 | | - "In the above example, we generated one image per prompt. If we generated multiple images per prompt, we could uniformly sample just one from the pool of generated images.\n", |
| 272 | + "In the above example, we generated one image per prompt. If we generated multiple images per prompt, we would have to take the average score from the generated images per prompt.\n", |
272 | 273 | "\n", |
273 | 274 | "Now, if we wanted to compare two checkpoints compatible with the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) we should pass a generator while calling the pipeline. First, we generate images with a fixed seed with the [v1-4 Stable Diffusion checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4):\n" |
274 | 275 | ], |
|
660 | 661 | "\n", |
661 | 662 | "We can use these metrics for similar pipelines such as the[`StableDiffusionPix2PixZeroPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/pix2pix_zero#diffusers.StableDiffusionPix2PixZeroPipeline)`.\n", |
662 | 663 | "\n", |
663 | | - "> Both CLIP score and CLIP direction similarity rely on the CLIP model, which can make the evaluations biased.\n", |
| 664 | + "> **Info**: Both CLIP score and CLIP direction similarity rely on the CLIP model, which can make the evaluations biased.\n", |
664 | 665 | "\n", |
665 | 666 | "***Extending metrics like IS, FID (discussed later), or KID can be difficult*** when the model under evaluation was pre-trained on a large image-captioning dataset (such as the [LAION-5B dataset](https://laion.ai/blog/laion-5b/)). This is because underlying these metrics is an InceptionNet (pre-trained on the ImageNet-1k dataset) used for extracting intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with the pre-training dataset of InceptionNet, so it is not a good candidate here for feature extraction.\n", |
666 | 667 | "\n", |
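For reference, the CLIP directional-similarity idea mentioned above can be sketched as follows. The checkpoint choice and the function interface are assumptions for illustration, and the notebook's own implementation may differ:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-large-patch14"  # assumed checkpoint
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

@torch.no_grad()
def directional_similarity(original_image, edited_image, original_caption, edited_caption):
    # Encode images and captions with CLIP, then compare the *direction* of the edit
    # in image space with the direction of the edit in text space.
    img_inputs = processor(images=[original_image, edited_image], return_tensors="pt")
    txt_inputs = processor(text=[original_caption, edited_caption], padding=True, return_tensors="pt")
    img_feats = F.normalize(model.get_image_features(**img_inputs), dim=-1)
    txt_feats = F.normalize(model.get_text_features(**txt_inputs), dim=-1)
    img_dir = img_feats[1] - img_feats[0]
    txt_dir = txt_feats[1] - txt_feats[0]
    return F.cosine_similarity(img_dir, txt_dir, dim=-1).item()
```

A higher value means the change applied to the image moves its CLIP embedding in the same direction as the change described by the captions.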
|
675 | 676 | "source": [ |
676 | 677 | "### Class-conditioned image generation\n", |
677 | 678 | "\n", |
678 | | - "Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID ([Heusel et al.](https://arxiv.org/abs/1706.08500)). We show how to compute it with the [`DiTPipeline`], which uses the [DiT model](https://arxiv.org/abs/2212.09748) under the hood.\n", |
| 679 | + "Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID ([Heusel et al.](https://arxiv.org/abs/1706.08500)). We show how to compute it with the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit), which uses the [DiT model](https://arxiv.org/abs/2212.09748) under the hood.\n", |
679 | 680 | "\n", |
680 | 681 | "FID aims to measure how similar are two datasets of images. As per [this resource](https://mmgeneration.readthedocs.io/en/latest/quick_run.html#fid):\n", |
681 | 682 | "\n", |
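For reference, FID has a standard closed form as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images (this is the general definition, not text taken from the linked resource):

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features computed on the real and generated sets, respectively.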
|
735 | 736 | { |
736 | 737 | "cell_type": "markdown", |
737 | 738 | "source": [ |
738 | | - "These images are from the following Imagenet-1k classes: \"cassette_player\", \"chain_saw\", \"church\", \"gas_pump\", \"parachute\", and \"tench\".\n", |
| 739 | + "These are 10 images from the following Imagenet-1k classes: \"cassette_player\", \"chain_saw\" (x2), \"church\", \"gas_pump\" (x3), \"parachute\" (x2), and \"tench\".\n", |
739 | 740 | "\n", |
740 | 741 | "Now that the images are loaded, let's apply some lightweight pre-processing on them to use them for FID calculation." |
741 | 742 | ], |
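Downstream of that pre-processing, the FID computation itself could look roughly like this `torchmetrics` sketch. The tensor shapes, the `normalize=True` convention (float images in [0, 1]), and the random stand-in tensors are assumptions; the notebook's own cells may do this differently:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Rough sketch: both inputs are float tensors in [0, 1] with shape (N, 3, H, W).
# `normalize=True` tells torchmetrics to expect that range instead of uint8.
fid = FrechetInceptionDistance(normalize=True)

real_images = torch.rand(10, 3, 256, 256)  # stand-in for the preprocessed real images
fake_images = torch.rand(10, 3, 256, 256)  # stand-in for the generated images

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {float(fid.compute()):.2f}")
```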
|