probabl-ai · ArturoAmorQ · Jul 15, 2025 · Jul 15, 2025
diff --git a/notebooks/clustering_wrap_up_ex_00.ipynb b/notebooks/clustering_wrap_up_ex_00.ipynb
@@ -0,0 +1,253 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# \ud83c\udfc1 Wrap-up quiz 4\n",
+    "\n",
+    "This notebook contains the guided project to answer the hands-on questions\n",
+    "corresponding to the module \"Clustering\" of the Associate Practitioner Course.\n",
+    "In this test **we do not have access to your code**. Only it's output is\n",
+    "evaluated by using the multiple choice questions, to be answered in the\n",
+    "dedicated User Interface.\n",
+    "\n",
+    "First run the following cell to initialize jupyterlite. Notice that only basic\n",
+    "libraries are available, such as pandas, matplotlib, seaborn and numpy.\n",
+    "Remember that the initial import of libraries can take longer than usual, it\n",
+    "may take around 10-20 seconds for the following cell to run. Please be\n",
+    "patient."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install seaborn\n",
+    "import matplotlib\n",
+    "import numpy\n",
+    "import pandas\n",
+    "import seaborn\n",
+    "import sklearn"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the `periodic_signals.csv` dataset with the following cell of code. It\n",
+    "contains readings from 170 industrial sensors installed throughout a\n",
+    "manufacturing facility. Each sensor records the average power consumption (in\n",
+    "watts) every minute for a specific machine, with measurements taken every\n",
+    "minute. Different machines operate with their own characteristic cycles. Rare\n",
+    "events, such as machinery faults or unexpected disturbances, appear as signals\n",
+    "with abnormal frequency patterns. The goal is to identify those disturbances\n",
+    "using the tools we have learned during this module."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "periodic_signals = pd.read_csv(\"../datasets/periodic_signals.csv\")\n",
+    "_ = periodic_signals.iloc[0].plot(\n",
+    "    xlabel=\"time (minutes)\",\n",
+    "    ylabel=\"power (Watts)\",\n",
+    "    title=\"Signal from the first sensor\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's see if we can find one or more stable candidates for the number of\n",
+    "clusters (`n_clusters`) using the silhouette score when resampling the\n",
+    "dataset. For such purpose:\n",
+    "- Create a pipeline consisting of a `RobustScaler` (as it is a good scaling\n",
+    "  option when dealing with outliers), followed by `KMeans` with `n_init=5`.\n",
+    "- You can choose to set the `random_state=0` value of the `KMeans` step, but\n",
+    "  fixing it or not should not change the conclusions.\n",
+    "- Generate randomly resampled data consisting of 90% of the data by using\n",
+    "  `train_test_split` with `train_size=0.9`. Change the `random_state` in the\n",
+    "  `train_test_split` to try around 20 different resamplings. You can use the\n",
+    "  `plot_n_clusters_scores` function (or a simplified version of it) inside a\n",
+    "  `for` loop as we did in a previous exercise.\n",
+    "- In each resampling, compute the silhouette score for `n_clusters` varying in\n",
+    "  `range(2, 11)`.\n",
+    "\n",
+    "Using the silhouette score heuristics, select the correct statements:\n",
+    "\n",
+    "- a) 3 or 4 clusters maximize the score and are resonably stable choices. \n",
+    "- b) 5 or 6 clusters maximize the score and are resonably stable choices. \n",
+    "- c) 7 or 8 clusters maximize the score and are resonably stable choices. \n",
+    "- d) Scores in this range of `n_clusters` are always negative, denoting a bad\n",
+    "  clustering model.\n",
+    "- e) Scores in this range of `n_clusters` are always positive, but hint to a\n",
+    "  weak to moderate cluster cohesion/separation.\n",
+    "\n",
+    "_Select all answers that apply_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Set `n_clusters=8` in the `KMeans` step of your previous pipeline for the rest\n",
+    "of this quiz. We are going to define an `outlier_score` using the **minimum**\n",
+    "distance to **any** centroid (using the `fit_transform` method of the\n",
+    "pipeline).\n",
+    "\n",
+    "What are the indices of the 5 signals that are the farthest from any centroid?\n",
+    "\n",
+    "- a) [ 77 32 112 105 101]\n",
+    "- b) [ 92 49 101 132 146]\n",
+    "- c) [ 80 49 121 150 101]\n",
+    "- d) [ 64 98 118 163 121]\n",
+    "\n",
+    "_Select a single answer_\n",
+    "\n",
+    "Hint: You can make use of\n",
+    "[`numpy.min`](https://numpy.org/doc/stable/reference/generated/numpy.min.html)\n",
+    "and\n",
+    "[`numpy.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html).\n",
+    "Also, remember that the output of `fit_transform` is a numpy array of shape\n",
+    "`(n_samples, n_clusters)`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create an `HDBSCAN` model (no need for scaling) with `min_cluster_size=10`.\n",
+    "How many clusters (excluding the noise label, which is not a cluster) are\n",
+    "found by this model?\n",
+    "\n",
+    "- a) 5\n",
+    "- b) 6\n",
+    "- c) 7\n",
+    "- d) 8\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "How many signals are identified as noise?\n",
+    "\n",
+    "- a) 3\n",
+    "- b) 5\n",
+    "- c) 7\n",
+    "- d) 9\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A priori we don't know if the signals are isotropic or follow a gaussian\n",
+    "distribution in the feature space (i.e. if they form spherical blobs). Because\n",
+    "of that, we don't know if a centroid-based or a density-based clustering is\n",
+    "more suitable. We would like to compare the results from both models, but we\n",
+    "know that the presence of outliers makes the silhouette score tricky to\n",
+    "interpret. We can still use other metrics, such as Adjusted Mutual Information\n",
+    "(AMI), to compare both models.\n",
+    "\n",
+    "But first we need k-means to have a similar behavior to HDBSCAN. For such\n",
+    "purpose, we can identify the points that are too far from any centroid as\n",
+    "outliers using the `outlier_score` as defined before. Instead of setting a\n",
+    "fixed distance threshold, we can flag the `n_outliers` signals with the\n",
+    "highest outlier scores as `-1`.\n",
+    "\n",
+    "For such purpose:\n",
+    "\n",
+    "- Cluster your signals with `KMeans` (using `fit_predict`) to get `kmeans_labels`.\n",
+    "- For a range of values of `n_outliers`, re-label the `n_outliers` with highest\n",
+    "  `outlier_score` to `-1`.\n",
+    "- Compute and plot the AMI between this modified KMeans labeling and the\n",
+    "  HDBSCAN cluster labels as a function of `n_outliers`.\n",
+    "\n",
+    "If we denote by `n_noise` the number of signals identified as noise by\n",
+    "HDBSCAN, select the true statements:\n",
+    "\n",
+    "- a) AMI reaches a maximum when `n_outliers` < `n_noise`, some points marked\n",
+    "  as noise by HDBSCAN are not clearly isolated from a centroid.\n",
+    "- b) AMI reaches a maximum when `n_outliers` = `n_noise`, the two models\n",
+    "  most strongly agree.\n",
+    "- c) AMI reaches a maximum when `n_outliers` > `n_noise`, k-means has created\n",
+    "  small clusters (with fewer than `min_cluster_size` samples) that match what\n",
+    "  HDBSCAN considers noise.\n",
+    "- d) AMI is too close to zero, indicating coincidences between models\n",
+    "  are mostly random.\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "main_language": "python"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}