diff --git a/notebooks/clustering_wrap_up_ex_00.ipynb b/notebooks/clustering_wrap_up_ex_00.ipynb
new file mode 100644
index 000000000..078aca2ba
--- /dev/null
+++ b/notebooks/clustering_wrap_up_ex_00.ipynb
@@ -0,0 +1,253 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# \ud83c\udfc1 Wrap-up quiz 4\n",
+    "\n",
+    "This notebook contains the guided project to answer the hands-on questions\n",
+    "corresponding to the module \"Clustering\" of the Associate Practitioner Course.\n",
+    "In this test **we do not have access to your code**: only its output is\n",
+    "evaluated, by means of the multiple choice questions to be answered in the\n",
+    "dedicated User Interface.\n",
+    "\n",
+    "First run the following cell to initialize jupyterlite. Notice that only basic\n",
+    "libraries are available, such as pandas, matplotlib, seaborn and numpy.\n",
+    "Remember that the initial import of libraries can take longer than usual: it\n",
+    "may take around 10-20 seconds for the following cell to run. Please be\n",
+    "patient."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install seaborn\n",
+    "import matplotlib\n",
+    "import numpy\n",
+    "import pandas\n",
+    "import seaborn\n",
+    "import sklearn"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the `periodic_signals.csv` dataset with the following code cell. It\n",
+    "contains readings from 170 industrial sensors installed throughout a\n",
+    "manufacturing facility. Each sensor records the average power consumption (in\n",
+    "watts) of a specific machine, with measurements taken every minute. Different\n",
+    "machines operate with their own characteristic cycles. Rare events, such as\n",
+    "machinery faults or unexpected disturbances, appear as signals with abnormal\n",
+    "frequency patterns. The goal is to identify those disturbances using the tools\n",
+    "we have learned during this module."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "periodic_signals = pd.read_csv(\"../datasets/periodic_signals.csv\")\n",
+    "_ = periodic_signals.iloc[0].plot(\n",
+    "    xlabel=\"time (minutes)\",\n",
+    "    ylabel=\"power (Watts)\",\n",
+    "    title=\"Signal from the first sensor\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's see if we can find one or more stable candidates for the number of\n",
+    "clusters (`n_clusters`) using the silhouette score when resampling the\n",
+    "dataset. For that purpose:\n",
+    "- Create a pipeline consisting of a `RobustScaler` (as it is a good scaling\n",
+    "  option when dealing with outliers), followed by `KMeans` with `n_init=5`.\n",
+    "- You can choose to set `random_state=0` in the `KMeans` step, but fixing it\n",
+    "  or not should not change the conclusions.\n",
+    "- Generate resampled datasets, each consisting of 90% of the data, by using\n",
+    "  `train_test_split` with `train_size=0.9`. Change the `random_state` of\n",
+    "  `train_test_split` to try around 20 different resamplings. You can use the\n",
+    "  `plot_n_clusters_scores` function (or a simplified version of it) inside a\n",
+    "  `for` loop as we did in a previous exercise.\n",
+    "- In each resampling, compute the silhouette score for `n_clusters` varying in\n",
+    "  `range(2, 11)`.\n",
+    "\n",
+    "Using the silhouette score heuristic, select the correct statements:\n",
+    "\n",
+    "- a) 3 or 4 clusters maximize the score and are reasonably stable choices. \n",
+    "- b) 5 or 6 clusters maximize the score and are reasonably stable choices. \n",
+    "- c) 7 or 8 clusters maximize the score and are reasonably stable choices. \n",
+    "- d) Scores in this range of `n_clusters` are always negative, denoting a bad\n",
+    "  clustering model.\n",
+    "- e) Scores in this range of `n_clusters` are always positive, but hint at\n",
+    "  weak to moderate cluster cohesion/separation.\n",
+    "\n",
+    "_Select all answers that apply_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
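+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next cell is a possible sketch of the resampling loop, in case you want\n",
+    "to compare approaches once you have tried your own. It inlines a simplified\n",
+    "version of `plot_n_clusters_scores`; the variable names and the choice of\n",
+    "scoring on the raw features are our own assumptions, not requirements."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A possible sketch, not the only valid solution.\n",
+    "import matplotlib.pyplot as plt\n",
+    "from sklearn.cluster import KMeans\n",
+    "from sklearn.metrics import silhouette_score\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.pipeline import make_pipeline\n",
+    "from sklearn.preprocessing import RobustScaler\n",
+    "\n",
+    "model = make_pipeline(RobustScaler(), KMeans(n_init=5, random_state=0))\n",
+    "n_clusters_values = range(2, 11)\n",
+    "\n",
+    "for seed in range(20):  # around 20 different resamplings\n",
+    "    data_subsample, _ = train_test_split(\n",
+    "        periodic_signals, train_size=0.9, random_state=seed\n",
+    "    )\n",
+    "    scores = []\n",
+    "    for n_clusters in n_clusters_values:\n",
+    "        model.set_params(kmeans__n_clusters=n_clusters)\n",
+    "        cluster_labels = model.fit_predict(data_subsample)\n",
+    "        scores.append(silhouette_score(data_subsample, cluster_labels))\n",
+    "    plt.plot(n_clusters_values, scores, color=\"tab:blue\", alpha=0.2)\n",
+    "plt.xlabel(\"n_clusters\")\n",
+    "plt.ylabel(\"silhouette score\")\n",
+    "_ = plt.title(\"Silhouette score across 20 resamplings\")"
+   ]
+  },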
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Set `n_clusters=8` in the `KMeans` step of your previous pipeline for the rest\n",
+    "of this quiz. We are going to define an `outlier_score` using the **minimum**\n",
+    "distance to **any** centroid (computed with the `fit_transform` method of the\n",
+    "pipeline).\n",
+    "\n",
+    "What are the indices of the 5 signals farthest from any centroid?\n",
+    "\n",
+    "- a) [ 77 32 112 105 101]\n",
+    "- b) [ 92 49 101 132 146]\n",
+    "- c) [ 80 49 121 150 101]\n",
+    "- d) [ 64 98 118 163 121]\n",
+    "\n",
+    "_Select a single answer_\n",
+    "\n",
+    "Hint: You can make use of\n",
+    "[`numpy.min`](https://numpy.org/doc/stable/reference/generated/numpy.min.html)\n",
+    "and\n",
+    "[`numpy.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html).\n",
+    "Also, remember that the output of `fit_transform` is a numpy array of shape\n",
+    "`(n_samples, n_clusters)`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create an `HDBSCAN` model (no need for scaling) with `min_cluster_size=10`.\n",
+    "How many clusters (excluding the noise label, which is not a cluster) are\n",
+    "found by this model?\n",
+    "\n",
+    "- a) 5\n",
+    "- b) 6\n",
+    "- c) 7\n",
+    "- d) 8\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "How many signals are identified as noise?\n",
+    "\n",
+    "- a) 3\n",
+    "- b) 5\n",
+    "- c) 7\n",
+    "- d) 9\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
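+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As before, the next cell is a possible sketch for the three questions above,\n",
+    "to be read after attempting them yourself. It reuses the `model` pipeline and\n",
+    "the `periodic_signals` dataframe from earlier cells; the names\n",
+    "`outlier_score`, `hdbscan_labels` and `n_noise` are our own choices."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A possible sketch for the three questions above; reuses `model` and\n",
+    "# `periodic_signals` from earlier cells.\n",
+    "import numpy as np\n",
+    "from sklearn.cluster import HDBSCAN\n",
+    "\n",
+    "# `fit_transform` on the pipeline returns the distances of each signal to\n",
+    "# each of the 8 centroids; the minimum over centroids is the outlier score.\n",
+    "model.set_params(kmeans__n_clusters=8)\n",
+    "distances_to_centroids = model.fit_transform(periodic_signals)\n",
+    "outlier_score = np.min(distances_to_centroids, axis=1)\n",
+    "\n",
+    "# Indices of the 5 signals farthest from any centroid (in increasing order\n",
+    "# of outlier score).\n",
+    "print(np.argsort(outlier_score)[-5:])\n",
+    "\n",
+    "# HDBSCAN labels noise as -1, which is excluded from the cluster count.\n",
+    "hdbscan_labels = HDBSCAN(min_cluster_size=10).fit_predict(periodic_signals)\n",
+    "n_clusters_found = len(np.unique(hdbscan_labels[hdbscan_labels != -1]))\n",
+    "n_noise = (hdbscan_labels == -1).sum()\n",
+    "print(f\"clusters: {n_clusters_found}, noise points: {n_noise}\")"
+   ]
+  },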
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A priori, we don't know whether the signals are isotropic or follow a\n",
+    "Gaussian distribution in the feature space (i.e. whether they form spherical\n",
+    "blobs). Because of that, we don't know whether a centroid-based or a\n",
+    "density-based clustering is more suitable. We would like to compare the\n",
+    "results from both models, but we know that the presence of outliers makes the\n",
+    "silhouette score tricky to interpret. We can still use other metrics, such as\n",
+    "the Adjusted Mutual Information (AMI), to compare both models.\n",
+    "\n",
+    "But first we need k-means to behave similarly to HDBSCAN. For that purpose,\n",
+    "we can identify the points that are too far from any centroid as outliers,\n",
+    "using the `outlier_score` as defined before. Instead of setting a fixed\n",
+    "distance threshold, we can flag the `n_outliers` signals with the highest\n",
+    "outlier scores as `-1`.\n",
+    "\n",
+    "To do so:\n",
+    "\n",
+    "- Cluster your signals with `KMeans` (using `fit_predict`) to get `kmeans_labels`.\n",
+    "- For a range of values of `n_outliers`, re-label the `n_outliers` signals\n",
+    "  with the highest `outlier_score` as `-1`.\n",
+    "- Compute and plot the AMI between this modified KMeans labeling and the\n",
+    "  HDBSCAN cluster labels as a function of `n_outliers`.\n",
+    "\n",
+    "If we denote by `n_noise` the number of signals identified as noise by\n",
+    "HDBSCAN, select the true statement:\n",
+    "\n",
+    "- a) AMI reaches a maximum when `n_outliers` < `n_noise`, meaning that some\n",
+    "  points marked as noise by HDBSCAN are not clearly isolated from a centroid.\n",
+    "- b) AMI reaches a maximum when `n_outliers` = `n_noise`, meaning that the\n",
+    "  two models agree most strongly there.\n",
+    "- c) AMI reaches a maximum when `n_outliers` > `n_noise`, meaning that\n",
+    "  k-means has created small clusters (with fewer than `min_cluster_size`\n",
+    "  samples) that match what HDBSCAN considers noise.\n",
+    "- d) AMI is too close to zero, indicating that any agreement between the\n",
+    "  models is mostly random.\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "main_language": "python"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file