Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
253 changes: 253 additions & 0 deletions notebooks/clustering_wrap_up_ex_00.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# \ud83c\udfc1 Wrap-up quiz 4\n",
"\n",
"This notebook contains the guided project to answer the hands-on questions\n",
"corresponding to the module \"Clustering\" of the Associate Practitioner Course.\n",
"In this test **we do not have access to your code**. Only it's output is\n",
"evaluated by using the multiple choice questions, to be answered in the\n",
"dedicated User Interface.\n",
"\n",
"First run the following cell to initialize jupyterlite. Notice that only basic\n",
"libraries are available, such as pandas, matplotlib, seaborn and numpy.\n",
"Remember that the initial import of libraries can take longer than usual, it\n",
"may take around 10-20 seconds for the following cell to run. Please be\n",
"patient."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install seaborn\n",
"import matplotlib\n",
"import numpy\n",
"import pandas\n",
"import seaborn\n",
"import sklearn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the `periodic_signals.csv` dataset with the following cell of code. It\n",
"contains readings from 170 industrial sensors installed throughout a\n",
"manufacturing facility. Each sensor records the average power consumption (in\n",
"watts) every minute for a specific machine, with measurements taken every\n",
"minute. Different machines operate with their own characteristic cycles. Rare\n",
"events, such as machinery faults or unexpected disturbances, appear as signals\n",
"with abnormal frequency patterns. The goal is to identify those disturbances\n",
"using the tools we have learned during this module."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"periodic_signals = pd.read_csv(\"../datasets/periodic_signals.csv\")\n",
"_ = periodic_signals.iloc[0].plot(\n",
" xlabel=\"time (minutes)\",\n",
" ylabel=\"power (Watts)\",\n",
" title=\"Signal from the first sensor\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if we can find one or more stable candidates for the number of\n",
"clusters (`n_clusters`) using the silhouette score when resampling the\n",
"dataset. For such purpose:\n",
"- Create a pipeline consisting of a `RobustScaler` (as it is a good scaling\n",
" option when dealing with outliers), followed by `KMeans` with `n_init=5`.\n",
"- You can choose to set the `random_state=0` value of the `KMeans` step, but\n",
" fixing it or not should not change the conclusions.\n",
"- Generate randomly resampled data consisting of 90% of the data by using\n",
" `train_test_split` with `train_size=0.9`. Change the `random_state` in the\n",
" `train_test_split` to try around 20 different resamplings. You can use the\n",
" `plot_n_clusters_scores` function (or a simplified version of it) inside a\n",
" `for` loop as we did in a previous exercise.\n",
"- In each resampling, compute the silhouette score for `n_clusters` varying in\n",
" `range(2, 11)`.\n",
"\n",
"Using the silhouette score heuristics, select the correct statements:\n",
"\n",
"- a) 3 or 4 clusters maximize the score and are resonably stable choices. \n",
"- b) 5 or 6 clusters maximize the score and are resonably stable choices. \n",
"- c) 7 or 8 clusters maximize the score and are resonably stable choices. \n",
"- d) Scores in this range of `n_clusters` are always negative, denoting a bad\n",
" clustering model.\n",
"- e) Scores in this range of `n_clusters` are always positive, but hint to a\n",
" weak to moderate cluster cohesion/separation.\n",
"\n",
"_Select all answers that apply_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set `n_clusters=8` in the `KMeans` step of your previous pipeline for the rest\n",
"of this quiz. We are going to define an `outlier_score` using the **minimum**\n",
"distance to **any** centroid (using the `fit_transform` method of the\n",
"pipeline).\n",
"\n",
"What are the indices of the 5 signals that are the farthest from any centroid?\n",
"\n",
"- a) [ 77 32 112 105 101]\n",
"- b) [ 92 49 101 132 146]\n",
"- c) [ 80 49 121 150 101]\n",
"- d) [ 64 98 118 163 121]\n",
"\n",
"_Select a single answer_\n",
"\n",
"Hint: You can make use of\n",
"[`numpy.min`](https://numpy.org/doc/stable/reference/generated/numpy.min.html)\n",
"and\n",
"[`numpy.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html).\n",
"Also, remember that the output of `fit_transform` is a numpy array of shape\n",
"`(n_samples, n_clusters)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an `HDBSCAN` model (no need for scaling) with `min_cluster_size=10`.\n",
"How many clusters (excluding the noise label, which is not a cluster) are\n",
"found by this model?\n",
"\n",
"- a) 5\n",
"- b) 6\n",
"- c) 7\n",
"- d) 8\n",
"\n",
"_Select a single answer_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many signals are identified as noise?\n",
"\n",
"- a) 3\n",
"- b) 5\n",
"- c) 7\n",
"- d) 9\n",
"\n",
"_Select a single answer_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A priori we don't know if the signals are isotropic or follow a gaussian\n",
"distribution in the feature space (i.e. if they form spherical blobs). Because\n",
"of that, we don't know if a centroid-based or a density-based clustering is\n",
"more suitable. We would like to compare the results from both models, but we\n",
"know that the presence of outliers makes the silhouette score tricky to\n",
"interpret. We can still use other metrics, such as Adjusted Mutual Information\n",
"(AMI), to compare both models.\n",
"\n",
"But first we need k-means to have a similar behavior to HDBSCAN. For such\n",
"purpose, we can identify the points that are too far from any centroid as\n",
"outliers using the `outlier_score` as defined before. Instead of setting a\n",
"fixed distance threshold, we can flag the `n_outliers` signals with the\n",
"highest outlier scores as `-1`.\n",
"\n",
"For such purpose:\n",
"\n",
"- Cluster your signals with `KMeans` (using `fit_predict`) to get `kmeans_labels`.\n",
"- For a range of values of `n_outliers`, re-label the `n_outliers` with highest\n",
" `outlier_score` to `-1`.\n",
"- Compute and plot the AMI between this modified KMeans labeling and the\n",
" HDBSCAN cluster labels as a function of `n_outliers`.\n",
"\n",
"If we denote by `n_noise` the number of signals identified as noise by\n",
"HDBSCAN, select the true statements:\n",
"\n",
"- a) AMI reaches a maximum when `n_outliers` < `n_noise`, some points marked\n",
" as noise by HDBSCAN are not clearly isolated from a centroid.\n",
"- b) AMI reaches a maximum when `n_outliers` = `n_noise`, the two models\n",
" most strongly agree.\n",
"- c) AMI reaches a maximum when `n_outliers` > `n_noise`, k-means has created\n",
" small clusters (with fewer than `min_cluster_size` samples) that match what\n",
" HDBSCAN considers noise.\n",
"- d) AMI is too close to zero, indicating coincidences between models\n",
" are mostly random.\n",
"\n",
"_Select a single answer_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
}
],
"metadata": {
"jupytext": {
"main_language": "python"
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}