Commits (50)
2ec9530
WIP Add module about clustering
May 22, 2025
8ab8ff8
Iter on Kmeans exercise
May 27, 2025
d828ada
Synch exercise notebooks
May 27, 2025
d94eb14
Add notebooks on hdbscan and feature engineering
May 27, 2025
fd41577
Reworked the k-means intro notebook to use penguins dataset
ogrisel May 28, 2025
90bdfed
Rerender the first notebook
ogrisel May 28, 2025
8ccd658
Add some missing cell markers
ogrisel May 28, 2025
0a2dfa3
Rerender the first notebook
ogrisel May 28, 2025
83bc291
More missing markers
ogrisel May 28, 2025
60d32eb
Rerender the first notebook
ogrisel May 28, 2025
2621cec
Improve phrasing / fix typos
ogrisel May 28, 2025
57fafec
Typo
ogrisel May 28, 2025
51a0d13
Rerender the first notebook
ogrisel May 28, 2025
6b26222
Iter on Olivier's work
May 30, 2025
fe58a3e
General rewording
Jun 2, 2025
7374ae5
Apply suggestions from code review
ArturoAmorQ Jun 2, 2025
fd03b1a
Rephrasing in cluster_kmeans_sol_01.py
ogrisel Jun 3, 2025
da9a3ec
Resynchronize exercise and fix CI
Jun 3, 2025
d4ad40c
Wording
Jun 3, 2025
8ab3e29
Use MAE to score predicted house prices
Jun 4, 2025
52f244a
Solve plotly DeprecationWarning
Jun 4, 2025
dec3453
Prefer make_column_transformer as per #831
Jun 4, 2025
00c41a7
Iter on hdbscan notebook
Jun 5, 2025
ffe1855
Remove redundant paragraph
Jun 5, 2025
26cd3d2
Rename exercise and solution
Jun 5, 2025
50e9bc0
Add exercise and solution using AMI
Jun 5, 2025
d7e03e6
Fix exercise
Jun 6, 2025
900f0da
Small improvements to the solution of exercise 02
ogrisel Jun 6, 2025
87d438f
Add the skrub dependency
ogrisel Jun 6, 2025
166868c
Expand analysis a bit
ogrisel Jun 6, 2025
74245d3
Improvements in the HDBSCAN notebook
ogrisel Jun 6, 2025
9bfe2f2
Reworded analysis of the BBC text clustering notebook + use cross-val…
ogrisel Jun 6, 2025
a765e53
Improvements in the supervised metrics notebook
ogrisel Jun 6, 2025
024545f
Add discussion on silhouette for hdbscan
Jun 9, 2025
2647274
Fix warning and plot not rendering
Jun 9, 2025
bd32b87
Add intro, overview and sections
Jun 10, 2025
27f4e52
Iter discussion on silhouette for hdbscan
Jun 10, 2025
b8f6646
Tweaks
Jun 10, 2025
d328631
Add first quiz on clustering and related images
Jun 11, 2025
c32eabe
Wording tweaks
Jun 11, 2025
89110ea
Add second quiz on clustering
Jun 20, 2025
d4b975e
Apply suggestions from code review
ArturoAmorQ Jun 24, 2025
b016d8b
Update jupyter-book/clustering/clustering_module_take_away.md
ArturoAmorQ Jun 24, 2025
a72be00
Synchronize quizzes from review
Jun 24, 2025
60a58c4
Merge branch 'clustering_module' of github.com:ArturoAmorQ/scikit-lea…
Jun 24, 2025
aff2dba
Synchronize notebooks
Jun 24, 2025
0c161ee
Add clustering wrap-up quiz
Jul 15, 2025
54ff951
Fix bug in wrap-up quiz
Jul 15, 2025
3fa71e9
Add wrap-up quiz to toc
Jul 15, 2025
5c7b28c
Fix a couple of bugs
Aug 12, 2025
1,251 changes: 1,251 additions & 0 deletions datasets/bbc_news.csv

Large diffs are not rendered by default.

171 changes: 171 additions & 0 deletions datasets/periodic_signals.csv

Large diffs are not rendered by default.

5,882 changes: 5,882 additions & 0 deletions datasets/rfm_segmentation.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions environment-dev.yml
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ dependencies:
- matplotlib-base
- seaborn >= 0.13
- plotly >= 5.10
- skrub
- jupytext
- beautifulsoup4
- IPython
1 change: 1 addition & 0 deletions environment.yml
@@ -8,6 +8,7 @@ dependencies:
- pandas >= 1
- matplotlib-base
- seaborn >= 0.13
- skrub
- jupyterlab
- notebook
- plotly >= 5.10
1,004 changes: 1,004 additions & 0 deletions figures/clustering_quiz_kmeans_not_scaled.svg
980 changes: 980 additions & 0 deletions figures/clustering_quiz_kmeans_scaled.svg
19 changes: 19 additions & 0 deletions jupyter-book/_toc.yml
@@ -236,3 +236,22 @@ parts:
chapters:
- file: python_scripts/dev_features_importance
- file: interpretation/interpretation_quiz
- caption: 🚧 Clustering
chapters:
- file: clustering/clustering_module_intro
- file: clustering/clustering_kmeans_index
sections:
- file: python_scripts/clustering_kmeans
- file: python_scripts/clustering_ex_01
- file: python_scripts/clustering_sol_01
- file: python_scripts/clustering_supervised_metrics
- file: python_scripts/clustering_ex_02
- file: python_scripts/clustering_sol_02
- file: clustering/clustering_quiz_m4_01
- file: clustering/clustering_assumptions_index
sections:
- file: python_scripts/clustering_hdbscan
- file: python_scripts/clustering_transformer
- file: clustering/clustering_quiz_m4_02
- file: clustering/clustering_wrap_up_quiz
- file: clustering/clustering_module_take_away
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_assumptions_index.md
@@ -0,0 +1,5 @@
# Clustering when k-means assumptions fail

```{tableofcontents}

```
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_kmeans_index.md
@@ -0,0 +1,5 @@
# K-means

```{tableofcontents}

```
56 changes: 56 additions & 0 deletions jupyter-book/clustering/clustering_module_intro.md
@@ -0,0 +1,56 @@
# Module overview

## What you will learn

<!-- Give in plain English what the module is about -->

In the previous modules, we introduced the development, tuning and evaluation
of **supervised** machine learning models and pipelines.

In this module we present an **unsupervised** learning task, namely clustering.
In particular, we focus on the k-means algorithm and consider how to evaluate
such models through the concept of cluster stability and through unsupervised
metrics such as the silhouette score and inertia. We also introduce supervised
clustering metrics that leverage annotated data to assess clustering quality.

Finally, we discuss what to do when the assumptions of k-means do not hold, such
as using HDBSCAN for non-convex clusters, and show how k-means can still be
useful as a feature engineering step in a supervised learning pipeline, by using
distances to centroids.
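The centroid-distance idea can be sketched as follows. This is a minimal
illustration on synthetic data, not the module's notebooks; the `Ridge`
regressor and the toy target are arbitrary placeholders:

```python
# Minimal sketch: k-means as a feature-engineering step. Used as a pipeline
# transformer, KMeans.transform returns the Euclidean distance of each sample
# to every centroid, so the final estimator receives n_clusters features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
y = X[:, 0] + np.random.RandomState(0).normal(size=200)  # synthetic target

model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),  # transform -> distances
    Ridge(),
)
model.fit(X, y)

# The sub-pipeline up to (and including) KMeans exposes the derived features.
distance_features = model[:-1].transform(X)
print(distance_features.shape)  # one distance column per centroid
```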


## Before getting started

<!-- Give the required skills for the module -->

The technical skills required to follow this module are:

- skills acquired during the "The Predictive Modeling Pipeline" module with
basic usage of scikit-learn;
- skills acquired during the "Selecting The Best Model" module, mainly the
  concept of validation curves and the notion of model stability.

<!-- Point to resources to learning these skills -->

## Objectives and time schedule

<!-- Give the learning objectives -->

The objectives of this module are the following:

- apply k-means clustering and assess its behavior across different settings;
- evaluate cluster quality using unsupervised metrics such as silhouette score
and WCSS (also known as inertia);
- interpret and compute supervised clustering metrics (e.g., AMI, ARI,
V-measure) when ground truth labels are available;
- understand the limitations of k-means and identify cases where its assumptions
(e.g., convex, isotropic clusters) do not hold;
- use HDBSCAN as an alternative clustering method suited for irregular or
non-convex cluster shapes;
- integrate k-means into a supervised learning pipeline by using distances to
centroids as features.
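The metric-related objectives above can be previewed with a short sketch. This
uses a synthetic blob dataset rather than the module's data, and is only an
illustration of the API:

```python
# Sketch: unsupervised metrics (inertia/WCSS, silhouette) need no labels;
# supervised metrics (AMI, ARI) compare predictions to ground truth.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    silhouette_score,
)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Unsupervised metrics: computed from the data and labels alone.
print(f"inertia (WCSS): {km.inertia_:.1f}")
print(f"silhouette: {silhouette_score(X, labels):.2f}")

# Supervised metrics: require the ground-truth labels y_true.
print(f"AMI: {adjusted_mutual_info_score(y_true, labels):.2f}")
print(f"ARI: {adjusted_rand_score(y_true, labels):.2f}")
```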

<!-- Give the investment in time -->

The estimated time to go through this module is about 6 hours.
26 changes: 26 additions & 0 deletions jupyter-book/clustering/clustering_module_take_away.md
@@ -0,0 +1,26 @@
# Main take-away

## Wrap-up

<!-- Quick wrap-up for the module -->

In this module, we presented the framework used in unsupervised learning with
clustering, focusing on k-means and how to evaluate its results using both
unsupervised and supervised metrics.

We explored the concept of cluster stability, addressed the limitations of
k-means when clusters are not convex, and introduced HDBSCAN as an alternative.

Finally, we showed how clustering can be integrated into supervised pipelines
to perform unsupervised feature engineering.

## To go further

<!-- Some extra links of content to go further -->

You can refer to the following scikit-learn examples, which are related to
the concepts covered in this module:

- [Adjustment for chance in clustering performance evaluation](https://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html)
- [Demonstration of k-means assumptions](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)
- [Clustering text documents using k-means](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html)
89 changes: 89 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_01.md
@@ -0,0 +1,89 @@
# ✅ Quiz M4.01

```{admonition} Question
Imagine you work for a music streaming platform that hosts a vast library of
songs, playlists, and podcasts. You have access to detailed listening data from
millions of users. For each user, you know their most-listened genres, the
devices they use, their average session length, and how often they explore new
content.

You want to segment users based on their listening patterns to improve
personalized recommendations, without relying on rigid, predefined labels like
"pop fan" or "casual listener" which may fail to capture the complexity of
their behavior.

What kind of problem are you dealing with?

- a) a supervised task
- b) an unsupervised task
- c) a classification task
- d) a clustering task

_Select all answers that apply_
```

+++

```{admonition} Question
The plots below show the cluster labels found by k-means with 3 clusters; the
two runs differ only in whether a feature scaling step was applied. Based on
this, which conclusion can be drawn?

![K-means on original features](../../figures/clustering_quiz_kmeans_not_scaled.svg)
![K-means on scaled features](../../figures/clustering_quiz_kmeans_scaled.svg)

- a) without scaling, cluster assignment is dominated by the feature in the vertical axis
- b) without scaling, cluster assignment is dominated by the feature in the horizontal axis
- c) without scaling, both features contribute equally to cluster assignment

_Select a single answer_
```

+++

```{admonition} Question
Which of the following statements correctly describe factors that affect the
stability of k-means clustering across different resampling iterations of the data?

- a) K-means can produce different results on resampled datasets due to
sensitivity to initialization.
- b) If data is unevenly distributed, the stability improves when increasing the
parameter `n_init` in the "k-means++" initialization.
- c) Stability under resampling is guaranteed after feature scaling.
- d) Increasing the number of clusters always reduces the variability of
results across resamples.

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe how WCSS (within-cluster
sum of squares, or inertia) behaves in k-means clustering?

- a) For a fixed number of clusters, WCSS is lower when clusters are compact.
- b) For a fixed number of clusters, WCSS is lower for wider clusters.
- c) For a fixed number of clusters, lower WCSS implies lower computational cost
during training.
- d) Assuming `n_init` is large enough to ensure convergence, WCSS always
decreases as the number of clusters increases.

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe differences between
supervised and unsupervised clustering metrics?

- a) Supervised clustering metrics such as ARI and AMI require access to ground
truth labels to evaluate clustering performance.
- b) WCSS and the silhouette score evaluate internal cluster structure without
needing reference labels.
- c) V-measure is zero when labels are assigned completely at random.
- d) Supervised clustering metrics are not useful if the number of clusters does
not match the number of predefined classes.

_Select all answers that apply_
```
45 changes: 45 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_02.md
@@ -0,0 +1,45 @@
# ✅ Quiz M4.02

```{admonition} Question
If we increase `min_cluster_size` in HDBSCAN, what happens to the number of
points labeled as noise?

- a) It decreases.
- b) It increases.
- c) It stays the same.
- d) HDBSCAN fails to converge.

_Select a single answer_

```

+++

```{admonition} Question
What happens to k-means centroids in the presence of outliers?

- a) They move towards the outliers assigned to their cluster.
- b) They are not sensitive to outliers.
- c) If a centroid is initialized on an outlier, it may remain isolated in
subsequent iterations.

_Select all answers that apply_

```

+++

```{admonition} Question
A `KMeans` instance with `n_clusters=10` is used to transform the latitude and
longitude features in a supervised learning pipeline. Given that the original
dataset consists of `n_features` features, including those two, how many
features are passed to the final estimator of the pipeline?

- a) `n_features` + 10
- b) `n_features` + 8
- c) `n_features` - 2
- d) `n_features`

_Select a single answer_

```