Commits (50)
2ec9530
WIP Add module about clustering
May 22, 2025
8ab8ff8
Iter on Kmeans exercise
May 27, 2025
d828ada
Synch exercise notebooks
May 27, 2025
d94eb14
Add notebooks on hdbscan and feature engineering
May 27, 2025
fd41577
Reworked the k-means intro notebook to use penguins dataset
ogrisel May 28, 2025
90bdfed
Rerender the first notebook
ogrisel May 28, 2025
8ccd658
Add some missing cell markers
ogrisel May 28, 2025
0a2dfa3
Rerender the first notebook
ogrisel May 28, 2025
83bc291
More missing markers
ogrisel May 28, 2025
60d32eb
Rerender the first notebook
ogrisel May 28, 2025
2621cec
Improve phrasing / fix typos
ogrisel May 28, 2025
57fafec
Typo
ogrisel May 28, 2025
51a0d13
Rerender the first notebook
ogrisel May 28, 2025
6b26222
Iter on Olivier's work
May 30, 2025
fe58a3e
General rewording
Jun 2, 2025
7374ae5
Apply suggestions from code review
ArturoAmorQ Jun 2, 2025
fd03b1a
Rephrasing in cluster_kmeans_sol_01.py
ogrisel Jun 3, 2025
da9a3ec
Resynchronize exercise and fix CI
Jun 3, 2025
d4ad40c
Wording
Jun 3, 2025
8ab3e29
Use MAE to score predicted house prices
Jun 4, 2025
52f244a
Solve plotly DeprecationWarning
Jun 4, 2025
dec3453
Prefer make_column_transformer as per #831
Jun 4, 2025
00c41a7
Iter on hdbscan notebook
Jun 5, 2025
ffe1855
Remove redundant paragraph
Jun 5, 2025
26cd3d2
Rename exercise and solution
Jun 5, 2025
50e9bc0
Add exercise and solution using AMI
Jun 5, 2025
d7e03e6
Fix exercise
Jun 6, 2025
900f0da
Small improvements to the solution of exercise 02
ogrisel Jun 6, 2025
87d438f
Add the skrub dependency
ogrisel Jun 6, 2025
166868c
Expand analysis a bit
ogrisel Jun 6, 2025
74245d3
Improvements in the HDBSCAN notebook
ogrisel Jun 6, 2025
9bfe2f2
Reworded analysis of the BBC text clustering notebook + use cross-val…
ogrisel Jun 6, 2025
a765e53
Improvements in the supervised metrics notebook
ogrisel Jun 6, 2025
024545f
Add discussion on silhouette for hdbscan
Jun 9, 2025
2647274
Fix warning and plot not rendering
Jun 9, 2025
bd32b87
Add intro, overview and sections
Jun 10, 2025
27f4e52
Iter discussion on silhouette for hdbscan
Jun 10, 2025
b8f6646
Tweaks
Jun 10, 2025
d328631
Add first quiz on clustering and related images
Jun 11, 2025
c32eabe
Wording tweaks
Jun 11, 2025
89110ea
Add second quiz on clustering
Jun 20, 2025
d4b975e
Apply suggestions from code review
ArturoAmorQ Jun 24, 2025
b016d8b
Update jupyter-book/clustering/clustering_module_take_away.md
ArturoAmorQ Jun 24, 2025
a72be00
Synchronize quizzes from review
Jun 24, 2025
60a58c4
Merge branch 'clustering_module' of github.com:ArturoAmorQ/scikit-lea…
Jun 24, 2025
aff2dba
Synchronize notebooks
Jun 24, 2025
0c161ee
Add clustering wrap-up quiz
Jul 15, 2025
54ff951
Fix bug in wrap-up quiz
Jul 15, 2025
3fa71e9
Add wrap-up quiz to toc
Jul 15, 2025
5c7b28c
Fix a couple of bugs
Aug 12, 2025
1,251 changes: 1,251 additions & 0 deletions datasets/bbc_news.csv

Large diffs are not rendered by default.

171 changes: 171 additions & 0 deletions datasets/periodic_signals.csv

Large diffs are not rendered by default.

5,882 changes: 5,882 additions & 0 deletions datasets/rfm_segmentation.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions environment-dev.yml
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ dependencies:
- matplotlib-base
- seaborn >= 0.13
- plotly >= 5.10
- skrub
- jupytext
- beautifulsoup4
- IPython
1 change: 1 addition & 0 deletions environment.yml
@@ -8,6 +8,7 @@ dependencies:
- pandas >= 1
- matplotlib-base
- seaborn >= 0.13
- skrub
- jupyterlab
- notebook
- plotly >= 5.10
1,004 changes: 1,004 additions & 0 deletions figures/clustering_quiz_kmeans_not_scaled.svg
980 changes: 980 additions & 0 deletions figures/clustering_quiz_kmeans_scaled.svg
19 changes: 19 additions & 0 deletions jupyter-book/_toc.yml
@@ -236,3 +236,22 @@ parts:
chapters:
- file: python_scripts/dev_features_importance
- file: interpretation/interpretation_quiz
- caption: 🚧 Clustering
chapters:
- file: clustering/clustering_module_intro
- file: clustering/clustering_kmeans_index
sections:
- file: python_scripts/clustering_kmeans
- file: python_scripts/clustering_ex_01
- file: python_scripts/clustering_sol_01
- file: python_scripts/clustering_supervised_metrics
- file: python_scripts/clustering_ex_02
- file: python_scripts/clustering_sol_02
- file: clustering/clustering_quiz_m4_01
- file: clustering/clustering_assumptions_index
sections:
- file: python_scripts/clustering_hdbscan
- file: python_scripts/clustering_transformer
- file: clustering/clustering_quiz_m4_02
- file: clustering/clustering_wrap_up_quiz
- file: clustering/clustering_module_take_away
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_assumptions_index.md
@@ -0,0 +1,5 @@
# Clustering when k-means assumptions fail

```{tableofcontents}

```
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_kmeans_index.md
@@ -0,0 +1,5 @@
# K-means

```{tableofcontents}

```
56 changes: 56 additions & 0 deletions jupyter-book/clustering/clustering_module_intro.md
@@ -0,0 +1,56 @@
# Module overview

## What you will learn

<!-- Give in plain English what the module is about -->

In the previous modules, we introduced the development, tuning and evaluation
of **supervised** machine learning models and pipelines.

In this module we present an **unsupervised** learning task, namely clustering.
In particular, we focus on the k-means algorithm and consider how to evaluate
such models through the concept of cluster stability and through unsupervised
metrics such as the silhouette score and inertia. We also introduce supervised
clustering metrics that leverage annotated data to assess clustering quality.

Finally, we discuss what to do when the assumptions of k-means do not hold, such
as using HDBSCAN for non-convex clusters, and show how k-means can still be
useful as a feature engineering step in a supervised learning pipeline, by using
distances to centroids.
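The centroid-distance idea can be sketched as follows. This is a minimal
illustration on synthetic data, not the module's notebooks; the `Ridge`
regressor and the toy target are arbitrary placeholders:

```python
# Minimal sketch: k-means as a feature-engineering step. Used as a pipeline
# transformer, KMeans.transform returns the Euclidean distance of each sample
# to every centroid, so the final estimator receives n_clusters features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
y = X[:, 0] + np.random.RandomState(0).normal(size=200)  # synthetic target

model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),  # transform -> distances
    Ridge(),
)
model.fit(X, y)

# The sub-pipeline up to (and including) KMeans exposes the derived features.
distance_features = model[:-1].transform(X)
print(distance_features.shape)  # one distance column per centroid
```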


## Before getting started

<!-- Give the required skills for the module -->

The technical skills required to follow this module are:

- skills acquired during the "The Predictive Modeling Pipeline" module with
basic usage of scikit-learn;
- skills acquired during the "Selecting The Best Model" module, mainly the
  concept of validation curves and the notion of model stability.

<!-- Point to resources to learning these skills -->

## Objectives and time schedule

<!-- Give the learning objectives -->

The objectives of this module are the following:

- apply k-means clustering and assess its behavior across different settings;
- evaluate cluster quality using unsupervised metrics such as silhouette score
and WCSS (also known as inertia);
- interpret and compute supervised clustering metrics (e.g., AMI, ARI,
V-measure) when ground truth labels are available;
- understand the limitations of k-means and identify cases where its assumptions
(e.g., convex, isotropic clusters) do not hold;
- use HDBSCAN as an alternative clustering method suited for irregular or
non-convex cluster shapes;
- integrate k-means into a supervised learning pipeline by using distances to
centroids as features.
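The metric-related objectives above can be previewed with a short sketch. This
uses a synthetic blob dataset rather than the module's data, and is only an
illustration of the API:

```python
# Sketch: unsupervised metrics (inertia/WCSS, silhouette) need no labels;
# supervised metrics (AMI, ARI) compare predictions to ground truth.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    silhouette_score,
)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Unsupervised metrics: computed from the data and labels alone.
print(f"inertia (WCSS): {km.inertia_:.1f}")
print(f"silhouette: {silhouette_score(X, labels):.2f}")

# Supervised metrics: require the ground-truth labels y_true.
print(f"AMI: {adjusted_mutual_info_score(y_true, labels):.2f}")
print(f"ARI: {adjusted_rand_score(y_true, labels):.2f}")
```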

<!-- Give the investment in time -->

The estimated time to go through this module is about 6 hours.
26 changes: 26 additions & 0 deletions jupyter-book/clustering/clustering_module_take_away.md
@@ -0,0 +1,26 @@
# Main take-away

## Wrap-up

<!-- Quick wrap-up for the module -->

In this module, we presented the framework used in unsupervised learning with
clustering, focusing on k-means and how to evaluate its results using both
unsupervised and supervised metrics.

We explored the concept of cluster stability, addressed the limitations of
k-means when clusters are not convex, and introduced HDBSCAN as an alternative.

Finally, we showed how clustering can be integrated into supervised pipelines
to perform unsupervised feature engineering.

## To go further

<!-- Some extra links of content to go further -->

You can refer to the following scikit-learn examples, which are related to
the concepts covered in this module:

- [Adjustment for chance in clustering performance evaluation](https://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html)
- [Demonstration of k-means assumptions](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)
- [Clustering text documents using k-means](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html)
89 changes: 89 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_01.md
@@ -0,0 +1,89 @@
# ✅ Quiz M4.01

```{admonition} Question
Imagine you work for a music streaming platform that hosts a vast library of
songs, playlists, and podcasts. You have access to detailed listening data from
millions of users. For each user, you know their most-listened genres, the
devices they use, their average session length, and how often they explore new
content.

You want to segment users based on their listening patterns to improve
personalized recommendations, without relying on rigid, predefined labels like
"pop fan" or "casual listener" which may fail to capture the complexity of
their behavior.

What kind of problem are you dealing with?

- a) a supervised task
- b) an unsupervised task
- c) a classification task
- d) a clustering task

_Select all answers that apply_
```

+++

```{admonition} Question
The plots below show the cluster labels found by k-means with 3 clusters; the
two runs differ only in whether a feature scaling step was applied. Based on
this, which conclusion can be drawn?

![K-means on original features](../../figures/clustering_quiz_kmeans_not_scaled.svg)
![K-means on scaled features](../../figures/clustering_quiz_kmeans_scaled.svg)

- a) without scaling, cluster assignment is dominated by the feature in the vertical axis
- b) without scaling, cluster assignment is dominated by the feature in the horizontal axis
- c) without scaling, both features contribute equally to cluster assignment

_Select a single answer_
```

+++

```{admonition} Question
Which of the following statements correctly describe factors that affect the
stability of k-means clustering across different resampling iterations of the data?

- a) K-means can produce different results on resampled datasets due to
sensitivity to initialization.
- b) If data is unevenly distributed, the stability improves when increasing the
parameter `n_init` in the "k-means++" initialization.
- c) Stability under resampling is guaranteed after feature scaling.
- d) Increasing the number of clusters always reduces the variability of
results across resamples.

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe how WCSS (within-cluster
sum of squares, or inertia) behaves in k-means clustering?

- a) For a fixed number of clusters, WCSS is lower when clusters are compact.
- b) For a fixed number of clusters, WCSS is lower for wider clusters.
- c) For a fixed number of clusters, lower WCSS implies lower computational cost
during training.
- d) Assuming `n_init` is large enough to ensure convergence, WCSS always
decreases as the number of clusters increases.

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe differences between
supervised and unsupervised clustering metrics?

- a) Supervised clustering metrics such as ARI and AMI require access to ground
truth labels to evaluate clustering performance.
- b) WCSS and the silhouette score evaluate internal cluster structure without
needing reference labels.
- c) V-measure is zero when labels are assigned completely at random.
- d) Supervised clustering metrics are not useful if the number of clusters does
not match the number of predefined classes.

_Select all answers that apply_
```
45 changes: 45 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_02.md
@@ -0,0 +1,45 @@
# ✅ Quiz M4.02

```{admonition} Question
If we increase `min_cluster_size` in HDBSCAN, what happens to the number of
points labeled as noise?

- a) It decreases.
- b) It increases.
- c) It stays the same.
- d) HDBSCAN fails to converge.

_Select a single answer_

```

+++

```{admonition} Question
What happens to k-means centroids in the presence of outliers?

- a) They move towards the outliers assigned to their cluster.
- b) They are not sensitive to outliers.
- c) If a centroid is initialized on an outlier, it may remain isolated in
subsequent iterations.

_Select all answers that apply_

```

+++

```{admonition} Question
A `KMeans` instance with `n_clusters=10` is used to transform the latitude and
longitude features in a supervised learning pipeline. Given that the original
dataset consists of `n_features` features, including those two, how many
features are passed to the final estimator of the pipeline?

- a) `n_features` + 10
- b) `n_features` + 8
- c) `n_features` - 2
- d) `n_features`

_Select a single answer_

```