@@ -6,7 +6,25 @@ Under-sampling
66
77.. currentmodule:: imblearn.under_sampling
88
9- You can refer to
9+ One way of handling imbalanced datasets is to reduce the number of observations in
10+ all classes but the minority class. The minority class is the one with the fewest
11+ observations. The best-known algorithm in this group is random
12+ undersampling, where samples from the targeted classes are removed at random.
13+
14+ Many other algorithms can also reduce the number of observations in the
15+ dataset. These algorithms can be grouped by their undersampling strategy into:
16+
17+ - Prototype generation methods
18+ - Prototype selection methods
19+
20+ Within the latter, we find:
21+
22+ - Controlled undersampling
23+ - Cleaning methods
24+
25+ We will discuss the different algorithms throughout this document.
26+
27+ See also
1028:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
1129
1230.. _cluster_centroids:
@@ -16,7 +34,7 @@ Prototype generation
1634
1735Given an original data set :math:`S`, prototype generation algorithms will
1836generate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \not \subset
19- S`. In other words, prototype generation technique will reduce the number of
37+ S`. In other words, prototype generation techniques will reduce the number of
2038samples in the targeted classes but the remaining samples are generated --- and
2139not selected --- from the original set.
2240
@@ -61,16 +79,22 @@ original one.
6179Prototype selection
6280===================
6381
64- On the contrary to prototype generation algorithms, prototype selection
65- algorithms will select samples from the original set :math:`S`. Therefore,
66- :math:`S'` is defined such as :math:`|S'| < |S|` and :math:`S' \subset S`.
82+ Prototype selection algorithms select samples from the original set :math:`S`,
83+ producing a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \subset S`. In
84+ other words, :math:`S'` is a subset of :math:`S`.
85+
86+ Prototype selection algorithms can be divided into two groups: (i) controlled
87+ under-sampling techniques and (ii) cleaning under-sampling techniques.
88+
89+ Controlled under-sampling methods reduce the number of observations in the majority
90+ class or classes to a number of samples specified by the user. Typically,
91+ they reduce the number of observations to the number of samples in the
92+ minority class.
6793
68- In addition, these algorithms can be divided into two groups: (i) the
69- controlled under-sampling techniques and (ii) the cleaning under-sampling
70- techniques. The first group of methods allows for an under-sampling strategy in
71- which the number of samples in :math:`S'` is specified by the user. By
72- contrast, cleaning under-sampling techniques do not allow this specification
73- and are meant for cleaning the feature space.
94+ In contrast, cleaning under-sampling techniques "clean" the feature space by removing
95+ either "noisy" or "too easy to classify" observations, depending on the method. The
96+ final number of observations in each class varies with the cleaning method and cannot be
97+ specified by the user.
7498
7599.. _controlled_under_sampling:
76100
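To make the controlled under-sampling idea in the added text concrete, here is a minimal pure-Python sketch of random under-sampling down to the minority class size. It deliberately does not use imbalanced-learn. The function name `random_under_sample`, its `seed` parameter, and the toy data are assumptions made for this illustration and are not part of the library or of this change.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Hypothetical helper: randomly drop samples from every class except
    the minority class until each class has the minority class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())  # size of the minority class

    kept = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        if len(idx) > n_min:
            # Targeted class: keep a random subset of its samples.
            idx = rng.sample(idx, n_min)
        kept.extend(idx)

    kept.sort()  # preserve the original ordering of the retained samples
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 9 samples of class 0 (majority) and 3 of class 1 (minority).
X = [[i] for i in range(12)]
y = [0] * 9 + [1] * 3

X_res, y_res = random_under_sample(X, y)
print(Counter(y_res))
```

Every retained sample comes from the original data, so the result is a subset of the input, which is the defining property of prototype selection described above. A cleaning method would differ in that the final class sizes could not be fixed in advance.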