1- """Class to perform under-sampling based on the edited nearest neighbour
1+ """Classes to perform under-sampling based on the edited nearest neighbour
22method."""
33
44# Authors: Guillaume Lemaitre <[email protected]>
2828class EditedNearestNeighbours(BaseCleaningSampler):
2929 """Undersample based on the edited nearest neighbour method.
3030
31- This method will clean the database by removing samples close to the
32- decision boundary.
31+ This method cleans the dataset by removing samples close to the
32+ decision boundary. It removes observations from the majority class or
33+ classes when any or most of their closest neighbours are from a different class.
3334
3435 Read more in the :ref:`User Guide <edited_nearest_neighbors>`.
3536
@@ -38,29 +39,31 @@ class EditedNearestNeighbours(BaseCleaningSampler):
3839 {sampling_strategy}
3940
4041 n_neighbors : int or object, default=3
41- If ``int``, size of the neighbourhood to consider to compute the
42- nearest neighbors. If object, an estimator that inherits from
43- :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
44- find the nearest-neighbors.
42+ If ``int``, size of the neighbourhood to consider for the undersampling, i.e.,
43+ if `n_neighbors=3`, a sample will be removed when any or most of its 3 closest
44+ neighbours are from a different class. If object, an estimator that inherits
45+ from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
46+ find the nearest-neighbors. Note that if you want to examine the 3 closest
47+ neighbours of a sample for the undersampling, you need to pass a 4-KNN.
4548
4649 kind_sel : {{'all', 'mode'}}, default='all'
47- Strategy to use in order to exclude samples.
50+ Strategy to use to exclude samples.
4851
49- - If ``'all'``, all neighbours will have to agree with the samples of
50- interest to not be excluded.
51- - If ``'mode'``, the majority vote of the neighbours will be used in
52- order to exclude a sample .
52+ - If ``'all'``, all neighbours should be of the same class as the examined
53+ sample for it not to be excluded.
54+ - If ``'mode'``, most neighbours should be of the same class as the examined
55+ sample for it not to be excluded.
5356
5457 The strategy `"all"` will be less conservative than `'mode'`. Thus,
55- more samples will be removed when `kind_sel="all"` generally.
58+ more samples will be removed when `kind_sel="all"`, generally.
5659
5760 {n_jobs}
5861
5962 Attributes
6063 ----------
6164 sampling_strategy_ : dict
6265 Dictionary containing the information to sample the dataset. The keys
63- corresponds to the class labels from which to sample and the values
66+ correspond to the class labels from which to sample and the values
6467 are the number of samples to sample.
6568
6669 nn_ : estimator object
@@ -86,9 +89,9 @@ class EditedNearestNeighbours(BaseCleaningSampler):
8689 --------
8790 CondensedNearestNeighbour : Undersample by condensing samples.
8891
89- RepeatedEditedNearestNeighbours : Undersample by repeating ENN algorithm.
92+ RepeatedEditedNearestNeighbours : Undersample by repeating the ENN algorithm.
9093
91- AllKNN : Undersample using ENN and various number of neighbours.
94+ AllKNN : Undersample using ENN with a varying number of neighbours.
9295
9396 Notes
9497 -----
@@ -197,7 +200,11 @@ def _more_tags(self):
197200class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
198201 """Undersample based on the repeated edited nearest neighbour method.
199202
200- This method will repeat several time the ENN algorithm.
203+ This method repeats the :class:`EditedNearestNeighbours` algorithm several times.
204+ The repetitions will stop when i) the maximum number of iterations is reached,
205+ ii) no more observations are being removed, iii) one of the majority classes
206+ becomes a minority class, or iv) one of the majority classes disappears
207+ during undersampling.
201208
202209 Read more in the :ref:`User Guide <edited_nearest_neighbors>`.
203210
@@ -206,33 +213,34 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
206213 {sampling_strategy}
207214
208215 n_neighbors : int or object, default=3
209- If ``int``, size of the neighbourhood to consider to compute the
210- nearest neighbors. If object, an estimator that inherits from
211- :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
212- find the nearest-neighbors.
216+ If ``int``, size of the neighbourhood to consider for the undersampling, i.e.,
217+ if `n_neighbors=3`, a sample will be removed when any or most of its 3 closest
218+ neighbours are from a different class. If object, an estimator that inherits
219+ from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
220+ find the nearest-neighbors. Note that if you want to examine the 3 closest
221+ neighbours of a sample for the undersampling, you need to pass a 4-KNN.
213222
214223 max_iter : int, default=100
215- Maximum number of iterations of the edited nearest neighbours
216- algorithm for a single run.
224+ Maximum number of iterations of the edited nearest neighbours algorithm.
217225
218226 kind_sel : {{'all', 'mode'}}, default='all'
219- Strategy to use in order to exclude samples.
227+ Strategy to use to exclude samples.
220228
221- - If ``'all'``, all neighbours will have to agree with the samples of
222- interest to not be excluded.
223- - If ``'mode'``, the majority vote of the neighbours will be used in
224- order to exclude a sample .
229+ - If ``'all'``, all neighbours should be of the same class as the examined
230+ sample for it not to be excluded.
231+ - If ``'mode'``, most neighbours should be of the same class as the examined
232+ sample for it not to be excluded.
225233
226234 The strategy `"all"` will be less conservative than `'mode'`. Thus,
227- more samples will be removed when `kind_sel="all"` generally.
235+ more samples will be removed when `kind_sel="all"`, generally.
228236
229237 {n_jobs}
230238
231239 Attributes
232240 ----------
233241 sampling_strategy_ : dict
234242 Dictionary containing the information to sample the dataset. The keys
235- corresponds to the class labels from which to sample and the values
243+ correspond to the class labels from which to sample and the values
236244 are the number of samples to sample.
237245
238246 nn_ : estimator object
@@ -269,7 +277,7 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
269277
270278 EditedNearestNeighbours : Undersample by editing samples.
271279
272- AllKNN : Undersample using ENN and various number of neighbours.
280+ AllKNN : Undersample using ENN with a varying number of neighbours.
273281
274282 Notes
275283 -----
@@ -413,8 +421,12 @@ def _more_tags(self):
413421class AllKNN(BaseCleaningSampler):
414422 """Undersample based on the AllKNN method.
415423
416- This method will apply ENN several time and will vary the number of nearest
417- neighbours.
424+ This method will apply :class:`EditedNearestNeighbours` several times, varying the
425+ number of nearest neighbours at each round. It begins by examining the 1 closest
426+ neighbour, and it increases the neighbourhood by 1 at each round.
427+
428+ The algorithm stops when the maximum number of neighbours has been examined or
429+ when the majority class becomes the minority class, whichever comes first.
418430
419431 Read more in the :ref:`User Guide <edited_nearest_neighbors>`.
420432
@@ -423,21 +435,23 @@ class AllKNN(BaseCleaningSampler):
423435 {sampling_strategy}
424436
425437 n_neighbors : int or estimator object, default=3
426- If ``int``, size of the neighbourhood to consider to compute the
427- nearest neighbors. If object, an estimator that inherits from
428- :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
429- find the nearest-neighbors. By default, it will be a 3-NN.
438+ If ``int``, size of the maximum neighbourhood to examine for the undersampling.
439+ If `n_neighbors=3`, in the first iteration the algorithm will examine the 1 closest
440+ neighbour, in the second round 2, and in the final round 3. If object, an
441+ estimator that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin`
442+ that will be used to find the nearest-neighbors. Note that if you want to
443+ examine the 3 closest neighbours of a sample, you need to pass a 4-KNN.
430444
431445 kind_sel : {{'all', 'mode'}}, default='all'
432- Strategy to use in order to exclude samples.
446+ Strategy to use to exclude samples.
433447
434- - If ``'all'``, all neighbours will have to agree with the samples of
435- interest to not be excluded.
436- - If ``'mode'``, the majority vote of the neighbours will be used in
437- order to exclude a sample .
448+ - If ``'all'``, all neighbours should be of the same class as the examined
449+ sample for it not to be excluded.
450+ - If ``'mode'``, most neighbours should be of the same class as the examined
451+ sample for it not to be excluded.
438452
439453 The strategy `"all"` will be less conservative than `'mode'`. Thus,
440- more samples will be removed when `kind_sel="all"` generally.
454+ more samples will be removed when `kind_sel="all"`, generally.
441455
442456 allow_minority : bool, default=False
443457 If ``True``, it allows the majority classes to become the minority
@@ -451,7 +465,7 @@ class without early stopping.
451465 ----------
452466 sampling_strategy_ : dict
453467 Dictionary containing the information to sample the dataset. The keys
454- corresponds to the class labels from which to sample and the values
468+ correspond to the class labels from which to sample and the values
455469 are the number of samples to sample.
456470
457471 nn_ : estimator object
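
To make the `kind_sel` wording in these docstrings concrete, here is a minimal pure-Python sketch of the exclusion rule as the new text describes it. It is not the library's implementation: `enn_keep_mask` and `classes_to_clean` are hypothetical names introduced only for this illustration, and the toy data is made up. Each sample is skipped when ranking its own neighbours, which is the reason the docstring says a 4-KNN is needed to examine 3 neighbours.

```python
from collections import Counter
from math import dist


def enn_keep_mask(X, y, classes_to_clean, n_neighbors=3, kind_sel="all"):
    """Return one boolean per sample: True if the sample is kept.

    Hypothetical helper for illustration only, not part of imbalanced-learn.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi not in classes_to_clean:
            keep.append(True)  # samples outside the targeted classes are never removed
            continue
        # Rank all *other* samples by distance. Skipping the sample itself is
        # equivalent to querying n_neighbors + 1 neighbours and dropping the first.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )
        labels = [y[j] for j in others[:n_neighbors]]
        if kind_sel == "all":
            # Every neighbour must share the sample's class.
            keep.append(all(label == yi for label in labels))
        elif kind_sel == "mode":
            # The most frequent neighbour class must be the sample's class.
            keep.append(Counter(labels).most_common(1)[0][0] == yi)
        else:
            raise ValueError(f"unknown kind_sel: {kind_sel!r}")
    return keep


# Toy 1-D data: class 0 is the majority, class 1 the minority.
X = [(0.5,), (1.0,), (2.0,), (3.1,), (4.0,), (4.6,), (5.0,)]
y = [0, 0, 0, 0, 1, 0, 1]

keep_all = enn_keep_mask(X, y, classes_to_clean={0}, kind_sel="all")
keep_mode = enn_keep_mask(X, y, classes_to_clean={0}, kind_sel="mode")
print(keep_all)
print(keep_mode)
```

On this toy data, the sample at 3.1 has neighbour labels `[1, 0, 0]`, so `'all'` removes it while `'mode'` keeps it. The sample at 4.6 has two minority neighbours out of three, so both strategies remove it. This matches the docstring's statement that `'all'` generally removes more samples than `'mode'`.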