Commit dae3e2e

updates user guide for enn, renn and allknn
1 parent 3f0e265

1 file changed: doc/under_sampling.rst (+44, -18)
@@ -198,7 +198,7 @@ Cleaning under-sampling techniques
 ----------------------------------
 
 Cleaning under-sampling techniques do not allow to specify the number of
-samples to have in each class. In fact, each algorithm implement an heuristic
+samples to have in each class. In fact, each algorithm implements a heuristic
 which will clean the dataset.
 
 .. _tomek_links:
@@ -240,11 +240,17 @@ figure illustrates this behaviour.
 Edited data set using nearest neighbours
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
-"edit" the dataset by removing samples which do not agree "enough" with their
-neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
-under-sampled, the nearest-neighbours are computed and if the selection
-criterion is not fulfilled, the sample is removed::
+:class:`EditedNearestNeighbours` trains a nearest-neighbors algorithm and
+then looks at the closest neighbours of each data point of the class to be
+under-sampled, and "edits" the dataset by removing samples which do not agree
+"enough" with their neighborhood :cite:`wilson1972asymptotic`. In short,
+a KNN algorithm is trained on the data. Then, for each sample in the class
+to be under-sampled, the (K-1) nearest-neighbours are identified. Note that
+if a 4-KNN algorithm is trained, only 3 neighbours will be examined, because
+the sample being inspected is the fourth neighbour returned by the algorithm.
+Once the neighbours are identified, the sample is kept if all (or most of)
+the neighbours agree with its class; otherwise, it is removed. Check the
+selection criteria below::
 
   >>> sorted(Counter(y).items())
   [(0, 64), (1, 262), (2, 4674)]
@@ -256,10 +262,9 @@ criterion is not fulfilled, the sample is removed::
 
 Two selection criteria are currently available: (i) the majority (i.e.,
 ``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
-nearest-neighbors have to belong to the same class than the sample inspected to
-keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
-conservative than `kind_sel='mode'`, and more samples will be excluded in
-the former strategy than the latest::
+nearest-neighbors must belong to the same class as the sample inspected to
+keep it in the dataset. This means that `kind_sel='all'` will be less
+conservative than `kind_sel='mode'`, and more samples will be excluded::
 
   >>> enn = EditedNearestNeighbours(kind_sel="all")
   >>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,32 +275,53 @@ the former strategy than the latest::
   >>> print(sorted(Counter(y_resampled).items()))
   [(0, 64), (1, 234), (2, 4666)]
 
-The parameter ``n_neighbors`` allows to give a classifier subclassed from
-``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
-the decision to keep a given sample or not.
+The parameter ``n_neighbors`` can take a classifier subclassed from
+``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
+Alternatively, an integer can be passed to indicate the size of the
+neighborhood to examine to make a decision. Note that if ``n_neighbors=3``,
+the edited nearest neighbours will look at the 3 closest neighbours of
+each sample; thus, a 4-KNN algorithm will be trained on the data.
 
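For illustration, a minimal sketch of both forms of the parameter (using
scikit-learn's ``NearestNeighbors``, which inherits from ``KNeighborsMixin``;
the values are arbitrary)::

  >>> from sklearn.neighbors import NearestNeighbors
  >>> # an integer: examine the 3 closest neighbours (a 4-KNN is trained)
  >>> enn = EditedNearestNeighbours(n_neighbors=3)
  >>> # or a KNeighborsMixin estimator configured by the user
  >>> enn = EditedNearestNeighbours(n_neighbors=NearestNeighbors(n_neighbors=4))
  >>> X_resampled, y_resampled = enn.fit_resample(X, y)
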
 
 :class:`RepeatedEditedNearestNeighbours` extends
 :class:`EditedNearestNeighbours` by repeating the algorithm multiple times
 :cite:`tomek1976experiment`. Generally, repeating the algorithm will delete
-more data::
+more data. The user indicates how many times to repeat the algorithm
+through the parameter ``max_iter``::
 
   >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
   >>> renn = RepeatedEditedNearestNeighbours()
   >>> X_resampled, y_resampled = renn.fit_resample(X, y)
   >>> print(sorted(Counter(y_resampled).items()))
   [(0, 64), (1, 208), (2, 4551)]
 
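A short sketch of setting the number of repetitions explicitly
(``max_iter=5`` is an arbitrary choice for illustration)::

  >>> renn = RepeatedEditedNearestNeighbours(max_iter=5)
  >>> X_resampled, y_resampled = renn.fit_resample(X, y)
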
-:class:`AllKNN` differs from the previous
-:class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
-internal nearest neighbors algorithm is increased at each iteration
-:cite:`tomek1976experiment`::
+:class:`AllKNN` extends :class:`EditedNearestNeighbours` by repeating
+the algorithm multiple times, each time with an additional neighbour
+:cite:`tomek1976experiment`. In other words, :class:`AllKNN` differs
+from :class:`RepeatedEditedNearestNeighbours` in that the number of
+neighbors of the internal nearest neighbors algorithm increases at
+each iteration. In short, in the first iteration, a 2-KNN algorithm
+is trained on the data to examine the 1 closest neighbour of each
+sample from the class to be under-sampled. In each subsequent
+iteration, the neighbourhood examined grows by 1, until it reaches
+the size indicated in the parameter ``n_neighbors``::
 
   >>> from imblearn.under_sampling import AllKNN
   >>> allknn = AllKNN()
   >>> X_resampled, y_resampled = allknn.fit_resample(X, y)
   >>> print(sorted(Counter(y_resampled).items()))
   [(0, 64), (1, 220), (2, 4601)]
 
+
+The parameter ``n_neighbors`` can take an integer to indicate the size
+of the neighborhood to examine to make a decision in the last iteration.
+Thus, if ``n_neighbors=3``, :class:`AllKNN` will examine the 1 closest
+neighbour in the first iteration, the 2 closest neighbours in the second
+iteration and the 3 closest neighbours in the third iteration. The
+parameter ``n_neighbors`` can also take a classifier subclassed from
+``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
+Again, this will be the KNN used in the last iteration.
+
 In the example below, it can be seen that the three algorithms have similar
 impact by cleaning noisy samples next to the boundaries of the classes.
 