Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 28 additions & 16 deletions doc/under_sampling.rst
Original file line number Diff line number Diff line change
Expand Up @@ -237,14 +237,23 @@ figure illustrates this behaviour.

.. _edited_nearest_neighbors:

Edited data set using nearest neighbours
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Editing data using nearest neighbours
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
"edit" the dataset by removing samples which do not agree "enough" with their
neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
under-sampled, the nearest-neighbours are computed and if the selection
criterion is not fulfilled, the sample is removed::
Edited nearest neighbours
~~~~~~~~~~~~~~~~~~~~~~~~~

The edited nearest neighbours methodology uses K-Nearest Neighbours to identify the
neighbours of the targeted class samples, and then removes observations if any or most
of their neighbours are from a different class :cite:`wilson1972asymptotic`.

:class:`EditedNearestNeighbours` carries out the following steps:

1. Train a K-Nearest neighbours using the entire dataset.
2. Find each observations' K closest neighbours (only for the targeted classes).
3. Remove observations if any or most of its neighbours belong to a different class.

Below the code implementation::

>>> sorted(Counter(y).items())
[(0, 64), (1, 262), (2, 4674)]
Expand All @@ -254,12 +263,12 @@ criterion is not fulfilled, the sample is removed::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 213), (2, 4568)]

Two selection criteria are currently available: (i) the majority (i.e.,
``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
nearest-neighbors have to belong to the same class than the sample inspected to
keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
conservative than `kind_sel='mode'`, and more samples will be excluded in
the former strategy than the latest::

To paraphrase step 3, :class:`EditedNearestNeighbours` will retain observations from
the majority class when **most**, or **all** of its neighbours are from the same class.
To control this behaviour we set ``kind_sel='mode'`` or ``kind_sel='all'``,
respectively. Hence, `kind_sel='all'` is less conservative than `kind_sel='mode'`,
resulting in the removal of more samples::

>>> enn = EditedNearestNeighbours(kind_sel="all")
>>> X_resampled, y_resampled = enn.fit_resample(X, y)
Expand All @@ -270,9 +279,12 @@ the former strategy than the latest::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 234), (2, 4666)]

The parameter ``n_neighbors`` allows to give a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
the decision to keep a given sample or not.
The parameter ``n_neighbors`` accepts integers. The integer refers to the number of
neighbours to examine for each sample. It can also take a classifier subclassed from
``KNeighborsMixin`` from scikit-learn. When passing a classifier, note that, if you
pass a 3-Nearest Neighbors classifier, only 2 neighbours will be examined for the cleaning, as the
third sample is the one being examined for undersampling since it is part of the
samples provided at `fit`.

:class:`RepeatedEditedNearestNeighbours` extends
:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
Expand Down