@@ -198,7 +198,7 @@ Cleaning under-sampling techniques
198198----------------------------------
199199
200200Cleaning under-sampling techniques do not allow specifying the number of
201- samples to have in each class. In fact, each algorithm implement an heuristic
201+ samples to have in each class. In fact, each algorithm implements a heuristic
202202which will clean the dataset.
203203
204204.. _tomek_links:
@@ -240,11 +240,17 @@ figure illustrates this behaviour.
240240Edited data set using nearest neighbours
241241^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
242242
243- :class: `EditedNearestNeighbours ` applies a nearest-neighbors algorithm and
244- "edit" the dataset by removing samples which do not agree "enough" with their
245- neighboorhood :cite: `wilson1972asymptotic `. For each sample in the class to be
246- under-sampled, the nearest-neighbours are computed and if the selection
247- criterion is not fulfilled, the sample is removed::
243+ :class:`EditedNearestNeighbours` trains a nearest-neighbours algorithm
244+ and then looks at the closest neighbours of each data point of the class
245+ to be under-sampled, "editing" the dataset by removing samples which do
246+ not agree "enough" with their neighbourhood :cite:`wilson1972asymptotic`.
247+ In short, a KNN algorithm is trained on the data. Then, for each sample
248+ in the class to be under-sampled, the (K-1) nearest neighbours are
249+ identified. Note that if a 4-KNN algorithm is trained, only 3 neighbours
250+ will be examined, because the sample being inspected is itself among the
251+ neighbours returned by the algorithm. Once the neighbours are identified,
252+ the sample is kept if all (or most of) them agree with its class, and
253+ removed otherwise. The exact selection criteria are described below::
248254
249255 >>> sorted(Counter(y).items())
250256 [(0, 64), (1, 262), (2, 4674)]
@@ -256,10 +262,9 @@ criterion is not fulfilled, the sample is removed::
256262
257263Two selection criteria are currently available: (i) the majority (i.e.,
258264``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
259- nearest-neighbors have to belong to the same class than the sample inspected to
260- keep it in the dataset. Thus, it implies that `kind_sel='all' ` will be less
261- conservative than `kind_sel='mode' `, and more samples will be excluded in
262- the former strategy than the latest::
265+ nearest-neighbors must belong to the same class as the sample inspected
266+ to keep it in the dataset. This means that ``kind_sel='all'`` will be less
267+ conservative than ``kind_sel='mode'``, and more samples will be excluded::
263268
264269 >>> enn = EditedNearestNeighbours(kind_sel="all")
265270 >>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,32 +275,53 @@ the former strategy than the latest::
270275 >>> print(sorted(Counter(y_resampled).items()))
271276 [(0, 64), (1, 234), (2, 4666)]
272277
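To make the selection criteria concrete, below is a minimal illustrative
sketch of the editing rule, assuming the same NumPy arrays ``X`` and ``y``
as in the examples above. It is only an approximation of the library's
behaviour: the real algorithm applies the rule solely to the classes being
under-sampled, and ``kind_sel='mode'`` uses the statistical mode of the
neighbour labels rather than the rough majority vote shown here::

  >>> from sklearn.neighbors import NearestNeighbors
  >>> # train a 4-KNN so that, once each sample itself is dropped,
  >>> # 3 true neighbours remain to vote on that sample
  >>> nn = NearestNeighbors(n_neighbors=4).fit(X)
  >>> idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
  >>> same_label = y[idx] == y[:, None]
  >>> keep_all = same_label.all(axis=1)          # kind_sel='all'
  >>> keep_mode = same_label.mean(axis=1) > 0.5  # rough kind_sel='mode'
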
273- The parameter ``n_neighbors `` allows to give a classifier subclassed from
274- ``KNeighborsMixin `` from scikit-learn to find the nearest neighbors and make
275- the decision to keep a given sample or not.
278+ The parameter ``n_neighbors`` accepts an estimator subclassed from
279+ ``KNeighborsMixin`` from scikit-learn, which is used to find the nearest
280+ neighbors. Alternatively, an integer can be passed to indicate the size
281+ of the neighborhood to examine when making a decision. Note that
282+ ``n_neighbors=3`` means that the edited nearest neighbours method will
283+ look at the 3 closest neighbours of each sample; thus, a 4-KNN algorithm
284+ will be trained on the data.
276285
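For instance, as a sketch reusing ``X`` and ``y`` from above, a
pre-configured ``NearestNeighbors`` estimator from ``sklearn.neighbors``
(which inherits from ``KNeighborsMixin``) can be passed instead of an
integer; note that such an estimator is used as given, so configuring it
with 4 neighbours roughly corresponds to ``n_neighbors=3`` in the integer
form, since the sample itself is among the neighbours returned::

  >>> from sklearn.neighbors import NearestNeighbors
  >>> knn = NearestNeighbors(n_neighbors=4)  # includes the sample itself
  >>> enn = EditedNearestNeighbours(n_neighbors=knn)
  >>> X_resampled, y_resampled = enn.fit_resample(X, y)
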
277286:class:`RepeatedEditedNearestNeighbours` extends
278287:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
279288:cite:`tomek1976experiment`. Generally, repeating the algorithm will delete
280- more data::
289+ more data. The user indicates how many times to repeat the algorithm
290+ through the parameter ``max_iter``::
281291
282292 >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
283293 >>> renn = RepeatedEditedNearestNeighbours()
284294 >>> X_resampled, y_resampled = renn.fit_resample(X, y)
285295 >>> print(sorted(Counter(y_resampled).items()))
286296 [(0, 64), (1, 208), (2, 4551)]
287297
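As a small sketch of this parameter, again reusing ``X`` and ``y``, the
number of passes can be capped explicitly (the exact counts after
resampling will depend on the data and on the number of repetitions)::

  >>> renn = RepeatedEditedNearestNeighbours(max_iter=5)
  >>> X_resampled, y_resampled = renn.fit_resample(X, y)  # at most 5 ENN passes
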
288- :class: `AllKNN ` differs from the previous
289- :class: `RepeatedEditedNearestNeighbours ` since the number of neighbors of the
290- internal nearest neighbors algorithm is increased at each iteration
291- :cite: `tomek1976experiment `::
298+ :class:`AllKNN` extends :class:`EditedNearestNeighbours` by repeating
299+ the algorithm multiple times, each time with an additional neighbour
300+ :cite:`tomek1976experiment`. In other words, :class:`AllKNN` differs
301+ from :class:`RepeatedEditedNearestNeighbours` in that the number of
302+ neighbors of the internal nearest neighbors algorithm increases at
303+ each iteration. In short, in the first iteration, a 2-KNN algorithm
304+ is trained on the data to examine the closest neighbour of each
305+ sample from the class to be under-sampled. In each subsequent
306+ iteration, the neighbourhood examined grows by one, until it reaches
307+ the size indicated by the parameter ``n_neighbors``::
292308
293309 >>> from imblearn.under_sampling import AllKNN
294310 >>> allknn = AllKNN()
295311 >>> X_resampled, y_resampled = allknn.fit_resample(X, y)
296312 >>> print(sorted(Counter(y_resampled).items()))
297313 [(0, 64), (1, 220), (2, 4601)]
298314
315+
316+ The parameter ``n_neighbors`` can take an integer to indicate the size
317+ of the neighborhood examined to make a decision in the last iteration.
318+ Thus, if ``n_neighbors=3``, :class:`AllKNN` will examine the closest
319+ neighbour in the first iteration, the 2 closest neighbours in the
320+ second iteration and the 3 closest neighbours in the third iteration.
321+ The parameter ``n_neighbors`` can also take an estimator subclassed from
322+ ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors;
323+ again, this estimator will be the KNN used in the last iteration.
324+
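For example, as a sketch reusing ``X`` and ``y``, the growing
neighbourhood can be capped at 5, so that, following the description
above, successive iterations examine 1, 2, 3, 4 and finally 5
neighbours::

  >>> allknn = AllKNN(n_neighbors=5)
  >>> X_resampled, y_resampled = allknn.fit_resample(X, y)
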
299325In the example below, it can be seen that the three algorithms have a
300326similar impact, cleaning noisy samples next to the boundaries of the classes.
301327