@@ -255,14 +255,23 @@ majority class is removed, whereas on the right, the entire Tomek's link is remo
255255
256256.. _edited_nearest_neighbors:
257257
258- Edited data set using nearest neighbours
259- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
258+ Editing data using nearest neighbours
259+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
260260
261- :class: `EditedNearestNeighbours ` applies a nearest-neighbors algorithm and
262- "edit" the dataset by removing samples which do not agree "enough" with their
263- neighboorhood :cite: `wilson1972asymptotic `. For each sample in the class to be
264- under-sampled, the nearest-neighbours are computed and if the selection
265- criterion is not fulfilled, the sample is removed::
261+ Edited nearest neighbours
262+ ~~~~~~~~~~~~~~~~~~~~~~~~~
263+
264+ The edited nearest neighbours method uses K-Nearest Neighbours to identify the
265+ neighbours of the targeted class samples, and then removes observations if any or most
266+ of their neighbours are from a different class :cite:`wilson1972asymptotic`.
267+
268+ :class:`EditedNearestNeighbours` carries out the following steps:
269+
270+ 1. Train a K-Nearest Neighbours model using the entire dataset.
271+ 2. Find each observation's K closest neighbours (only for the targeted classes).
272+ 3. Remove an observation if any or most of its neighbours belong to a different class.
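
The three steps above can be illustrated with a minimal, pure-Python sketch. This is not imbalanced-learn's implementation: the function name ``edited_nearest_neighbours``, its ``target_classes`` argument, the brute-force neighbour search, and the toy one-dimensional data are all assumptions made for illustration only.

```python
from collections import Counter
import math


def edited_nearest_neighbours(X, y, target_classes, n_neighbors=3, kind_sel="all"):
    """Toy ENN with brute-force search (hypothetical helper, not the library's code)."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi not in target_classes:
            keep.append(i)  # samples from non-targeted classes are never removed
            continue
        # Steps 1-2: rank every *other* sample by Euclidean distance to xi.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        neighbour_labels = [y[j] for j in others[:n_neighbors]]
        # Step 3: apply the selection criterion.
        if kind_sel == "all":
            agrees = all(label == yi for label in neighbour_labels)
        else:  # "mode": the most frequent neighbour label must match
            agrees = Counter(neighbour_labels).most_common(1)[0][0] == yi
        if agrees:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: class 1 is the targeted (larger) class, class 0 is left untouched.
X = [[0.0], [1.0], [2.1], [3.3], [4.6], [9.0], [6.0], [8.5], [9.8]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0]

_, y_all = edited_nearest_neighbours(X, y, target_classes={1}, kind_sel="all")
_, y_mode = edited_nearest_neighbours(X, y, target_classes={1}, kind_sel="mode")
print(sorted(Counter(y_all).items()))   # [(0, 3), (1, 4)]
print(sorted(Counter(y_mode).items()))  # [(0, 3), (1, 5)]
```

In this toy data, the sample at ``4.6`` has one neighbour from class 0, so it is removed only when every neighbour must agree, while the sample at ``9.0`` is surrounded by class 0 and is removed under both criteria.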
273+
274+ Below is the code implementation::
266275
267276 >>> sorted(Counter(y).items())
268277 [(0, 64), (1, 262), (2, 4674)]
@@ -272,12 +281,12 @@ criterion is not fulfilled, the sample is removed::
272281 >>> print(sorted(Counter(y_resampled).items()))
273282 [(0, 64), (1, 213), (2, 4568)]
274283
275- Two selection criteria are currently available: (i) the majority (i.e.,
276- `` kind_sel='mode' ``) or (ii) all (i.e., `` kind_sel='all' ``) the
277- nearest-neighbors have to belong to the same class than the sample inspected to
278- keep it in the dataset. Thus, it implies that ` kind_sel='all' ` will be less
279- conservative than `kind_sel='mode' `, and more samples will be excluded in
280- the former strategy than the latest ::
284+
285+ To paraphrase step 3, :class:`EditedNearestNeighbours` will retain observations from
286+ the majority class when **most** or **all** of their neighbours are from the same class.
287+ To control this behaviour, set ``kind_sel='mode'`` or ``kind_sel='all'``,
288+ respectively. Hence, ``kind_sel='all'`` is less conservative than ``kind_sel='mode'``,
289+ resulting in the removal of more samples::
281290
282291 >>> enn = EditedNearestNeighbours(kind_sel="all")
283292 >>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -288,9 +297,12 @@ the former strategy than the latest::
288297 >>> print(sorted(Counter(y_resampled).items()))
289298 [(0, 64), (1, 234), (2, 4666)]
290299
291- The parameter ``n_neighbors `` allows to give a classifier subclassed from
292- ``KNeighborsMixin `` from scikit-learn to find the nearest neighbors and make
293- the decision to keep a given sample or not.
300+ The parameter ``n_neighbors`` accepts an integer, which sets the number of
301+ neighbours to examine for each sample. It can also take a classifier subclassed from
302+ ``KNeighborsMixin`` from scikit-learn. When passing a classifier, note that if you
303+ pass a 3-Nearest Neighbours classifier, only 2 neighbours will be examined for the cleaning, as the
304+ third sample is the one being examined for undersampling, since it is part of the
305+ samples provided at `fit`.
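
The reason a 3-neighbour estimator yields only 2 usable neighbours can be shown with a small pure-Python sketch. The helper ``k_nearest`` and the toy data are hypothetical and stand in for a neighbours query against the same samples that were provided at fit time; this is not the library's code.

```python
import math


def k_nearest(query, fitted, k):
    """Indices of the k fitted points closest to `query` (brute force, hypothetical helper)."""
    return sorted(range(len(fitted)), key=lambda j: math.dist(query, fitted[j]))[:k]


X = [[0.0], [1.0], [2.1], [3.3]]

# Query with a sample that is itself part of the fitted data.
query_index = 1
idx = k_nearest(X[query_index], X, k=3)

# The closest "neighbour" is the query itself, at distance zero.
others = [j for j in idx if j != query_index]
print(idx)     # [1, 0, 2]
print(others)  # [0, 2]
```

By the same reasoning, an estimator configured with 4 neighbours would presumably be needed to examine 3 genuine neighbours per sample.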
294306
295307Repeated Edited Nearest Neighbours
296308~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~