@@ -237,14 +237,23 @@ figure illustrates this behaviour.
237237
238238.. _edited_nearest_neighbors:
239239
240- Edited data set using nearest neighbours
241- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
240+ Editing data set using nearest neighbours
241+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
242242
243- :class: `EditedNearestNeighbours ` applies a nearest-neighbors algorithm and
244- "edit" the dataset by removing samples which do not agree "enough" with their
245- neighboorhood :cite: `wilson1972asymptotic `. For each sample in the class to be
246- under-sampled, the nearest-neighbours are computed and if the selection
247- criterion is not fulfilled, the sample is removed::
243+ Edited nearest neighbours
244+ ~~~~~~~~~~~~~~~~~~~~~~~~~
245+
246+ The edited nearest neighbours methodology uses K-Nearest Neighbours (KNN) to identify
247+ the neighbours of the targeted class samples, and then removes observations if any or
248+ most of their neighbours are from a different class :cite:`wilson1972asymptotic`.
249+
250+ :class:`EditedNearestNeighbours` carries out the following steps:
251+
252+ 1. Trains a KNN using the entire dataset (typically a 3-KNN).
253+ 2. Finds each observation's 3 closest neighbours (only for the targeted classes).
254+ 3. Removes an observation if any or most of its neighbours belong to a different class.
255+
256+ Below is the code implementation::
248257
249258 >>> sorted(Counter(y).items())
250259 [(0, 64), (1, 262), (2, 4674)]
@@ -254,12 +263,12 @@ criterion is not fulfilled, the sample is removed::
254263 >>> print(sorted(Counter(y_resampled).items()))
255264 [(0, 64), (1, 213), (2, 4568)]
256265
257- Two selection criteria are currently available: (i) the majority (i.e.,
258- `` kind_sel='mode' ``) or (ii) all (i.e., `` kind_sel='all' ``) the
259- nearest-neighbors have to belong to the same class than the sample inspected to
260- keep it in the dataset. Thus, it implies that ` kind_sel='all' ` will be less
261- conservative than `kind_sel='mode' `, and more samples will be excluded in
262- the former strategy than the latest ::
266+
267+ To paraphrase step 3, :class:`EditedNearestNeighbours` will retain observations from
268+ the targeted classes when **most** or **all** of their neighbours are from the same
269+ class. To control this behaviour we set ``kind_sel='mode'`` or ``kind_sel='all'``,
270+ respectively. Hence, ``kind_sel='all'`` is the stricter criterion and removes more
271+ samples than ``kind_sel='mode'``::
263272
264273 >>> enn = EditedNearestNeighbours(kind_sel="all")
265274 >>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,9 +279,11 @@ the former strategy than the latest::
270279 >>> print(sorted(Counter(y_resampled).items()))
271280 [(0, 64), (1, 234), (2, 4666)]
272281
273- The parameter ``n_neighbors `` allows to give a classifier subclassed from
274- ``KNeighborsMixin `` from scikit-learn to find the nearest neighbors and make
275- the decision to keep a given sample or not.
282+ The parameter ``n_neighbors`` accepts an integer, which sets the number of neighbours
283+ to examine for each sample. It also accepts a classifier subclassed from
284+ ``KNeighborsMixin`` from scikit-learn. Note that when you pass a 3-KNN classifier,
285+ only 2 neighbours are examined for the cleaning, because the sample being examined
286+ for exclusion counts as one of the three.
276287
277288:class:`RepeatedEditedNearestNeighbours` extends
278289:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
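The three steps listed in this change, and the difference between ``kind_sel='all'`` and ``kind_sel='mode'``, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not imbalanced-learn's implementation: `enn_keep_indices` and its arguments are hypothetical names, neighbours are found by brute-force Euclidean distance, and ties in the `'mode'` vote are resolved by `Counter` insertion order, which may differ from the library's rule.

```python
import math
from collections import Counter


def enn_keep_indices(X, y, target_classes, n_neighbors=3, kind_sel="all"):
    """Return the indices kept by a simplified edited-nearest-neighbours rule.

    Hypothetical helper for illustration only; it mirrors the documented
    steps, not the actual imbalanced-learn code.
    """
    if kind_sel not in ("all", "mode"):
        raise ValueError("kind_sel must be 'all' or 'mode'")

    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Samples from non-targeted classes are never removed.
        if yi not in target_classes:
            keep.append(i)
            continue

        # Steps 1-2: rank every other sample by distance to sample i
        # (the sample itself is excluded) and take the closest ones.
        ranked = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbour_labels = [y[j] for _, j in ranked[:n_neighbors]]

        # Step 3: keep the sample only if its neighbourhood agrees with it.
        if kind_sel == "all":
            agrees = all(label == yi for label in neighbour_labels)
        else:  # "mode": the most frequent neighbour label must match
            agrees = Counter(neighbour_labels).most_common(1)[0][0] == yi

        if agrees:
            keep.append(i)
    return keep


# Toy 1-D data: class 1 is the targeted (majority) class, class 0 the minority.
X = [(0,), (10,), (20,), (30,), (500,), (510,), (520,), (505,), (35,)]
y = [1, 1, 1, 1, 0, 0, 0, 1, 0]

kept_all = enn_keep_indices(X, y, target_classes={1}, kind_sel="all")
kept_mode = enn_keep_indices(X, y, target_classes={1}, kind_sel="mode")
print(kept_all)   # [0, 1, 4, 5, 6, 8]
print(kept_mode)  # [0, 1, 2, 3, 4, 5, 6, 8]
```

In this toy data, samples 2 and 3 each have a single minority neighbour, so they survive the majority vote but not the unanimity check. Sample 7 sits inside the minority cluster and is removed under both criteria.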