@@ -198,7 +198,7 @@ Cleaning under-sampling techniques
198198----------------------------------
199199
200200Cleaning under-sampling techniques do not allow specifying the number of
201- samples to have in each class. In fact, each algorithm implement an heuristic
201+ samples to have in each class. In fact, each algorithm implements a heuristic
202202which will clean the dataset.
203203
204204.. _tomek_links:
@@ -240,11 +240,17 @@ figure illustrates this behaviour.
240240Edited data set using nearest neighbours
241241^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
242242
243- :class: `EditedNearestNeighbours ` applies a nearest-neighbors algorithm and
244- "edit" the dataset by removing samples which do not agree "enough" with their
245- neighboorhood :cite: `wilson1972asymptotic `. For each sample in the class to be
246- under-sampled, the nearest-neighbours are computed and if the selection
247- criterion is not fulfilled, the sample is removed::
243+ :class:`EditedNearestNeighbours` trains a nearest-neighbours algorithm
244+ and then looks at the closest neighbours of each data point of the class
245+ to be under-sampled, "editing" the dataset by removing samples which do
246+ not agree "enough" with their neighbourhood :cite:`wilson1972asymptotic`.
247+ In short, a KNN algorithm is trained on the data. Then, for each sample
248+ in the class to be under-sampled, the (K-1) nearest neighbours are
249+ identified. Note that if a 4-KNN algorithm is trained, only 3 neighbours
250+ will be examined, because the sample being inspected is itself among the
251+ neighbours returned by the algorithm. Once the neighbours are identified,
252+ the sample is kept if all (or most of) them agree with its class, and
253+ removed otherwise. The exact selection criteria are described below::
248254
249255 >>> sorted(Counter(y).items())
250256 [(0, 64), (1, 262), (2, 4674)]
@@ -256,10 +262,9 @@ criterion is not fulfilled, the sample is removed::
256262
257263Two selection criteria are currently available: (i) the majority (i.e.,
258264``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
259- nearest-neighbors have to belong to the same class than the sample inspected to
260- keep it in the dataset. Thus, it implies that `kind_sel='all' ` will be less
261- conservative than `kind_sel='mode' `, and more samples will be excluded in
262- the former strategy than the latest::
265+ nearest-neighbors must belong to the same class as the sample inspected
266+ to keep it in the dataset. This means that ``kind_sel='all'`` will be less
267+ conservative than ``kind_sel='mode'``, and more samples will be excluded::
263268
264269 >>> enn = EditedNearestNeighbours(kind_sel="all")
265270 >>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,32 +275,53 @@ the former strategy than the latest::
270275 >>> print(sorted(Counter(y_resampled).items()))
271276 [(0, 64), (1, 234), (2, 4666)]
272277
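To make the selection criteria concrete, below is a minimal illustrative
sketch of the editing rule, assuming the same NumPy arrays ``X`` and ``y``
as in the examples above. It is only an approximation of the library's
behaviour: the real algorithm applies the rule solely to the classes being
under-sampled, and ``kind_sel='mode'`` uses the statistical mode of the
neighbour labels rather than the rough majority vote shown here::

  >>> from sklearn.neighbors import NearestNeighbors
  >>> # train a 4-KNN so that, once each sample itself is dropped,
  >>> # 3 true neighbours remain to vote on that sample
  >>> nn = NearestNeighbors(n_neighbors=4).fit(X)
  >>> idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
  >>> same_label = y[idx] == y[:, None]
  >>> keep_all = same_label.all(axis=1)          # kind_sel='all'
  >>> keep_mode = same_label.mean(axis=1) > 0.5  # rough kind_sel='mode'
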
273- The parameter ``n_neighbors `` allows to give a classifier subclassed from
274- ``KNeighborsMixin `` from scikit-learn to find the nearest neighbors and make
275- the decision to keep a given sample or not.
278+ The parameter ``n_neighbors`` accepts an estimator subclassed from
279+ ``KNeighborsMixin`` from scikit-learn, which is used to find the nearest
280+ neighbors. Alternatively, an integer can be passed to indicate the size
281+ of the neighborhood to examine when making a decision. Note that
282+ ``n_neighbors=3`` means that the edited nearest neighbours method will
283+ look at the 3 closest neighbours of each sample; thus, a 4-KNN algorithm
284+ will be trained on the data.
276285
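For instance, as a sketch reusing ``X`` and ``y`` from above, a
pre-configured ``NearestNeighbors`` estimator from ``sklearn.neighbors``
(which inherits from ``KNeighborsMixin``) can be passed instead of an
integer; note that such an estimator is used as given, so configuring it
with 4 neighbours roughly corresponds to ``n_neighbors=3`` in the integer
form, since the sample itself is among the neighbours returned::

  >>> from sklearn.neighbors import NearestNeighbors
  >>> knn = NearestNeighbors(n_neighbors=4)  # includes the sample itself
  >>> enn = EditedNearestNeighbours(n_neighbors=knn)
  >>> X_resampled, y_resampled = enn.fit_resample(X, y)
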
277286:class:`RepeatedEditedNearestNeighbours` extends
278287:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
279288:cite:`tomek1976experiment`. Generally, repeating the algorithm will delete
280- more data::
289+ more data. The user indicates how many times to repeat the algorithm
290+ through the parameter ``max_iter``::
281291
282292 >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
283293 >>> renn = RepeatedEditedNearestNeighbours()
284294 >>> X_resampled, y_resampled = renn.fit_resample(X, y)
285295 >>> print(sorted(Counter(y_resampled).items()))
286296 [(0, 64), (1, 208), (2, 4551)]
287297
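As a small sketch of this parameter, again reusing ``X`` and ``y``, the
number of passes can be capped explicitly (the exact counts after
resampling will depend on the data and on the number of repetitions)::

  >>> renn = RepeatedEditedNearestNeighbours(max_iter=5)
  >>> X_resampled, y_resampled = renn.fit_resample(X, y)  # at most 5 ENN passes
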
288- :class: `AllKNN ` differs from the previous
289- :class: `RepeatedEditedNearestNeighbours ` since the number of neighbors of the
290- internal nearest neighbors algorithm is increased at each iteration
291- :cite: `tomek1976experiment `::
298+ :class:`AllKNN` extends :class:`EditedNearestNeighbours` by repeating
299+ the algorithm multiple times, each time with an additional neighbour
300+ :cite:`tomek1976experiment`. In other words, :class:`AllKNN` differs
301+ from :class:`RepeatedEditedNearestNeighbours` in that the number of
302+ neighbors of the internal nearest neighbors algorithm increases at
303+ each iteration. In short, in the first iteration, a 2-KNN algorithm
304+ is trained on the data to examine the closest neighbour of each
305+ sample from the class to be under-sampled. In each subsequent
306+ iteration, the neighbourhood examined grows by one, until it reaches
307+ the size indicated by the parameter ``n_neighbors``::
292308
293309 >>> from imblearn.under_sampling import AllKNN
294310 >>> allknn = AllKNN()
295311 >>> X_resampled, y_resampled = allknn.fit_resample(X, y)
296312 >>> print(sorted(Counter(y_resampled).items()))
297313 [(0, 64), (1, 220), (2, 4601)]
298314
315+
316+ The parameter ``n_neighbors`` can take an integer to indicate the size
317+ of the neighborhood examined to make a decision in the last iteration.
318+ Thus, if ``n_neighbors=3``, :class:`AllKNN` will examine the closest
319+ neighbour in the first iteration, the 2 closest neighbours in the
320+ second iteration and the 3 closest neighbours in the third iteration.
321+ The parameter ``n_neighbors`` can also take an estimator subclassed from
322+ ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors;
323+ again, this estimator will be the KNN used in the last iteration.
324+
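For example, as a sketch reusing ``X`` and ``y``, the growing
neighbourhood can be capped at 5, so that, following the description
above, successive iterations examine 1, 2, 3, 4 and finally 5
neighbours::

  >>> allknn = AllKNN(n_neighbors=5)
  >>> X_resampled, y_resampled = allknn.fit_resample(X, y)
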
299325In the example below, it can be seen that the three algorithms have a
300326similar impact, cleaning noisy samples next to the boundaries of the classes.
301327