@@ -306,20 +306,25 @@ impact by cleaning noisy samples next to the boundaries of the classes.
 
 .. _condensed_nearest_neighbors:
 
-Condensed nearest neighbors and derived algorithms
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Condensed nearest neighbors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 :class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
-iteratively decide if a sample should be removed or not
-:cite:`hart1968condensed`. The algorithm is running as followed:
+iteratively decide if a sample should be removed
+:cite:`hart1968condensed`. The algorithm runs as follows:
 
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
    :math:`C` and all other samples of this class in a set :math:`S`.
-3. Go through the set :math:`S`, sample by sample, and classify each sample
-   using a 1 nearest neighbor rule.
-4. If the sample is misclassified, add it to :math:`C`, otherwise do nothing.
-5. Reiterate on :math:`S` until there is no samples to be added.
+3. Train a 1-Nearest Neighbour rule on :math:`C`.
+4. Go through the samples in set :math:`S`, sample by sample, and classify
+   each one using the 1 nearest neighbor rule trained in step 3.
+5. If the sample is misclassified, add it to :math:`C`.
+6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.
+
+The final dataset is :math:`C`, containing all observations from the minority
+class and those from the majority that were misclassified by the successive
+1-Nearest Neighbour rules.
 
 The :class:`CondensedNearestNeighbour` can be used in the following manner::
 
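The iterative condensation described in steps 1 to 6 can be sketched in plain Python on a toy 1-D dataset. The data and the `nn_classify` helper are illustrative assumptions for this sketch, not imbalanced-learn's implementation:

```python
# Minimal sketch of the Condensed Nearest Neighbour loop, assuming toy 1-D
# data where label 0 is the minority class and label 1 is the majority
# (the class to be under-sampled).

def nn_classify(store, x):
    """Classify x with a 1 nearest neighbour rule over the labelled `store`."""
    nearest = min(store, key=lambda s: abs(s[0] - x))
    return nearest[1]

def condense(minority, majority):
    # Steps 1-2: C starts with all minority samples plus one majority seed;
    # the remaining majority samples form S.
    C = [(x, 0) for x in minority] + [(majority[0], 1)]
    S = majority[1:]
    # Steps 3-6: classify each sample of S with a 1-NN rule over the current
    # C; a misclassified sample joins C, so later classifications effectively
    # use a retrained rule ("successive 1-Nearest Neighbour rules").
    for x in S:
        if nn_classify(C, x) != 1:
            C.append((x, 1))
    return C

minority = [0.0, 0.2, 0.4]
majority = [1.0, 1.1, 1.2, 0.5, 5.0]
C = condense(minority, majority)
print(len(C))  # prints 5: 3 minority + 1 seed + 1 misclassified sample (0.5)
```

Only the majority sample at 0.5, which sits closer to the minority cluster than to its own class, is misclassified and therefore kept in :math:`C`.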
@@ -329,23 +334,44 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 24), (2, 115)]
 
-However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
-is sensitive to noise and will add noisy samples.
+:class:`CondensedNearestNeighbour` is sensitive to noise and may add noisy
+samples (see the figure later in this section).
+
+One Sided Selection
+~~~~~~~~~~~~~~~~~~~
+
+In an attempt to remove the noisy observations introduced by
+:class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+will first find the observations that are hard to classify, and then will use
+:class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.
+:class:`OneSidedSelection` runs as follows:
+
+1. Get all minority samples in a set :math:`C`.
+2. Add a sample from the targeted class (class to be under-sampled) in
+   :math:`C` and all other samples of this class in a set :math:`S`.
+3. Train a 1-Nearest Neighbours rule on :math:`C`.
+4. Using the 1 nearest neighbor rule trained in step 3, classify all samples
+   in set :math:`S`.
+5. Add all misclassified samples to :math:`C`.
+6. Remove Tomek links from :math:`C`.
+
+The final dataset is :math:`C` (once the Tomek links have been removed),
+containing all minority observations, the majority observations added at random
+as seeds, and the majority observations misclassified by the 1-Nearest Neighbours rule.
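Step 6's cleaning can be illustrated with a small sketch: a Tomek link is a pair of samples from opposite classes that are each other's 1 nearest neighbour. The toy 1-D data and helper names below are illustrative assumptions:

```python
# Toy sketch of Tomek-link detection on 1-D labelled points (x, label),
# where label 0 is the minority class and label 1 the majority.

def nearest(points, i):
    """Index of the 1 nearest neighbour of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: abs(points[j][0] - points[i][0]))

def tomek_links(points):
    links = []
    for i, (_, yi) in enumerate(points):
        j = nearest(points, i)
        # Mutual nearest neighbours with different labels form a Tomek link;
        # i < j avoids reporting each pair twice.
        if nearest(points, j) == i and points[j][1] != yi and i < j:
            links.append((i, j))
    return links

C = [(0.0, 0), (0.2, 0), (0.45, 0), (0.5, 1), (1.0, 1), (1.1, 1)]
print(tomek_links(C))  # prints [(2, 3)]: the boundary pair 0.45/0.5
```

Only the pair straddling the class boundary (0.45 and 0.5) is a Tomek link; removing its majority member (or both, depending on the cleaning strategy) sharpens the boundary.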
 
-In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
-remove noisy samples :cite:`hart1968condensed`. In addition, the 1 nearest
-neighbor rule is applied to all samples and the one which are misclassified
-will be added to the set :math:`C`. No iteration on the set :math:`S` will take
-place. The class can be used as::
+Note that, differently from :class:`CondensedNearestNeighbour`,
+:class:`OneSidedSelection` does not retrain a nearest neighbours rule after each
+misclassified sample: it uses the 1-Nearest Neighbours rule from step 3 to
+classify all majority samples in one pass. The class can be used as::
 
    >>> from imblearn.under_sampling import OneSidedSelection
    >>> oss = OneSidedSelection(random_state=0)
    >>> X_resampled, y_resampled = oss.fit_resample(X, y)
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 174), (2, 4404)]
 
-Our implementation offer to set the number of seeds to put in the set :math:`C`
-originally by setting the parameter ``n_seeds_S``.
+Our implementation offers the possibility to set the number of observations
+to put at random in the set :math:`C` through the parameter ``n_seeds_S``.
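The single-pass behaviour, with a seed count mirroring ``n_seeds_S``, can be sketched in plain Python on the same kind of toy 1-D data as before (names and data are illustrative, not the library's code; Tomek-link removal is left out for brevity):

```python
# Sketch of One Sided Selection's single classification pass, assuming toy
# 1-D data where label 0 is the minority class and label 1 the majority.

def nn_classify(store, x):
    """Classify x with a 1 nearest neighbour rule over the labelled `store`."""
    nearest = min(store, key=lambda s: abs(s[0] - x))
    return nearest[1]

def one_sided_selection(minority, majority, n_seeds=1):
    # Steps 1-2: C = all minority samples plus n_seeds majority samples
    # (taken from the front here; the library picks them at random).
    C = [(x, 0) for x in minority] + [(x, 1) for x in majority[:n_seeds]]
    S = majority[n_seeds:]
    # Steps 4-5: classify every sample of S with the *fixed* rule trained on
    # the initial C. Unlike CondensedNearestNeighbour, C does not grow
    # between classifications.
    misclassified = [(x, 1) for x in S if nn_classify(C, x) != 1]
    return C + misclassified  # step 6 (Tomek-link removal) would follow

minority = [0.0, 0.2, 0.4]
majority = [1.0, 1.1, 1.2, 0.5, 5.0]
kept = one_sided_selection(minority, majority)
print(len(kept))  # prints 5
```

With one seed, only the majority sample at 0.5 is misclassified by the fixed rule, so five observations are retained before the Tomek-link cleaning step.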
 
 :class:`NeighbourhoodCleaningRule` will focus more on cleaning the data than
 on condensing them :cite:`laurikkala2001improving`. Therefore, it will use the