@@ -204,38 +204,49 @@ affected by noise due to the first step sample selection.
204204Cleaning under-sampling techniques
205205----------------------------------
206206
207- Cleaning under-sampling techniques do not allow to specify the number of
208- samples to have in each class. In fact, each algorithm implement an heuristic
209- which will clean the dataset.
207+ Cleaning under-sampling methods "clean" the feature space by removing
208+ either "noisy" observations or observations that are "too easy to classify", depending
209+ on the method. The final number of observations in each targeted class varies with the
210+ cleaning method and cannot be specified by the user.
210211
211212.. _tomek_links:
212213
213214Tomek's links
214215^^^^^^^^^^^^^
215216
216- :class:`TomekLinks` detects the so-called Tomek's links :cite:`tomek1976two`. A
217- Tomek's link between two samples of different class :math:`x` and :math:`y` is
218- defined such that for any sample :math:`z`:
217+ A Tomek's link exists when two samples from different classes are each other's
218+ nearest neighbors.
219+
220+ Mathematically, a Tomek's link between two samples from different classes :math:`x`
221+ and :math:`y` is defined such that for any other sample :math:`z`:
219222
220223.. math::
221224
222225   d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)
223226
224- where :math:`d(.)` is the distance between the two samples. In some other
225- words, a Tomek's link exist if the two samples are the nearest neighbors of
226- each other. In the figure below, a Tomek's link is illustrated by highlighting
227- the samples of interest in green.
227+ where :math:`d(\cdot, \cdot)` is the distance between two samples.
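
The definition above can be sketched in a few lines of plain Python. This is only an
illustration of the mutual-nearest-neighbor condition, assuming Euclidean distance and a
brute-force search; it is not the implementation used by :class:`TomekLinks`, and the
helper name ``find_tomek_links`` and the toy data are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and carry different class labels."""
    n = len(X)

    def nearest(i):
        # Index of the sample closest to X[i], excluding X[i] itself.
        # Ties are broken by the lowest index, which is a simplification:
        # the strict inequalities of the definition exclude ties.
        return min((j for j in range(n) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(n)]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy data (hypothetical): samples 2 and 3 are each other's nearest
# neighbors and belong to different classes.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (3.0, 3.0)]
y = [0, 0, 0, 1, 0]

links = find_tomek_links(X, y)
print(links)
```

Samples 0 and 1 are also mutual nearest neighbors, but they share a class and therefore
do not form a Tomek's link.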
228+
229+ :class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The
230+ underlying idea is that the samples forming a Tomek's link are noisy or
231+ hard-to-classify observations that would not help the algorithm find a suitable
232+ discrimination boundary.
233+
234+ In the following figure, a Tomek's link between an observation of class :math:`+` and
235+ an observation of class :math:`-` is highlighted in green:
228235
229236.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
230237   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
231238   :scale: 60
232239   :align: center
233240
234- The parameter ``sampling_strategy`` control which sample of the link will be
235- removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
236- remove the sample from the majority class. Both samples from the majority and
237- minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
238- figure illustrates this behaviour.
241+ When :class:`TomekLinks` finds a Tomek's link, it can remove either the sample from
242+ the majority class or both samples. The parameter ``sampling_strategy`` controls which
243+ samples of the link are removed. By default (i.e., ``sampling_strategy='auto'``), only
244+ the sample from the majority class is removed. Both samples, that is, the one from the
245+ majority class and the one from the minority class, can be removed by setting
246+ ``sampling_strategy`` to ``'all'``.
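
The two removal modes can be mimicked with a small pure-Python sketch. It assumes a
binary problem and Euclidean distance, and it is not the code path of
:class:`TomekLinks`; the helpers ``tomek_links`` and ``clean`` and the ``strategy``
argument are hypothetical names that only mirror the ``'auto'`` and ``'all'`` settings
described above.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def tomek_links(X, y):
    """Brute-force search for mutual nearest neighbors of different classes."""
    n = len(X)
    nn = [min((j for j in range(n) if j != i), key=lambda j: dist(X[i], X[j]))
          for i in range(n)]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def clean(X, y, strategy="auto"):
    """Drop samples that belong to a Tomek's link.

    strategy="auto": drop only the majority-class sample of each link.
    strategy="all":  drop both samples of each link.
    """
    majority = Counter(y).most_common(1)[0][0]
    drop = {k
            for link in tomek_links(X, y)
            for k in link
            if strategy == "all" or y[k] == majority}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy data (hypothetical): samples 2 (class 0) and 3 (class 1) form the
# only Tomek's link; class 0 is the majority class.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (3.0, 3.0)]
y = [0, 0, 0, 1, 0]

X_auto, y_auto = clean(X, y, strategy="auto")
X_all, y_all = clean(X, y, strategy="all")
print(y_auto)
print(y_all)
```

With ``strategy="auto"`` only the class-0 sample of the link is dropped, so the minority
sample survives; with ``strategy="all"`` both samples of the link are dropped.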
247+
248+ The following figure illustrates this behaviour: on the left, only the sample from the
249+ majority class is removed, whereas on the right, the entire Tomek's link is removed.
239250
240251.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
241252   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html