From 360a8ee5ad3763491a3ea4842aace6d30680a3a4 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 20:25:55 +0200
Subject: [PATCH 1/6] update tomeklinks docs

---
 doc/under_sampling.rst | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9f2795430..9edcefecc 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -197,38 +197,49 @@ affected by noise due to the first step sample selection.
 Cleaning under-sampling techniques
 ----------------------------------
 
-Cleaning under-sampling techniques do not allow to specify the number of
-samples to have in each class. In fact, each algorithm implement an heuristic
-which will clean the dataset.
+cleaning under-sampling methods "clean" the feature space by removing
+either "noisy" or observations that are "too easy to classify", depending on the
+method. The final number of observations in each targeted class varies with the
+cleaning method and can't be specified by the user.
 
 .. _tomek_links:
 
 Tomek's links
 ^^^^^^^^^^^^^
 
-:class:`TomekLinks` detects the so-called Tomek's links :cite:`tomek1976two`. A
-Tomek's link between two samples of different class :math:`x` and :math:`y` is
-defined such that for any sample :math:`z`:
+A Tomek's link exists when two samples from different classes are closest neighbors to
+each other.
+
+Mathematically, a Tomek's link between two samples from different classes :math:`x`
+and :math:`y` is defined such that for any sample :math:`z`:
 
 .. math::
 
    d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)
 
-where :math:`d(.)` is the distance between the two samples. In some other
-words, a Tomek's link exist if the two samples are the nearest neighbors of
-each other. In the figure below, a Tomek's link is illustrated by highlighting
-the samples of interest in green.
+where :math:`d(.)` is the distance between the two samples.
+
+:class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The
+underlying idea is that Tomek's links are noisy or hard to classify observations and
+would not help the algorithm find a suitable discrimination boundary.
+
+In the following figure, a Tomek's link between an observation of class :math:`+` and
+class :math:`-`is highlighted in green:
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
    :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
    :scale: 60
    :align: center
 
-The parameter ``sampling_strategy`` control which sample of the link will be
-removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
-remove the sample from the majority class. Both samples from the majority and
-minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
-figure illustrates this behaviour.
+When :class:`TomekLinks` finds a Tomek's link, it can either remove the sample of the
+majority class, or both. The parameter ``sampling_strategy`` controls which samples
+from the link will be removed. By default (i.e., ``sampling_strategy='auto'``), it will
+remove the sample from the majority class. Both samples, that is that from the majority
+and the one from the minority class, can be removed by setting ``sampling_strategy`` to
+``'all'``.
+
+The following figure illustrates this behaviour: on the left, only the sample from the
+majority class is removed, whereas on the right, the entire Tomek's link is removed.
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
    :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html

From eab58c2c0bfe83e6a9334be2b09f0bc9659b09bc Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 11:31:34 +0200
Subject: [PATCH 2/6] add cap

Co-authored-by: Guillaume Lemaitre
---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9edcefecc..1085be229 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -197,7 +197,7 @@ affected by noise due to the first step sample selection.
 Cleaning under-sampling techniques
 ----------------------------------
 
-cleaning under-sampling methods "clean" the feature space by removing
+Cleaning under-sampling methods "clean" the feature space by removing
 either "noisy" or observations that are "too easy to classify", depending on the
 method. The final number of observations in each targeted class varies with the
 cleaning method and can't be specified by the user.

From e321d6414916bb58e8c12602f4697e077ffac30e Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 11:32:06 +0200
Subject: [PATCH 3/6] re-word

Co-authored-by: Guillaume Lemaitre
---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 1085be229..36feb6ec9 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -198,7 +198,7 @@ Cleaning under-sampling techniques
 ----------------------------------
 
 Cleaning under-sampling methods "clean" the feature space by removing
-either "noisy" or observations that are "too easy to classify", depending on the
+either "noisy" observations or observations that are "too easy to classify", depending on the
 method. The final number of observations in each targeted class varies with the
 cleaning method and can't be specified by the user.
 

From 1f7245eeff8e715a286fcf983bd2d628144c1c35 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 11:32:24 +0200
Subject: [PATCH 4/6] reword

Co-authored-by: Guillaume Lemaitre
---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 36feb6ec9..736279f62 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -200,7 +200,7 @@ Cleaning under-sampling techniques
 Cleaning under-sampling methods "clean" the feature space by removing
 either "noisy" observations or observations that are "too easy to classify", depending on the
 method. The final number of observations in each targeted class varies with the
-cleaning method and can't be specified by the user.
+cleaning method and cannot be specified by the user.
 
 .. _tomek_links:
 

From a3ee9509731f6aa86b0f85d4f898e6003900a1f9 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 11:32:38 +0200
Subject: [PATCH 5/6] add space

Co-authored-by: Guillaume Lemaitre
---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 736279f62..007b5316e 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -224,7 +224,7 @@ underlying idea is that Tomek's links are noisy or hard to classify observations
 would not help the algorithm find a suitable discrimination boundary.
 
 In the following figure, a Tomek's link between an observation of class :math:`+` and
-class :math:`-`is highlighted in green:
+class :math:`-` is highlighted in green:
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
    :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html

From c25c0b681be7a5e7494403be300c88c1b6c567d7 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 11:35:12 +0200
Subject: [PATCH 6/6] shorten line

---
 doc/under_sampling.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 007b5316e..78f43ad8b 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -198,8 +198,8 @@ Cleaning under-sampling techniques
 ----------------------------------
 
 Cleaning under-sampling methods "clean" the feature space by removing
-either "noisy" observations or observations that are "too easy to classify", depending on the
-method. The final number of observations in each targeted class varies with the
+either "noisy" observations or observations that are "too easy to classify", depending
+on the method. The final number of observations in each targeted class varies with the
 cleaning method and cannot be specified by the user.
 
 .. _tomek_links:
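
Reviewer note on the series: the Tomek's link condition documented in patch 1, ``d(x, y) < d(x, z)`` and ``d(x, y) < d(y, z)`` for any other sample ``z``, can be checked by hand on a toy dataset. The following is a minimal brute-force sketch, assuming Euclidean distance and a binary problem. The names ``find_tomek_links`` and ``drop_linked_samples`` are hypothetical helpers written for this note; they are not part of imbalanced-learn and do not reflect how :class:`TomekLinks` is implemented. The ``remove_both`` flag only mimics the documented difference between ``sampling_strategy='auto'`` and ``'all'`` in the two-class case.

```python
from collections import Counter
from math import dist


def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that satisfy the Tomek's link condition.

    Hypothetical helper: brute force over all pairs, Euclidean distance assumed.
    """
    n = len(X)
    links = []
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue  # a link needs two samples from different classes
            d_ij = dist(X[i], X[j])
            # d(x, y) < d(x, z) and d(x, y) < d(y, z) for every other sample z
            if all(
                d_ij < dist(X[i], X[k]) and d_ij < dist(X[j], X[k])
                for k in range(n)
                if k != i and k != j
            ):
                links.append((i, j))
    return links


def drop_linked_samples(y, links, remove_both=False):
    """Return the indices kept after cleaning the links.

    remove_both=False drops only the majority-class member of each link
    (an approximation of sampling_strategy='auto' for two classes);
    remove_both=True drops both members (as with sampling_strategy='all').
    """
    majority = Counter(y).most_common(1)[0][0]
    dropped = set()
    for i, j in links:
        for idx in (i, j):
            if remove_both or y[idx] == majority:
                dropped.add(idx)
    return [idx for idx in range(len(y)) if idx not in dropped]


# Toy data: samples 0 and 1 are mutual nearest neighbors from different classes.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (3.0, 3.0)]
y = [0, 1, 0, 0, 1]

links = find_tomek_links(X, y)
kept_majority_only = drop_linked_samples(y, links)
kept_all = drop_linked_samples(y, links, remove_both=True)
print(links, kept_majority_only, kept_all)
```

On this toy data the only link is the pair of samples 0 and 1. Dropping the majority-class member removes sample 0, while dropping both members removes samples 0 and 1, which matches the left and right panels described in the reworded paragraph.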