From 05e2451f24c1e63dc5a64d22306084c5699222bc Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 22:04:48 +0200
Subject: [PATCH 1/8] update user guide CNN and OSS

---
 doc/under_sampling.rst | 56 +++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 17 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9f2795430..bfb6cb039 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -306,20 +306,24 @@ impact by cleaning noisy samples next to the boundaries of the classes.
 
 .. _condensed_nearest_neighbors:
 
-Condensed nearest neighbors and derived algorithms
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Condensed nearest neighbors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 :class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
-iteratively decide if a sample should be removed or not
-:cite:`hart1968condensed`. The algorithm is running as followed:
+iteratively decide if a sample should be removed
+:cite:`hart1968condensed`. The algorithm runs as follows:
 
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
    :math:`C` and all other samples of this class in a set :math:`S`.
-3. Go through the set :math:`S`, sample by sample, and classify each sample
-   using a 1 nearest neighbor rule.
-4. If the sample is misclassified, add it to :math:`C`, otherwise do nothing.
-5. Reiterate on :math:`S` until there is no samples to be added.
+3. Train a 1-KNN on `C`.
+4. Go through the samples in set :math:`S`, sample by sample, and classify each one
+   using a 1 nearest neighbor rule (trained in 3).
+5. If the sample is misclassified, add it to :math:`C`, and go to step 6.
+6. Repeat steps 3 to 5 until all observations in `S` have been examined.
+
+The final dataset is `C`, containing all observations from the minority class and
+those from the majority that were misclassified by the successive 1-KNN algorithms.
 
 The :class:`CondensedNearestNeighbour` can be used in the following manner::
 
@@ -329,14 +333,32 @@ The :class:`CondensedNearestNeighbour` can be used in the following manner::
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 64), (1, 24), (2, 115)]
 
-However as illustrated in the figure below, :class:`CondensedNearestNeighbour`
-is sensitive to noise and will add noisy samples.
+:class:`CondensedNearestNeighbour` is sensitive to noise and may add noisy samples
+(see the figure later on).
+
+One Sided Selection
+~~~~~~~~~~~~~~~~~~~
+
+In an attempt to remove noisy observations, :class:`OneSidedSelection`
+will first find the observations that are hard to classify, and then will use
+:class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.
+:class:`OneSidedSelection` runs as follows:
 
-In the contrary, :class:`OneSidedSelection` will use :class:`TomekLinks` to
-remove noisy samples :cite:`hart1968condensed`. In addition, the 1 nearest
-neighbor rule is applied to all samples and the one which are misclassified
-will be added to the set :math:`C`. No iteration on the set :math:`S` will take
-place. The class can be used as::
+1. Get all minority samples in a set :math:`C`.
+2. Add a sample from the targeted class (class to be under-sampled) in
+   :math:`C` and all other samples of this class in a set :math:`S`.
+3. Train a 1-KNN on `C`.
+4. Using a 1 nearest neighbor rule trained in 3, classify all samples in
+   set :math:`S`.
+5. Add all misclassified samples to :math:`C`.
+6. Remove Tomek Links from :math:`C`.
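+
+These steps can be illustrated with a minimal sketch built on scikit-learn's
+:class:`~sklearn.neighbors.KNeighborsClassifier`. It is not imbalanced-learn's
+actual implementation, and it leaves out the Tomek links removal of step 6::
+
+    import numpy as np
+    from sklearn.neighbors import KNeighborsClassifier
+
+    def one_sided_selection_sketch(X_minority, X_majority, seed=0):
+        rng = np.random.RandomState(seed)
+        # Steps 1-2: C holds the minority class plus one majority sample
+        # picked at random; S holds the remaining majority samples.
+        i = rng.randint(len(X_majority))
+        C_X = np.vstack([X_minority, X_majority[i:i + 1]])
+        C_y = np.hstack([np.zeros(len(X_minority)), [1]])
+        S = np.delete(X_majority, i, axis=0)
+        # Steps 3-4: train a single 1-KNN on C and classify S in one pass.
+        knn = KNeighborsClassifier(n_neighbors=1).fit(C_X, C_y)
+        # Step 5: majority samples predicted as minority were misclassified;
+        # they are kept together with C (step 6 would remove Tomek links).
+        return np.vstack([C_X, S[knn.predict(S) == 0]])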
+
+The final dataset is `C`, containing all observations from the minority class,
+plus the observations from the majority that were added at random, plus all
+those from the majority that were misclassified by the 1-KNN algorithms. Note
+that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+does not train a KNN after each sample is missclassified. It uses the one KNN
+to classify all samples from the majority in 1 pass. The class can be used as::
 
     >>> from imblearn.under_sampling import OneSidedSelection
     >>> oss = OneSidedSelection(random_state=0)
     >>> X_resampled, y_resampled = oss.fit_resample(X, y)
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 64), (1, 174), (2, 4404)]
 
-Our implementation offer to set the number of seeds to put in the set :math:`C`
-originally by setting the parameter ``n_seeds_S``.
+Our implementation offers the possibility to set the number of observations
+added at random to the set :math:`C` through the parameter ``n_seeds_S``.
 
 :class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
 condensing them :cite:`laurikkala2001improving`. Therefore, it will used the

From eb2ec39f9b7484a3c0d43fe2a2242dd1744ab2e1 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 22:09:13 +0200
Subject: [PATCH 2/8] final touches

---
 doc/under_sampling.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index bfb6cb039..5168e1491 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -339,7 +339,8 @@
 One Sided Selection
 ~~~~~~~~~~~~~~~~~~~
 
-In an attempt to remove noisy observations, :class:`OneSidedSelection`
+In an attempt to remove the noisy observations introduced by
+:class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
 will first find the observations that are hard to classify, and then will use
 :class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.
 :class:`OneSidedSelection` runs as follows:

From 9816834e819a05eccbc4ff457f954bcf7d7c8f69 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 10:53:46 +0200
Subject: [PATCH 3/8] expand knn name

Co-authored-by: Guillaume Lemaitre

---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 5168e1491..c4565f0cf 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -348,7 +348,7 @@ will first find the observations that are hard to classify, and then will use
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
    :math:`C` and all other samples of this class in a set :math:`S`.
-3. Train a 1-KNN on `C`.
+3. Train a 1-Nearest Neighbors on `C`.
 4. Using a 1 nearest neighbor rule trained in 3, classify all samples in
    set :math:`S`.
 5. Add all misclassified samples to :math:`C`.

From aa80fd7c6edf22a558d4330a4bca141549f48976 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 10:54:16 +0200
Subject: [PATCH 4/8] add missing math instruction

Co-authored-by: Guillaume Lemaitre

---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index c4565f0cf..4d4e70b04 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -354,7 +354,7 @@ will first find the observations that are hard to classify, and then will use
 5. Add all misclassified samples to :math:`C`.
 6. Remove Tomek Links from :math:`C`.
 
-The final dataset is `C`, containing all observations from the minority class,
+The final dataset is :math:`C`, containing all observations from the minority class,
 plus the observations from the majority that were added at random, plus all
 those from the majority that were misclassified by the 1-KNN algorithms. Note

From e361d85bb3dbf00718e41a535577f4d54de28540 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 10:54:35 +0200
Subject: [PATCH 5/8] expand knn name

Co-authored-by: Guillaume Lemaitre

---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 4d4e70b04..acdf08880 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -356,7 +356,7 @@ will first find the observations that are hard to classify, and then will use
 The final dataset is :math:`C`, containing all observations from the minority class,
 plus the observations from the majority that were added at random, plus all
-those from the majority that were misclassified by the 1-KNN algorithms. Note
+those from the majority that were misclassified by the 1-Nearest Neighbors algorithms. Note
 that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
 does not train a KNN after each sample is missclassified. It uses the one KNN
 to classify all samples from the majority in 1 pass. The class can be used as::

From ad739f38b3c41b7b1450b0749f77b9c2fb74709d Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 10:55:06 +0200
Subject: [PATCH 6/8] expand knn name and fix typo

Co-authored-by: Guillaume Lemaitre

---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index acdf08880..a1b3ce48e 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -358,7 +358,7 @@ will first find the observations that are hard to classify, and then will use
 The final dataset is :math:`C`, containing all observations from the minority class,
 plus the observations from the majority that were added at random, plus all
 those from the majority that were misclassified by the 1-Nearest Neighbors algorithms. Note
 that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
-does not train a KNN after each sample is missclassified. It uses the one KNN
+does not train a K-Nearest Neighbors after each sample is misclassified. It uses the one K-Nearest Neighbors
 to classify all samples from the majority in 1 pass. The class can be used as::

From 1c7e04e4485e6e4eafeb1a317ec602e404d98459 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 11:05:25 +0200
Subject: [PATCH 7/8] expanded knn to full name and added missing :math:

---
 doc/under_sampling.rst | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index a1b3ce48e..9d717ccba 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -316,14 +316,15 @@ iteratively decide if a sample should be removed
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
    :math:`C` and all other samples of this class in a set :math:`S`.
-3. Train a 1-KNN on `C`.
+3. Train a 1-Nearest Neighbour on :math:`C`.
 4. Go through the samples in set :math:`S`, sample by sample, and classify each one
    using a 1 nearest neighbor rule (trained in 3).
 5. If the sample is misclassified, add it to :math:`C`, and go to step 6.
-6. Repeat steps 3 to 5 until all observations in `S` have been examined.
+6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.
 
-The final dataset is `C`, containing all observations from the minority class and
-those from the majority that were misclassified by the successive 1-KNN algorithms.
+The final dataset is :math:`C`, containing all observations from the minority class and
+those from the majority that were misclassified by the successive
+1-Nearest Neighbour algorithms.
 
 The :class:`CondensedNearestNeighbour` can be used in the following manner::
@@ -348,7 +349,7 @@ will first find the observations that are hard to classify, and then will use
 1. Get all minority samples in a set :math:`C`.
 2. Add a sample from the targeted class (class to be under-sampled) in
    :math:`C` and all other samples of this class in a set :math:`S`.
-3. Train a 1-Nearest Neighbors on `C`.
+3. Train a 1-Nearest Neighbors on :math:`C`.
 4. Using a 1 nearest neighbor rule trained in 3, classify all samples in
    set :math:`S`.
 5. Add all misclassified samples to :math:`C`.
 6. Remove Tomek Links from :math:`C`.
 
 The final dataset is :math:`C`, containing all observations from the minority class,
 plus the observations from the majority that were added at random, plus all
-those from the majority that were misclassified by the 1-Nearest Neighbors algorithms. Note
-that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
-does not train a K-Nearest Neighbors after each sample is misclassified. It uses the one K-Nearest Neighbors
-to classify all samples from the majority in 1 pass. The class can be used as::
+those from the majority that were misclassified by the 1-Nearest Neighbors algorithms.
+Note that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
+does not train a K-Nearest Neighbors after each sample is misclassified. It uses the
+1-Nearest Neighbors from step 3 to classify all samples from the majority in 1 pass.
+The class can be used as::

From 3a787f450d0ecced3a67fab45ceaecca259b4154 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 11:06:51 +0200
Subject: [PATCH 8/8] split paragraph

---
 doc/under_sampling.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9d717ccba..fd9f43c0e 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -358,6 +358,7 @@ will first find the observations that are hard to classify, and then will use
 The final dataset is :math:`C`, containing all observations from the minority class,
 plus the observations from the majority that were added at random, plus all
 those from the majority that were misclassified by the 1-Nearest Neighbors algorithms.
+
 Note that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
 does not train a K-Nearest Neighbors after each sample is misclassified. It uses the
 1-Nearest Neighbors from step 3 to classify all samples from the majority in 1 pass.
 The class can be used as::
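
In contrast with the one-pass behaviour of :class:`OneSidedSelection`, the
condensed nearest neighbours loop described in patches 1 and 7 retrains the
classifier after each addition. Here is a minimal sketch of that loop, again
assuming scikit-learn's ``KNeighborsClassifier`` (binary case, for
illustration only; not imbalanced-learn's actual implementation)::

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def condensed_nn_sketch(X_minority, X_majority, seed=0):
        rng = np.random.RandomState(seed)
        # Steps 1-2: C starts as the minority class plus one majority
        # sample picked at random; the rest of the majority forms S.
        i = rng.randint(len(X_majority))
        C_X = np.vstack([X_minority, X_majority[i:i + 1]])
        C_y = np.hstack([np.zeros(len(X_minority)), [1]])
        for x in np.delete(X_majority, i, axis=0):
            # Step 3: retrain a 1-Nearest Neighbour on the current C.
            knn = KNeighborsClassifier(n_neighbors=1).fit(C_X, C_y)
            # Steps 4-6: classify the next sample of S and add it to C
            # when it is misclassified as belonging to the minority class.
            if knn.predict(x.reshape(1, -1))[0] == 0:
                C_X = np.vstack([C_X, x.reshape(1, -1)])
                C_y = np.hstack([C_y, [1]])
        return C_X  # the final dataset C

The retraining inside the loop is the key difference from the one-pass
:class:`OneSidedSelection` sketch shown in patch 1.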