From fd9a48c6a6caff34f40635f3a06582f3067f58f3 Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Mon, 10 Jul 2023 21:21:46 +0200 Subject: [PATCH 1/5] update RENN and AllKNN --- doc/under_sampling.rst | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index 9f2795430..d6435fd2d 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -274,6 +274,9 @@ The parameter ``n_neighbors`` allows to give a classifier subclassed from ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make the decision to keep a given sample or not. +Repeated Edited Nearest Neighbours +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + :class:`RepeatedEditedNearestNeighbours` extends :class:`EditedNearestNeighbours` by repeating the algorithm multiple times :cite:`tomek1976experiment`. Generally, repeating the algorithm will delete @@ -285,9 +288,24 @@ more data:: >>> print(sorted(Counter(y_resampled).items())) [(0, 64), (1, 208), (2, 4551)] -:class:`AllKNN` differs from the previous -:class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the -internal nearest neighbors algorithm is increased at each iteration +The user can set up the number of times the ENN method should be repeated through the +paramter `max_iter`. + +The repetitions will stop when: + +1. the maximum number of iterations is reached, or +2. no more observations are removed, or +3. one of the majority classes becomes a minority class, or +4. one of the majority classes disappears during the undersampling. + +All KNN +~~~~~~~ + +:class:`AllKNN` is a variation of the +:class:`RepeatedEditedNearestNeighbours` where the number of neighbours evaluated at +each round of ENN increases. It starts by editing based on 1 closest neighbour, and it +incrases the neighbourhood by 1 at each iteration. + :cite:`tomek1976experiment`:: >>> from imblearn.under_sampling import AllKNN @@ -296,8 +314,13 @@ internal nearest neighbors algorithm is increased at each iteration >>> print(sorted(Counter(y_resampled).items())) [(0, 64), (1, 220), (2, 4601)] -In the example below, it can be seen that the three algorithms have similar -impact by cleaning noisy samples next to the boundaries of the classes. +:class:`AllKNN` stops cleaning when the maximum number of neighbours to examine, which +is determined by the user through the parameter ``n_neighbors` is reached, or when the +majority class becomes the minority class. + +In the example below, we see that ENN, RENN and AllKNN have similar impact when +cleaning "noisy" samples at the boundaries between classes. + .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html From c8a8bfaf3de5c7281563ed8bd179d9ea7cf4c6db Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Mon, 10 Jul 2023 21:25:15 +0200 Subject: [PATCH 2/5] final touches --- doc/under_sampling.rst | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index d6435fd2d..e07c2881e 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -304,9 +304,7 @@ All KNN :class:`AllKNN` is a variation of the :class:`RepeatedEditedNearestNeighbours` where the number of neighbours evaluated at each round of ENN increases. It starts by editing based on 1 closest neighbour, and it -incrases the neighbourhood by 1 at each iteration. - -:cite:`tomek1976experiment`:: +increases the neighbourhood by 1 at each iteration :cite:`tomek1976experiment`:: >>> from imblearn.under_sampling import AllKNN >>> allknn = AllKNN() @@ -321,7 +319,6 @@ majority class becomes the minority class. In the example below, we see that ENN, RENN and AllKNN have similar impact when cleaning "noisy" samples at the boundaries between classes. - .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html :scale: 60 From b29f39110b4e22b551cbafba50abb0b86963feaa Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Tue, 11 Jul 2023 11:15:16 +0200 Subject: [PATCH 3/5] expand enn Co-authored-by: Guillaume Lemaitre --- doc/under_sampling.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index e07c2881e..bc4fb3ca9 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -288,7 +288,7 @@ more data:: >>> print(sorted(Counter(y_resampled).items())) [(0, 64), (1, 208), (2, 4551)] -The user can set up the number of times the ENN method should be repeated through the +The user can set up the number of times the edited nearest neighbours method should be repeated through the paramter `max_iter`. The repetitions will stop when: From c8c570a6aeea1ea1e9a13bca6c6f54b4539b936b Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Tue, 11 Jul 2023 11:15:32 +0200 Subject: [PATCH 4/5] remove space Co-authored-by: Guillaume Lemaitre --- doc/under_sampling.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index bc4fb3ca9..e1f49712f 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -313,7 +313,7 @@ increases the neighbourhood by 1 at each iteration :cite:`tomek1976experiment`:: [(0, 64), (1, 220), (2, 4601)] :class:`AllKNN` stops cleaning when the maximum number of neighbours to examine, which -is determined by the user through the parameter ``n_neighbors` is reached, or when the +is determined by the user through the parameter `n_neighbors` is reached, or when the majority class becomes the minority class. In the example below, we see that ENN, RENN and AllKNN have similar impact when From 84dbc121247516536e6c5b3a1edfa1b485f50e5e Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Tue, 11 Jul 2023 11:21:39 +0200 Subject: [PATCH 5/5] add link to classes --- doc/under_sampling.rst | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index e1f49712f..7f0f5bfc1 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -288,8 +288,8 @@ more data:: >>> print(sorted(Counter(y_resampled).items())) [(0, 64), (1, 208), (2, 4551)] -The user can set up the number of times the edited nearest neighbours method should be repeated through the -paramter `max_iter`. +The user can set up the number of times the edited nearest neighbours method should be +repeated through the parameter `max_iter`. The repetitions will stop when: @@ -303,8 +303,9 @@ All KNN :class:`AllKNN` is a variation of the :class:`RepeatedEditedNearestNeighbours` where the number of neighbours evaluated at -each round of ENN increases. It starts by editing based on 1 closest neighbour, and it -increases the neighbourhood by 1 at each iteration :cite:`tomek1976experiment`:: +each round of :class:`EditedNearestNeighbours` increases. It starts by editing based on +1-Nearest Neighbour, and it increases the neighbourhood by 1 at each iteration +:cite:`tomek1976experiment`:: >>> from imblearn.under_sampling import AllKNN >>> allknn = AllKNN() @@ -316,7 +317,8 @@ increases the neighbourhood by 1 at each iteration :cite:`tomek1976experiment`:: is determined by the user through the parameter `n_neighbors` is reached, or when the majority class becomes the minority class. -In the example below, we see that ENN, RENN and AllKNN have similar impact when +In the example below, we see that :class:`EditedNearestNeighbours`, +:class:`RepeatedEditedNearestNeighbours` and :class:`AllKNN` have similar impact when cleaning "noisy" samples at the boundaries between classes. .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png