
Commit 3444430

MNT move ROSE into RandomOverSampler with additional parameters (#791)
1 parent 9b666a0

File tree

15 files changed: +479 -429 lines changed

azure-pipelines.yml

Lines changed: 14 additions & 9 deletions

@@ -16,6 +16,20 @@ jobs:
         ./build_tools/circle/linting.sh
       displayName: Run linting
 
+- template: build_tools/azure/posix.yml
+  parameters:
+    name: Linux_Runs
+    vmImage: ubuntu-18.04
+    matrix:
+      pylatest_pip_openblas_pandas:
+        DISTRIB: 'conda-pip-latest'
+        PYTHON_VERSION: '3.9'
+        COVERAGE: 'true'
+        PANDAS_VERSION: '*'
+        TEST_DOCSTRINGS: 'true'
+        JOBLIB_VERSION: '*'
+        CHECK_WARNINGS: 'true'
+
 - template: build_tools/azure/posix.yml
   parameters:
     name: Linux
@@ -29,15 +43,6 @@ jobs:
         DISTRIB: 'ubuntu'
        PYTHON_VERSION: '3.6'
         JOBLIB_VERSION: '*'
-      # Linux environment to test the latest available dependencies and MKL.
-      pylatest_pip_openblas_pandas:
-        DISTRIB: 'conda-pip-latest'
-        PYTHON_VERSION: '3.9'
-        COVERAGE: 'true'
-        PANDAS_VERSION: '*'
-        TEST_DOCSTRINGS: 'true'
-        JOBLIB_VERSION: '*'
-        CHECK_WARNINGS: 'true'
       pylatest_conda_pandas_keras:
         DISTRIB: 'conda'
         PYTHON_VERSION: '3.7'
build_tools/circle/linting.sh

Lines changed: 1 addition & 1 deletion

@@ -140,7 +140,7 @@ else
 
     check_files "$(echo "$MODIFIED_FILES" | grep -v ^examples)"
     check_files "$(echo "$MODIFIED_FILES" | grep ^examples)" \
-        --config ./examples/.flake8
+        --config ./setup.cfg
 fi
 echo -e "No problem detected by flake8\n"
doc/api.rst

Lines changed: 0 additions & 1 deletion

@@ -76,7 +76,6 @@ Prototype selection
    over_sampling.SMOTE
    over_sampling.SMOTENC
    over_sampling.SVMSMOTE
-   over_sampling.ROSE
 
 
 .. _combine_ref:

doc/over_sampling.rst

Lines changed: 19 additions & 21 deletions

@@ -80,6 +80,19 @@ It would also work with pandas dataframe::
    >>> df_resampled, y_resampled = ros.fit_resample(df_adult, y_adult)
    >>> df_resampled.head()  # doctest: +SKIP
 
+If repeating samples is an issue, the parameter `smoothed_bootstrap` can be
+set to `True` to generate a smoothed bootstrap instead. However, the original
+data needs to be numerical. The `shrinkage` parameter controls the dispersion
+of the newly generated samples. The example below illustrates that the new
+samples no longer overlap once a smoothed bootstrap is used. This way of
+generating a smoothed bootstrap is also known as Random Over-Sampling
+Examples (ROSE) :cite:`torelli2014rose`.
+
+.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_003.png
+   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
+   :scale: 60
+   :align: center
+
 .. _smote_adasyn:
 
 From random over-sampling to SMOTE and ADASYN
@@ -104,7 +117,7 @@ the same manner::
 The figure below illustrates the major difference of the different
 over-sampling methods.
 
-.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_003.png
+.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_004.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
@@ -122,14 +135,14 @@ implementation of :class:`SMOTE` will not make any distinction between easy and
 hard samples to be classified using the nearest neighbors rule. Therefore, the
 decision function found during training will be different among the algorithms.
 
-.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_004.png
+.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_005.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :align: center
 
 The sampling particularities of these two algorithms can lead to some peculiar
 behavior as shown below.
 
-.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_005.png
+.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_006.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
@@ -144,7 +157,7 @@ samples. Those methods focus on samples near the border of the optimal
 decision function and will generate samples in the opposite direction of the
 nearest neighbors class. Those variants are presented in the figure below.
 
-.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_006.png
+.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_007.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
@@ -198,29 +211,14 @@ Therefore, it can be seen that the samples generated in the first and last
 columns belong to the same categories originally presented without any
 other extra interpolation.
 
-.. _rose:
-
-ROSE (Random Over-Sampling Examples)
-------------------------------------
-
-ROSE uses smoothed bootstrapping to draw artificial samples from the
-feature space neighborhood around selected classes, using a multivariate
-Gaussian kernel around randomly selected samples. First, random samples are
-selected from the original classes. Then the smoothing kernel distribution
-is computed around the samples: :math:`\hat{f}(x|y=Y_i) = \sum_{i=1}^{n_j}
-p_i \Pr(x|x_i) = \sum_{i=1}^{n_j} \frac{1}{n_j} \Pr(x|x_i) = \sum_{i=1}^{n_j}
-\frac{1}{n_j} K_{H_j}(x|x_i)`.
-
-Then new samples are drawn from the computed distribution.
-
 Mathematical formulation
 ========================
 
 Sample generation
 -----------------
 
-Both SMOTE and ADASYN use the same algorithm to generate new samples.
-Considering a sample :math:`x_i`, a new sample :math:`x_{new}` will be
+Both :class:`SMOTE` and :class:`ADASYN` use the same algorithm to generate new
+samples. Considering a sample :math:`x_i`, a new sample :math:`x_{new}` will be
 generated considering its k nearest-neighbors (corresponding to
 ``k_neighbors``). For instance, the 3 nearest-neighbors are included in the
 blue circle as illustrated in the figure below. Then, one of these
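
The "Sample generation" paragraph edited above describes how :class:`SMOTE` and :class:`ADASYN` create a synthetic point by interpolating between a sample :math:`x_i` and one of its k nearest neighbors :math:`x_{zi}`. A minimal NumPy sketch of that rule, as an illustration rather than the library's implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_min = rng.uniform(size=(20, 2))  # toy minority-class samples

# find the 3 nearest neighbors of each sample (n_neighbors=4: itself + 3)
nn = NearestNeighbors(n_neighbors=4).fit(X_min)
_, indices = nn.kneighbors(X_min)

# pick a sample x_i and one of its neighbors x_zi, then interpolate:
# x_new = x_i + lam * (x_zi - x_i) with lam drawn uniformly in [0, 1]
x_i = X_min[0]
x_zi = X_min[rng.choice(indices[0][1:])]  # skip index 0, the sample itself
lam = rng.uniform()
x_new = x_i + lam * (x_zi - x_i)

The synthetic sample therefore always lies on the segment joining :math:`x_i` to the selected neighbor.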

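The smoothed-bootstrap documentation added above corresponds to a small API change on RandomOverSampler. A usage sketch, assuming only the `smoothed_bootstrap` and `shrinkage` parameters introduced by this commit:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# plain bootstrap: minority samples are repeated verbatim
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)

# smoothed bootstrap: repeated samples are perturbed; `shrinkage` controls the
# dispersion of the newly generated samples (numerical data only)
ros_smooth = RandomOverSampler(
    smoothed_bootstrap=True, shrinkage=0.2, random_state=0
)
X_res, y_res = ros_smooth.fit_resample(X, y)
print(sorted(Counter(y_res).items()))  # classes are now balanced
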
doc/whats_new/v0.7.rst

Lines changed: 6 additions & 2 deletions

@@ -72,8 +72,12 @@ Enhancements
 - Lazy import `keras` module when importing `imblearn.keras`
   :pr:`719` by :user:`Guillaume Lemaitre <glemaitre>`.
 
-- Added Random Over-Sampling Examples (ROSE) class.
-  :pr:`754` by :user:`Andrea Lorenzon <andrealorenzon>`.
+- Added an option to generate smoothed bootstrap in
+  :class:`imblearn.over_sampling.RandomOverSampler`. It is controlled by the
+  parameters `smoothed_bootstrap` and `shrinkage`. This method is also known as
+  Random Over-Sampling Examples (ROSE).
+  :pr:`754` by :user:`Andrea Lorenzon <andrealorenzon>` and
+  :user:`Guillaume Lemaitre <glemaitre>`.
 
 - Add option `output_dict` in
   :func:`imblearn.metrics.classification_report_imbalanced` to return a
examples/over-sampling/plot_comparison_over_sampling.py

Lines changed: 43 additions & 33 deletions

@@ -106,16 +106,15 @@ def plot_decision_function(X, y, clf, ax):
 # data using a linear SVM classifier. The greater the difference between the
 # number of samples in each class, the poorer the classification results.
 
-fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
+fig, axs = plt.subplots(2, 2, figsize=(15, 12))
 
-ax_arr = (ax1, ax2, ax3, ax4)
 weights_arr = (
     (0.01, 0.01, 0.98),
     (0.01, 0.05, 0.94),
     (0.2, 0.1, 0.7),
     (0.33, 0.33, 0.33),
 )
-for ax, weights in zip(ax_arr, weights_arr):
+for ax, weights in zip(axs.ravel(), weights_arr):
     X, y = create_dataset(n_samples=1000, weights=weights)
     clf = LinearSVC().fit(X, y)
     plot_decision_function(X, y, clf, ax)
@@ -129,20 +128,40 @@ def plot_decision_function(X, y, clf, ax):
 ###############################################################################
 # Random over-sampling can be used to repeat some samples and balance the
 # number of samples between the classes. It can be seen that with this trivial
-# approach the boundary decision is already less biaised toward the majority
+# approach the decision boundary is already less biased toward the majority
 # class.
 
-fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))
+fig, axs = plt.subplots(1, 2, figsize=(15, 7))
 X, y = create_dataset(n_samples=10000, weights=(0.01, 0.05, 0.94))
 clf = LinearSVC().fit(X, y)
-plot_decision_function(X, y, clf, ax1)
-ax1.set_title(f"Linear SVC with y={Counter(y)}")
+plot_decision_function(X, y, clf, axs[0])
+axs[0].set_title(f"Linear SVC with y={Counter(y)}")
 pipe = make_pipeline(RandomOverSampler(random_state=0), LinearSVC())
 pipe.fit(X, y)
-plot_decision_function(X, y, pipe, ax2)
-ax2.set_title("Decision function for RandomOverSampler")
+plot_decision_function(X, y, pipe, axs[1])
+axs[1].set_title("Decision function for RandomOverSampler")
 fig.tight_layout()
 
+###############################################################################
+# By default, random over-sampling generates a bootstrap. The parameter
+# `smoothed_bootstrap` allows adding a small perturbation to the generated data
+# to obtain a smoothed bootstrap instead. The plot below shows the difference
+# between the two data generation strategies.
+
+fig, axs = plt.subplots(1, 2, figsize=(15, 7))
+sampler = RandomOverSampler(random_state=0)
+plot_resampling(X, y, sampler, ax=axs[0])
+axs[0].set_title("RandomOverSampler with normal bootstrap")
+sampler = RandomOverSampler(smoothed_bootstrap=True, shrinkage=0.2, random_state=0)
+plot_resampling(X, y, sampler, ax=axs[1])
+axs[1].set_title("RandomOverSampler with smoothed bootstrap")
+fig.tight_layout()
+
+###############################################################################
+# More samples appear to be generated with the smoothed bootstrap. This is
+# because the generated samples no longer superimpose the original
+# samples.
+#
 ###############################################################################
 # More advanced over-sampling using ADASYN and SMOTE
 ###############################################################################
@@ -161,16 +180,15 @@ def _fit_resample(self, X, y):
         return X, y
 
 
-fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 15))
+fig, axs = plt.subplots(2, 2, figsize=(15, 15))
 X, y = create_dataset(n_samples=10000, weights=(0.01, 0.05, 0.94))
 sampler = FakeSampler()
 clf = make_pipeline(sampler, LinearSVC())
-plot_resampling(X, y, sampler, ax1)
-ax1.set_title(f"Original data - y={Counter(y)}")
+plot_resampling(X, y, sampler, axs[0, 0])
+axs[0, 0].set_title(f"Original data - y={Counter(y)}")
 
-ax_arr = (ax2, ax3, ax4)
 for ax, sampler in zip(
-    ax_arr,
+    axs.ravel()[1:],
     (
         RandomOverSampler(random_state=0),
         SMOTE(random_state=0),
@@ -189,33 +207,32 @@ def _fit_resample(self, X, y):
 # nearest-neighbors rule while regular SMOTE will not make any distinction.
 # Therefore, the decision function differs depending on the algorithm.
 
-fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))
+fig, axs = plt.subplots(1, 3, figsize=(20, 6))
 X, y = create_dataset(n_samples=10000, weights=(0.01, 0.05, 0.94))
 
 clf = LinearSVC().fit(X, y)
-plot_decision_function(X, y, clf, ax1)
-ax1.set_title(f"Linear SVC with y={Counter(y)}")
+plot_decision_function(X, y, clf, axs[0])
+axs[0].set_title(f"Linear SVC with y={Counter(y)}")
 sampler = SMOTE()
 clf = make_pipeline(sampler, LinearSVC())
 clf.fit(X, y)
-plot_decision_function(X, y, clf, ax2)
-ax2.set_title(f"Decision function for {sampler.__class__.__name__}")
+plot_decision_function(X, y, clf, axs[1])
+axs[1].set_title(f"Decision function for {sampler.__class__.__name__}")
 sampler = ADASYN()
 clf = make_pipeline(sampler, LinearSVC())
 clf.fit(X, y)
-plot_decision_function(X, y, clf, ax3)
-ax3.set_title(f"Decision function for {sampler.__class__.__name__}")
+plot_decision_function(X, y, clf, axs[2])
+axs[2].set_title(f"Decision function for {sampler.__class__.__name__}")
 fig.tight_layout()
 
 ###############################################################################
 # Due to those sampling particularities, it can give rise to some specific
 # issues as illustrated below.
 
-fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 15))
+fig, axs = plt.subplots(2, 2, figsize=(15, 15))
 X, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94), class_sep=0.8)
 
-ax_arr = ((ax1, ax2), (ax3, ax4))
-for ax, sampler in zip(ax_arr, (SMOTE(random_state=0), ADASYN(random_state=0))):
+for ax, sampler in zip(axs, (SMOTE(random_state=0), ADASYN(random_state=0))):
     clf = make_pipeline(sampler, LinearSVC())
     clf.fit(X, y)
     plot_decision_function(X, y, clf, ax[0])
@@ -232,16 +249,11 @@ def _fit_resample(self, X, y):
 # the KMeans version will perform a clustering before generating samples in
 # each cluster independently, depending on each cluster density.
 
-(
-    fig,
-    ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)),
-) = plt.subplots(5, 2, figsize=(15, 30))
+fig, axs = plt.subplots(5, 2, figsize=(15, 30))
 X, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94), class_sep=0.8)
 
-
-ax_arr = ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10))
 for ax, sampler in zip(
-    ax_arr,
+    axs,
     (
         SMOTE(random_state=0),
         BorderlineSMOTE(random_state=0, kind="borderline-1"),
@@ -282,5 +294,3 @@ def _fit_resample(self, X, y):
 print(sorted(Counter(y_resampled).items()))
 print("SMOTE-NC will generate categories for the categorical features:")
 print(X_resampled[-5:])
-
-plt.show()
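
The example added above contrasts a normal and a smoothed bootstrap. A minimal NumPy sketch of the underlying idea (bootstrap resampling plus a Gaussian perturbation around the resampled points, following the ROSE description removed from doc/over_sampling.rst); the noise scale is an arbitrary illustration, not the library's shrinkage formula:

import numpy as np

rng = np.random.RandomState(0)
X_minority = rng.normal(size=(10, 2))  # toy minority-class samples
n_needed = 50                          # number of new samples to generate

# plain bootstrap: rows drawn with replacement superimpose the originals
bootstrap = X_minority[rng.randint(0, len(X_minority), size=n_needed)]

# smoothed bootstrap: add Gaussian noise around each resampled point so that
# the new samples no longer superimpose the originals; `scale` plays the role
# of a smoothing factor (illustrative value only)
scale = 0.2
smoothed = bootstrap + rng.normal(scale=scale, size=bootstrap.shape)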
