Commit 57f839f

DOC add whats new
1 parent f648089 commit 57f839f

3 files changed, with 71 additions and 5 deletions

doc/over_sampling.rst

Lines changed: 38 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -211,6 +211,44 @@ Therefore, it can be seen that the samples generated in the first and last
211211
columns are belonging to the same categories originally presented without any
212212
other extra interpolation.
213213

214+
However, :class:`SMOTENC` only works with data that mix numerical and
215+
categorical features. When data are made of only nominal categorical features, one can use the
216+
:class:`SMOTEN` variant :cite:`chawla2002smote`. The algorithm changes in
217+
two ways:
218+
219+
* the nearest neighbors search does not rely on the Euclidean distance. Instead,
220+
  the value difference metric (VDM), also implemented in the class
221+
  :class:`~imblearn.metrics.ValueDifferenceMetric`, is used.
222+
* a new sample is generated by a majority vote per feature: each feature takes
223+
  the most common category seen among the neighboring samples.
224+
225+
Let's take the following example::
226+
227+
   >>> import numpy as np
228+
   >>> X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7,
229+
   ...              dtype=object).reshape(-1, 1)
230+
   >>> y = np.array(["apple"] * 5 + ["not apple"] * 3 + ["apple"] * 7 +
231+
   ...              ["not apple"] * 5 + ["apple"] * 2, dtype=object)
232+
233+
We generate a dataset associating a color with being an apple or not an apple.
234+
We strongly associate "green" and "red" with being an apple. Most samples of
235+
the minority class, "not apple", are "blue", so we expect the newly generated
236+
samples to belong to the category "blue"::
237+
238+
   >>> from imblearn.over_sampling import SMOTEN
239+
   >>> sampler = SMOTEN(random_state=0)
240+
   >>> X_res, y_res = sampler.fit_resample(X, y)
241+
   >>> X_res[y.size:]
242+
   array([['blue'],
243+
          ['blue'],
244+
          ['blue'],
245+
          ['blue'],
246+
          ['blue'],
247+
          ['blue']], dtype=object)
248+
   >>> y_res[y.size:]
249+
   array(['not apple', 'not apple', 'not apple', 'not apple', 'not apple',
250+
          'not apple'], dtype=object)
251+
214252
Mathematical formulation
215253
========================
216254

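Reader's note on the documentation change above: the value difference metric can be illustrated on the same color/apple data. The following is only a minimal sketch using the standard library, not the implementation in :class:`~imblearn.metrics.ValueDifferenceMetric`. The helper name `vdm_category_distance` and its exponent parameter `k` are hypothetical, and the sketch assumes a single feature, where the distance between two categories is the sum over classes of the absolute difference of the conditional class frequencies.

```python
from collections import Counter, defaultdict


def vdm_category_distance(values, labels, a, b, k=1):
    """Hypothetical helper: VDM-style distance between categories a and b.

    Computes sum over classes c of |P(c | a) - P(c | b)| ** k for one feature.
    """
    # Count how often each class label occurs for each category value.
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1

    total_a = sum(counts[a].values())
    total_b = sum(counts[b].values())
    return sum(
        abs(counts[a][c] / total_a - counts[b][c] / total_b) ** k
        for c in set(labels)
    )


# Same data as in the documentation example, as plain Python lists.
colors = ["green"] * 5 + ["red"] * 10 + ["blue"] * 7
fruit = (
    ["apple"] * 5
    + ["not apple"] * 3
    + ["apple"] * 7
    + ["not apple"] * 5
    + ["apple"] * 2
)

d_green_red = vdm_category_distance(colors, fruit, "green", "red")
d_green_blue = vdm_category_distance(colors, fruit, "green", "blue")
d_red_blue = vdm_category_distance(colors, fruit, "red", "blue")
```

Under these assumptions, "green" and "red" end up close to each other because both are mostly "apple", while "blue" is far from both because it is mostly "not apple".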
doc/whats_new/v0.8.rst

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,10 @@ New features
1919
compute pairwise distances between samples containing only nominal values.
2020
:pr:`796` by :user:`Guillaume Lemaitre <glemaitre>`.
2121

22+
- Add the class :class:`imblearn.over_sampling.SMOTEN` to over-sample data
23+
only containing nominal categorical features.
24+
:pr:`802` by :user:`Guillaume Lemaitre <glemaitre>`.
25+
2226
Enhancements
2327
............
2428

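Reader's note on the new `SMOTEN` entry: the per-feature majority vote that the documentation change describes for generating a sample can be sketched as follows. This is an illustration under stated assumptions, not the library's code. The function name `majority_vote`, the second feature ("round"/"oval") and the neighbor list are hypothetical, and the neighbors are assumed to have already been selected by a nearest-neighbors search.

```python
from collections import Counter


def majority_vote(neighbors):
    """Hypothetical helper: return the most common category for each feature."""
    # zip(*neighbors) iterates over features (columns) instead of samples (rows).
    return [Counter(column).most_common(1)[0][0] for column in zip(*neighbors)]


# Three made-up neighbors described by two nominal features.
neighbors = [
    ["blue", "round"],
    ["blue", "oval"],
    ["red", "round"],
]
new_sample = majority_vote(neighbors)
```

Because each feature is voted on independently, the generated sample always uses categories that already exist in the data, which is why no interpolation between categories is needed.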
imblearn/over_sampling/tests/test_smoten.py

Lines changed: 29 additions & 5 deletions
@@ -1,5 +1,3 @@
1-
from collections import Counter
2-
31
import numpy as np
42
import pytest
53

@@ -22,9 +20,35 @@ def data():
2220

2321

2422
def test_smoten(data):
23+
    # overall check for SMOTEN
2524
    X, y = data
26-
    print(X, y)
2725
    sampler = SMOTEN(random_state=0)
2826
    X_res, y_res = sampler.fit_resample(X, y)
29-
    print(X_res, y_res)
30-
    print(Counter(y_res))
27+
28+
    assert X_res.shape == (80, 3)
29+
    assert y_res.shape == (80,)
30+
31+
32+
def test_smoten_resampling():
33+
    # check that SMOTEN resamples data as expected
34+
    # we generate data such that "not apple" will be the minority class and
35+
    # samples from this class will be generated. We will force the "blue"
36+
    # category to be associated with this class. Therefore, the newly generated
37+
    # samples should also be from the "blue" category.
38+
    X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7, dtype=object).reshape(
39+
        -1, 1
40+
    )
41+
    y = np.array(
42+
        ["apple"] * 5
43+
        + ["not apple"] * 3
44+
        + ["apple"] * 7
45+
        + ["not apple"] * 5
46+
        + ["apple"] * 2,
47+
        dtype=object,
48+
    )
49+
    sampler = SMOTEN(random_state=0)
50+
    X_res, y_res = sampler.fit_resample(X, y)
51+
52+
    X_generated, y_generated = X_res[X.shape[0] :], y_res[X.shape[0] :]
53+
    np.testing.assert_array_equal(X_generated, "blue")
54+
    np.testing.assert_array_equal(y_generated, "not apple")

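Reader's note on `test_smoten_resampling`: the number of generated samples checked by the test can be worked out from the class counts alone. The sketch below uses only the standard library and assumes that the sampler's default strategy over-samples the minority class until it matches the majority class. The variable names are illustrative.

```python
from collections import Counter

# Same target as in the test, as a plain Python list.
y = (
    ["apple"] * 5
    + ["not apple"] * 3
    + ["apple"] * 7
    + ["not apple"] * 5
    + ["apple"] * 2
)

class_counts = Counter(y)
majority_count = max(class_counts.values())
minority_count = min(class_counts.values())

# Assumption: the minority class is over-sampled up to the majority count.
n_generated = majority_count - minority_count
n_total_after_resampling = len(y) + n_generated
```

This matches the six "blue" / "not apple" rows shown in the documentation example, which are the rows sliced off after the original 22 samples.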