DOC add whats new

glemaitre · glemaitre · commit 57f839f07d12 · 2021-02-15T23:53:02.000+01:00
diff --git a/doc/over_sampling.rst b/doc/over_sampling.rst
@@ -211,6 +211,44 @@ Therefore, it can be seen that the samples generated in the first and last
 columns are belonging to the same categories originally presented without any
 other extra interpolation.
 
+However, :class:`SMOTENC` only works when the data mix numerical and
+categorical features. When data are made of only nominal categorical data, one
+can use the :class:`SMOTEN` variant :cite:`chawla2002smote`. The algorithm
+changes in two ways:
+
+* the nearest neighbors search does not rely on the Euclidean distance.
+  Instead, the value difference metric (VDM), also implemented in the class
+  :class:`~imblearn.metrics.ValueDifferenceMetric`, is used.
+* the new sample generation is based on a majority vote per feature to
+  generate the most common category seen in the neighboring samples.
+
+Let's take the following example::
+
+   >>> import numpy as np
+   >>> X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7,
+   ...              dtype=object).reshape(-1, 1)
+   >>> y = np.array(["apple"] * 5 + ["not apple"] * 3 + ["apple"] * 7 +
+   ...              ["not apple"] * 5 + ["apple"] * 2, dtype=object)
+
+We generate a dataset associating a color with being an apple or not an apple.
+We strongly associate "green" and "red" with being an apple. The minority class
+is "not apple" and is mostly "blue", so we expect the newly generated samples
+to belong to the category "blue"::
+
+   >>> from imblearn.over_sampling import SMOTEN
+   >>> sampler = SMOTEN(random_state=0)
+   >>> X_res, y_res = sampler.fit_resample(X, y)
+   >>> X_res[y.size:]
+   array([['blue'],
+          ['blue'],
+          ['blue'],
+          ['blue'],
+          ['blue'],
+          ['blue']], dtype=object)
+   >>> y_res[y.size:]
+   array(['not apple', 'not apple', 'not apple', 'not apple', 'not apple',
+          'not apple'], dtype=object)
+
 Mathematical formulation
 ========================
 
diff --git a/doc/whats_new/v0.8.rst b/doc/whats_new/v0.8.rst
@@ -19,6 +19,10 @@ New features
   compute pairwise distances between samples containing only nominal values.
   :pr:`796` by :user:`Guillaume Lemaitre <glemaitre>`.
 
+- Add the class :class:`imblearn.over_sampling.SMOTEN` to over-sample data
+  containing only nominal categorical features.
+  :pr:`802` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Enhancements
 ............
 
diff --git a/imblearn/over_sampling/tests/test_smoten.py b/imblearn/over_sampling/tests/test_smoten.py
@@ -1,5 +1,3 @@
-from collections import Counter
-
 import numpy as np
 import pytest
 
@@ -22,9 +20,35 @@ def data():
 
 
 def test_smoten(data):
+    # overall check for SMOTEN
     X, y = data
-    print(X, y)
     sampler = SMOTEN(random_state=0)
     X_res, y_res = sampler.fit_resample(X, y)
-    print(X_res, y_res)
-    print(Counter(y_res))
+
+    assert X_res.shape == (80, 3)
+    assert y_res.shape == (80,)
+
+
+def test_smoten_resampling():
+    # check that SMOTEN resamples data as expected
+    # we generate data such that "not apple" will be the minority class and
+    # samples from this class will be generated. We will force the "blue"
+    # category to be associated with this class. Therefore, the newly generated
+    # samples should also be from the "blue" category.
+    X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7, dtype=object).reshape(
+        -1, 1
+    )
+    y = np.array(
+        ["apple"] * 5
+        + ["not apple"] * 3
+        + ["apple"] * 7
+        + ["not apple"] * 5
+        + ["apple"] * 2,
+        dtype=object,
+    )
+    sampler = SMOTEN(random_state=0)
+    X_res, y_res = sampler.fit_resample(X, y)
+
+    X_generated, y_generated = X_res[X.shape[0] :], y_res[X.shape[0] :]
+    np.testing.assert_array_equal(X_generated, "blue")
+    np.testing.assert_array_equal(y_generated, "not apple")
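
Note on the majority-vote generation step described in the `doc/over_sampling.rst` hunk: the following is a minimal sketch of the idea only, not the `SMOTEN` implementation. The helper `majority_vote` and the neighborhood below are hypothetical, chosen to mirror the documentation example, where the eight "not apple" samples are three "red" and five "blue". Assuming five neighbors are drawn from the minority class, a "red" seed would see its two fellow "red" samples plus three "blue" ones, so the vote would still yield "blue".

```python
from collections import Counter


def majority_vote(neighbors):
    """Return, for each feature, the most common category among the neighbors.

    Hypothetical helper for illustration; not part of imbalanced-learn.
    """
    # zip(*neighbors) iterates over features (columns) rather than samples (rows).
    return [Counter(column).most_common(1)[0][0] for column in zip(*neighbors)]


# Assumed neighborhood of a "red" minority sample: 2 "red" and 3 "blue" neighbors.
neighbors = [["red"], ["red"], ["blue"], ["blue"], ["blue"]]
new_sample = majority_vote(neighbors)
```

With more than one feature, each column is voted on independently, so the generated sample can combine categories that never appeared together in a single neighbor.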