@@ -211,6 +211,44 @@ Therefore, it can be seen that the samples generated in the first and last
 columns are belonging to the same categories originally presented without any
 other extra interpolation.
 
+However, :class:`SMOTENC` only works with data composed of both continuous and
+categorical features. When the data contain only nominal categorical features,
+one can use the :class:`SMOTEN` variant :cite:`chawla2002smote`. The algorithm
+changes in two ways:
+
+* the nearest neighbors search does not rely on the Euclidean distance. Instead,
+  the value difference metric (VDM), also implemented in the class
+  :class:`~imblearn.metrics.ValueDifferenceMetric`, is used.
+* the new sample generation is based on a majority vote per feature, which
+  selects the most common category seen among the neighboring samples.
+
+Let's take the following example::
+
+   >>> import numpy as np
+   >>> X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7,
+   ...              dtype=object).reshape(-1, 1)
+   >>> y = np.array(["apple"] * 5 + ["not apple"] * 3 + ["apple"] * 7 +
+   ...              ["not apple"] * 5 + ["apple"] * 2, dtype=object)
+
+We generate a dataset associating a color with being an apple or not an apple.
+"green" and "red" are mostly associated with "apple", while "blue" is mostly
+associated with "not apple". Since "not apple" is the minority class, we expect
+the newly generated samples to belong to the category "blue"::
+
+   >>> from imblearn.over_sampling import SMOTEN
+   >>> sampler = SMOTEN(random_state=0)
+   >>> X_res, y_res = sampler.fit_resample(X, y)
+   >>> X_res[y.size:]
+   array([['blue'],
+          ['blue'],
+          ['blue'],
+          ['blue'],
+          ['blue'],
+          ['blue']], dtype=object)
+   >>> y_res[y.size:]
+   array(['not apple', 'not apple', 'not apple', 'not apple', 'not apple',
+          'not apple'], dtype=object)
+
 Mathematical formulation
 ========================
 
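The two changes listed in the hunk above (a VDM-based neighbor search and a per-feature majority vote) can be illustrated with a minimal standard-library sketch. It is not the imbalanced-learn implementation: the helper names `conditional_class_probs`, `vdm_feature_distance`, and `majority_vote` are hypothetical, the exponent `k=1` is an assumption, and the library class may combine per-feature distances and handle parameters differently. The data reuse the color/apple example from the added documentation.

```python
from collections import Counter, defaultdict


def conditional_class_probs(values, labels):
    """Estimate P(class | category) for a single categorical feature."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    classes = sorted(set(labels))
    return {
        value: {c: counter[c] / sum(counter.values()) for c in classes}
        for value, counter in counts.items()
    }


def vdm_feature_distance(probs, a, b, k=1):
    """Value difference between categories a and b:
    sum over classes of |P(c | a) - P(c | b)| ** k (k=1 is an assumption)."""
    return sum(abs(probs[a][c] - probs[b][c]) ** k for c in probs[a])


def majority_vote(neighbors):
    """Return, feature by feature, the most common category among neighbors."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*neighbors)]


# Same toy data as in the documentation example (one categorical feature).
colors = ["green"] * 5 + ["red"] * 10 + ["blue"] * 7
labels = (["apple"] * 5 + ["not apple"] * 3 + ["apple"] * 7
          + ["not apple"] * 5 + ["apple"] * 2)

probs = conditional_class_probs(colors, labels)
d_green_red = vdm_feature_distance(probs, "green", "red")
d_green_blue = vdm_feature_distance(probs, "green", "blue")

# Hypothetical neighborhood of three single-feature samples.
new_sample = majority_vote([["blue"], ["blue"], ["red"]])
```

In this sketch "green" ends up closer to "red" than to "blue" because both are mostly apples, which is the kind of class-aware similarity that the Euclidean distance cannot express for nominal categories. The majority vote then copies the dominant category of the neighborhood instead of interpolating between values.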