.. currentmodule:: imblearn.metrics

Classification metrics
----------------------

Currently, scikit-learn only offers the
``sklearn.metrics.balanced_accuracy_score`` (in 0.20) as a metric to deal with
imbalanced datasets. The module :mod:`imblearn.metrics` offers a couple of
other metrics that are used in the literature to evaluate the quality of
classifiers.

.. _sensitivity_specificity:

Sensitivity and specificity metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sensitivity and specificity are metrics which are well known in medical
imaging. Sensitivity (also called true positive rate or recall) is the
proportion of the positive samples that are correctly classified, while
specificity (also called true negative rate) is the proportion of the negative
samples that are correctly classified. The functions
:func:`sensitivity_specificity_support`, :func:`sensitivity_score`, and
:func:`specificity_score` make it possible to use those metrics.
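
As a minimal sketch on a small toy problem (values are rounded here for
readability), these metrics could be computed directly from the true and
predicted labels::

    >>> from imblearn.metrics import sensitivity_score, specificity_score
    >>> y_true = [0, 0, 0, 1, 1, 1]
    >>> y_pred = [0, 0, 1, 1, 1, 0]
    >>> # 2 of the 3 positive samples are retrieved
    >>> round(float(sensitivity_score(y_true, y_pred)), 2)
    0.67
    >>> # 2 of the 3 negative samples are correctly rejected
    >>> round(float(specificity_score(y_true, y_pred)), 2)
    0.67
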
.. _imbalanced_metrics:

Additional metrics specific to imbalanced datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`geometric_mean_score`
:cite:`barandela2003strategies,kubat1997addressing` is the root of the product
of the class-wise sensitivities: this measure tries to maximize the accuracy on
each of the classes while keeping these accuracies balanced.

In addition, :func:`make_index_balanced_accuracy` can wrap any metric and give
more importance to a specific class using the parameter ``alpha``.
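
A minimal sketch on a toy binary problem could look as follows (the value of
the wrapped metric is not shown, and the displayed value is rounded here for
readability)::

    >>> from imblearn.metrics import geometric_mean_score
    >>> from imblearn.metrics import make_index_balanced_accuracy
    >>> y_true = [0, 0, 0, 1, 1, 1]
    >>> y_pred = [0, 0, 1, 1, 1, 0]
    >>> # geometric mean of the per-class sensitivities (2/3 for each class here)
    >>> round(float(geometric_mean_score(y_true, y_pred)), 2)
    0.67
    >>> # wrap the metric to weight it by the index balanced accuracy
    >>> iba_factory = make_index_balanced_accuracy(alpha=0.1, squared=True)
    >>> iba_gmean = iba_factory(geometric_mean_score)
    >>> score = iba_gmean(y_true, y_pred)
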
.. _macro_averaged_mean_absolute_error:

Macro-Averaged Mean Absolute Error (MA-MAE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ordinal classification is used when there is a rank among classes, for example
levels of functionality or movie ratings.

The :func:`macro_averaged_mean_absolute_error` is used for imbalanced ordinal
classification: the mean absolute error is computed for each class and averaged
over classes, giving an equal weight to each class.
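
A minimal sketch on toy ordinal labels could look as follows (the value is
rounded here for readability)::

    >>> from imblearn.metrics import macro_averaged_mean_absolute_error
    >>> y_true = [1, 1, 1, 2, 2, 3]
    >>> y_pred = [1, 2, 1, 2, 3, 3]
    >>> # per-class MAE: 1/3 for class 1, 1/2 for class 2, 0 for class 3
    >>> round(float(macro_averaged_mean_absolute_error(y_true, y_pred)), 2)
    0.28
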
.. _classification_report:

Summary of important metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`classification_report_imbalanced` function computes a set of metrics
per class and summarizes them in a table. The parameter ``output_dict`` allows
getting either a string or a Python dictionary. This dictionary can then be
reused, for instance to build a pandas dataframe.
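
As a small sketch (the exact content of the report is not detailed here)::

    >>> from imblearn.metrics import classification_report_imbalanced
    >>> y_true = [0, 0, 0, 1, 1, 1]
    >>> y_pred = [0, 0, 1, 1, 1, 0]
    >>> # by default, a formatted string is returned
    >>> report = classification_report_imbalanced(y_true, y_pred)
    >>> isinstance(report, str)
    True
    >>> # with output_dict=True a dictionary is returned instead; its entries
    >>> # can then be loaded into a pandas dataframe for further processing
    >>> report_dict = classification_report_imbalanced(y_true, y_pred, output_dict=True)
    >>> isinstance(report_dict, dict)
    True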

.. _pairwise_metrics:

Pairwise metrics
----------------

The :mod:`imblearn.metrics.pairwise` submodule implements pairwise distances
that are not available in scikit-learn but are used by some of the methods in
imbalanced-learn.

.. _vdm:

Value Difference Metric
~~~~~~~~~~~~~~~~~~~~~~~

The class :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric` implements
the Value Difference Metric proposed in :cite:`stanfill1986toward`. This
measure is used to compute the proximity of two samples that are composed only
of nominal values.

Given a single feature, categories that correlate similarly with the target
vector are considered closer. Let's give an example, taken from
:cite:`wilson1997improved`, to illustrate this behaviour. `X` will be
represented by a single feature which will be some color, and the target will
be whether or not a sample is an apple::

    >>> import numpy as np
    >>> X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
    >>> y = ["apple"] * 8 + ["not apple"] * 5 + ["apple"] * 7 + ["not apple"] * 9 + ["apple"]

In this dataset, the categories "red" and "green" are more correlated with the
target `y` and should therefore have a smaller distance between them than with
the category "blue". We illustrate this behaviour below. Be aware that we need
to encode `X` to work with numerical values::

    >>> from sklearn.preprocessing import OrdinalEncoder
    >>> encoder = OrdinalEncoder(dtype=np.int32)
    >>> X_encoded = encoder.fit_transform(X)

Now, we can compute the distance between three different samples representing
the different categories::

    >>> from imblearn.metrics.pairwise import ValueDifferenceMetric
    >>> vdm = ValueDifferenceMetric().fit(X_encoded, y)
    >>> X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
    >>> X_test_encoded = encoder.transform(X_test)
    >>> vdm.pairwise(X_test_encoded)
    array([[ 0.  ,  0.04,  1.96],
           [ 0.04,  0.  ,  1.44],
           [ 1.96,  1.44,  0.  ]])

We see that the minimum distance happens when the categories "red" and "green"
are compared. Whenever either of them is compared with "blue", the distance is
much larger.

**Mathematical formulation**

The distance between feature values of two samples is defined as:

.. math::

    \delta(x, y) = \sum_{c=1}^{C} |p(c|x_{f}) - p(c|y_{f})|^{k} \ ,

where :math:`x` and :math:`y` are two samples, :math:`f` is a given feature,
:math:`C` is the number of classes, :math:`p(c|x_{f})` is the conditional
probability that the output class is :math:`c` given that the feature
:math:`f` takes the value :math:`x_{f}`, and :math:`k` is an exponent usually
set to 1 or 2.

The distance for the feature vectors :math:`X` and :math:`Y` is subsequently
defined as:

.. math::

    \Delta(X, Y) = \sum_{f=1}^{F} \delta(X_{f}, Y_{f})^{r} \ ,

where :math:`F` is the number of features and :math:`r` is an exponent usually
set to 1 or 2.
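
To connect the formulas with the example above, assuming the exponents
:math:`k = 1` and :math:`r = 2` (which is consistent with the pairwise
distances printed earlier), the "green" versus "red" entry can be recomputed by
hand from the conditional probabilities::

    >>> import numpy as np
    >>> # p(apple|color) and p(not apple|color) estimated from the 30 samples
    >>> # above: 8 apples out of 10 for "green" and 7 out of 10 for "red"
    >>> p_green = np.array([0.8, 0.2])
    >>> p_red = np.array([0.7, 0.3])
    >>> delta = np.abs(p_green - p_red).sum()  # per-feature distance with k=1
    >>> round(float(delta ** 2), 2)            # single feature, so Delta = delta ** r
    0.04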