Merged
8 changes: 8 additions & 0 deletions .flake8
@@ -0,0 +1,8 @@
[flake8]
max-line-length = 88
# Default flake8 3.5 ignored flags
ignore=E121,E123,E126,E226,E24,E704,W503,W504,E203
# It's fine not to put the import at the top of the file in the examples
# folder.
per-file-ignores =
examples/*: E402
20 changes: 20 additions & 0 deletions doc/api.rst
@@ -205,6 +205,10 @@ Imbalance-learn provides some fast-prototyping tools.

.. currentmodule:: imblearn

Classification metrics
----------------------
See the :ref:`metrics` section of the user guide for further details.

.. autosummary::
:toctree: generated/
:template: function.rst
@@ -217,6 +221,22 @@ Imbalance-learn provides some fast-prototyping tools.
metrics.macro_averaged_mean_absolute_error
metrics.make_index_balanced_accuracy

Pairwise metrics
----------------
See the :ref:`pairwise_metrics` section of the user guide for further details.

.. automodule:: imblearn.metrics.pairwise
:no-members:
:no-inherited-members:

.. currentmodule:: imblearn

.. autosummary::
:toctree: generated/
:template: class.rst

metrics.pairwise.ValueDifferenceMetric

.. _datasets_ref:

:mod:`imblearn.datasets`: Datasets
22 changes: 21 additions & 1 deletion doc/bibtex/refs.bib
@@ -223,4 +223,24 @@ @article{esuli2009ordinal
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {dec}
}
}

@article{stanfill1986toward,
title={Toward memory-based reasoning},
author={Stanfill, Craig and Waltz, David},
journal={Communications of the ACM},
volume={29},
number={12},
pages={1213--1228},
year={1986},
publisher={ACM New York, NY, USA}
}

@article{wilson1997improved,
title={Improved heterogeneous distance functions},
author={Wilson, D Randall and Martinez, Tony R},
journal={Journal of artificial intelligence research},
volume={6},
pages={1--34},
year={1997}
}
12 changes: 7 additions & 5 deletions doc/conf.py
@@ -21,7 +21,6 @@
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath("sphinxext"))
from github_link import make_linkcode_resolve
import sphinx_gallery

# -- General configuration ------------------------------------------------

@@ -44,7 +43,7 @@
]

# bibtex file
bibtex_bibfiles = ['bibtex/refs.bib']
bibtex_bibfiles = ["bibtex/refs.bib"]

# this is needed for some reason...
# see https://github.com/numpy/numpydoc/issues/69
@@ -77,8 +76,8 @@
master_doc = "index"

# General information about the project.
project = 'imbalanced-learn'
copyright = '2014-2020, The imbalanced-learn developers'
project = "imbalanced-learn"
copyright = "2014-2020, The imbalanced-learn developers"

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@@ -260,7 +259,10 @@

# intersphinx configuration
intersphinx_mapping = {
"python": ("https://docs.python.org/{.major}".format(sys.version_info), None,),
"python": (
"https://docs.python.org/{.major}".format(sys.version_info),
None,
),
"numpy": ("https://docs.scipy.org/doc/numpy/", None),
"scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
"matplotlib": ("https://matplotlib.org/", None),
86 changes: 82 additions & 4 deletions doc/metrics.rst
@@ -6,6 +6,9 @@ Metrics

.. currentmodule:: imblearn.metrics

Classification metrics
----------------------

Currently, scikit-learn only offers the
``sklearn.metrics.balanced_accuracy_score`` (in 0.20) as a metric to deal with
imbalanced datasets. The module :mod:`imblearn.metrics` offers a couple of
@@ -15,7 +18,7 @@ classifiers.
.. _sensitivity_specificity:

Sensitivity and specificity metrics
-----------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sensitivity and specificity are metrics which are well known in medical
imaging. Sensitivity (also called true positive rate or recall) is the
@@ -34,7 +37,7 @@ use those metrics.
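The sensitivity and specificity discussed in this hunk can be sketched directly from confusion-matrix counts. The labels below are made up for illustration; imblearn exposes the same quantities through its own metric functions.

```python
import numpy as np

# Toy binary labels (illustrative only): 1 = positive class.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives

sensitivity = tp / (tp + fn)  # true positive rate, a.k.a. recall
specificity = tn / (tn + fp)  # true negative rate
```

On an imbalanced problem, reporting both quantities avoids the trap of a high accuracy obtained by always predicting the majority class.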
.. _imbalanced_metrics:

Additional metrics specific to imbalanced datasets
--------------------------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`geometric_mean_score`
:cite:`barandela2003strategies,kubat1997addressing` is the root of the product
@@ -48,7 +51,7 @@ parameter ``alpha``.
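The geometric mean described here, the root of the product of the per-class recalls, can be sketched in a few lines of numpy. This is an illustrative re-implementation under that definition, not imblearn's :func:`geometric_mean_score` itself.

```python
import numpy as np

def geometric_mean_recall(y_true, y_pred):
    """Root of the product of the recall obtained on each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.prod(recalls) ** (1 / len(recalls)))

# Majority class 0 is mostly well predicted; minority class 1 only half
# the time, which drags the geometric mean down.
score = geometric_mean_recall([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
```

Because the recalls are multiplied, a classifier that ignores the minority class (recall 0) gets a score of 0 regardless of its majority-class performance.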
.. _macro_averaged_mean_absolute_error:

Macro-Averaged Mean Absolute Error (MA-MAE)
-------------------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ordinal classification is used when there is a rank among classes, for example
levels of functionality or movie ratings.
@@ -60,9 +63,84 @@ each class and averaged over classes, giving an equal weight to each class.
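A minimal sketch of the macro-averaged MAE under the definition above: the absolute error is computed per true class and the per-class means are averaged with equal weight. This is illustrative only, not imblearn's :func:`macro_averaged_mean_absolute_error`.

```python
import numpy as np

def macro_averaged_mae(y_true, y_pred):
    """Average the per-class mean absolute errors with equal class weights."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [
        np.mean(np.abs(y_true[y_true == c] - y_pred[y_true == c]))
        for c in np.unique(y_true)
    ]
    return float(np.mean(per_class))

# Class 1 is mis-rated by one level half of the time; classes 2 and 3
# are predicted perfectly, so the macro average is 0.5 / 3.
score = macro_averaged_mae([1, 1, 1, 1, 2, 3], [1, 1, 2, 2, 2, 3])
```

Unlike a plain MAE, the frequent class 1 does not dominate the score: each class contributes one term to the final average.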
.. _classification_report:

Summary of important metrics
----------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`classification_report_imbalanced` function computes a set of metrics
per class and summarizes them in a table. The parameter `output_dict` allows
getting a Python dictionary back instead of a string. This dictionary can then
be reused to create a pandas dataframe, for instance.
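The `output_dict` pattern mirrors scikit-learn's :func:`sklearn.metrics.classification_report`, shown here as a stand-in sketch with made-up labels; :func:`classification_report_imbalanced` returns an analogous nested dictionary.

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]

# With output_dict=True the report is a nested dict: one entry per class
# label (as a string) plus aggregate rows, each mapping metric names to
# float values, ready to feed into pandas.DataFrame.
report = classification_report(y_true, y_pred, output_dict=True)
recall_minority = report["0"]["recall"]
```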

.. _pairwise_metrics:

Pairwise metrics
----------------

The :mod:`imblearn.metrics.pairwise` submodule implements pairwise distances
that are not available in scikit-learn but are used by some of the methods in
imbalanced-learn.

.. _vdm:

Value Difference Metric
~~~~~~~~~~~~~~~~~~~~~~~

The class :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric`
implements the Value Difference Metric proposed in
:cite:`stanfill1986toward`. This measure is used to compute the proximity
of two samples composed only of nominal values.

Given a single feature, categories whose correlations with the target are
similar will be considered closer. Let's illustrate this behaviour with the
example given in :cite:`wilson1997improved`: `X` is represented by a single
feature, the color of a fruit, and the target indicates whether or not each
sample is an apple::

>>> import numpy as np
>>> X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
>>> y = ["apple"] * 8 + ["not apple"] * 5 + ["apple"] * 7 + ["not apple"] * 9 + ["apple"]

In this dataset, the categories "red" and "green" are similarly correlated
with the target `y` and should therefore be closer to each other than to the
category "blue". We illustrate this behaviour below. Be aware that `X` needs
to be encoded with numerical values first::

>>> from sklearn.preprocessing import OrdinalEncoder
>>> encoder = OrdinalEncoder(dtype=np.int32)
>>> X_encoded = encoder.fit_transform(X)

Now, we can compute the distance between three different samples representing
the different categories::

>>> from imblearn.metrics.pairwise import ValueDifferenceMetric
>>> vdm = ValueDifferenceMetric().fit(X_encoded, y)
>>> X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
>>> X_test_encoded = encoder.transform(X_test)
>>> vdm.pairwise(X_test_encoded)
array([[ 0. , 0.04, 1.96],
[ 0.04, 0. , 1.44],
[ 1.96, 1.44, 0. ]])

We see that the distance is smallest when the categories "red" and "green"
are compared. When comparing with "blue", the distance is much larger.

**Mathematical formulation**

The distance between feature values of two samples is defined as:

.. math::
\delta(x, y) = \sum_{c=1}^{C} |p(c|x_{f}) - p(c|y_{f})|^{k} \ ,

where :math:`x` and :math:`y` are two samples, :math:`f` is a given
feature, :math:`C` is the number of classes, :math:`p(c|x_{f})` is the
conditional probability that the output class is :math:`c` given that
the feature :math:`f` has the value :math:`x_{f}`, and :math:`k` is an
exponent usually set to 1 or 2.

The distance for the feature vectors :math:`X` and :math:`Y` is
subsequently defined as:

.. math::
\Delta(X, Y) = \sum_{f=1}^{F} \delta(X_{f}, Y_{f})^{r} \ ,

where :math:`F` is the number of features and :math:`r` is an exponent
usually set to 1 or 2.
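With a single feature and the exponents matching the example output earlier in this section (:math:`k=1`, :math:`r=2`), the two formulas collapse to :math:`\Delta = \delta^2`. The following numpy sketch reproduces the pairwise distances of the apple example; it is an illustration of the math, not the imblearn implementation.

```python
import numpy as np

X = ["green"] * 10 + ["red"] * 10 + ["blue"] * 10
y = (["apple"] * 8 + ["not apple"] * 5 + ["apple"] * 7
     + ["not apple"] * 9 + ["apple"])
X, y = np.array(X), np.array(y)

classes, categories = np.unique(y), np.unique(X)
# Conditional probabilities p(c | category) estimated from the data.
proba = {cat: np.array([np.mean(y[X == cat] == c) for c in classes])
         for cat in categories}

def vdm(a, b, k=1, r=2):
    # delta(a, b) = sum_c |p(c|a) - p(c|b)|**k; with a single feature
    # the final distance is Delta(a, b) = delta(a, b)**r.
    return float(np.sum(np.abs(proba[a] - proba[b]) ** k) ** r)

test = ["green", "red", "blue"]
D = np.array([[vdm(a, b) for b in test] for a in test])
# D is close to [[0, 0.04, 1.96], [0.04, 0, 1.44], [1.96, 1.44, 0]]
```

The small green/red distance comes from their similar conditional probabilities of being an apple (0.8 vs 0.7), while blue (0.1) sits far from both.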
4 changes: 4 additions & 0 deletions doc/whats_new/v0.8.rst
@@ -15,6 +15,10 @@ New features
classification.
:pr:`780` by :user:`Aurélien Massiot <AurelienMassiot>`.

- Add the class :class:`imblearn.metrics.pairwise.ValueDifferenceMetric` to
compute pairwise distances between samples containing only nominal values.
:pr:`796` by :user:`Guillaume Lemaitre <glemaitre>`.

Enhancements
............

4 changes: 3 additions & 1 deletion imblearn/metrics/_classification.py
@@ -1,5 +1,7 @@
# coding: utf-8
"""Metrics to assess performance on classification task given class prediction
"""Metrics to assess performance on a classification task given class
predictions. The available metrics are complementary to the metrics available
in scikit-learn.

Functions named as ``*_score`` return a scalar value to maximize: the higher
the better