@@ -467,12 +467,32 @@ and the output a 3 nearest neighbors classifier. The class can be used as::
467467
468468.. _instance_hardness_threshold:
469469
470+ Additional undersampling techniques
471+ -----------------------------------
472+
470473Instance hardness threshold
471474^^^^^^^^^^^^^^^^^^^^^^^^^^^
472475
473- :class:`InstanceHardnessThreshold` is a specific algorithm in which a
474- classifier is trained on the data and the samples with lower probabilities are
475- removed :cite:`smith2014instance`. The class can be used as::
476+ **Instance hardness** is a measure of how difficult it is to classify an instance or
477+ observation correctly. Hard instances are therefore the observations that a classifier
478+ is most likely to misclassify.
479+
480+ Fundamentally, instances that are hard to classify correctly are those for which the
481+ learning algorithm or classifier produces a low probability of predicting the correct
482+ class label.
483+
484+ If we removed these hard instances from the dataset, the logic goes, we would help the
485+ classifier better identify the different classes :cite:`smith2014instance`.
486+
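As a rough illustration of the idea above, the sketch below scores each sample by one
minus the probability that a classifier assigned to its true class. This scoring rule,
the helper name ``instance_hardness`` and the probability values are assumptions made
for the example; they are not taken from the library.

```python
def instance_hardness(probabilities, labels):
    """Return 1 - P(true class) for each sample (illustrative definition).

    probabilities: list of dicts mapping class label -> predicted probability
    labels: list of true class labels, aligned with ``probabilities``
    """
    return [1.0 - proba[label] for proba, label in zip(probabilities, labels)]


# Made-up predicted probabilities for four samples of a two-class problem.
probabilities = [
    {0: 0.9, 1: 0.1},
    {0: 0.4, 1: 0.6},
    {0: 0.2, 1: 0.8},
    {0: 0.7, 1: 0.3},
]
labels = [0, 0, 1, 1]

hardness = instance_hardness(probabilities, labels)

# Rank sample indices from hardest to easiest.
hardest_first = sorted(range(len(labels)), key=lambda i: hardness[i], reverse=True)
print(hardest_first)
```

Samples 3 and 1 receive a low probability for their true class, so they rank as the
hardest and would be the first candidates for removal.
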
487+ :class:`InstanceHardnessThreshold` trains a classifier on the data and then removes the
488+ samples with the lowest class probabilities :cite:`smith2014instance`. In other words, it
489+ retains the observations with the higher class probabilities.
490+
491+ In our implementation, :class:`InstanceHardnessThreshold` is (almost) a controlled
492+ under-sampling method: it retains a user-specified number of observations of the target
493+ class(es) (see caveat below).
494+
495+ The class can be used as::
476496
477497 >>> from sklearn.linear_model import LogisticRegression
478498 >>> from imblearn.under_sampling import InstanceHardnessThreshold
@@ -483,18 +503,18 @@ removed :cite:`smith2014instance`. The class can be used as::
483503 >>> print(sorted(Counter(y_resampled).items()))
484504 [(0, 64), (1, 64), (2, 64)]
485505
486- This class has 2 important parameters. ``estimator`` will accept any
487- scikit-learn classifier which has a method ``predict_proba``. The classifier
488- training is performed using a cross-validation and the parameter ``cv`` can set
489- the number of folds to use.
506+ :class:`InstanceHardnessThreshold` has 2 important parameters. The parameter
507+ ``estimator`` accepts any scikit-learn classifier with a method ``predict_proba``.
508+ This classifier will be used to identify the hard instances. The training is performed
509+ with cross-validation, which can be configured through the parameter ``cv``.
490510
491511.. note::
492512
493513 :class:`InstanceHardnessThreshold` could almost be considered as a
494514 controlled under-sampling method. However, due to the probability outputs, it
495- is not always possible to get a specific number of samples.
515+ is not always possible to get the specified number of samples.
496516
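One way the requested number can be missed is through tied probabilities. The sketch
below assumes that samples are kept whenever their probability reaches a cut-off value;
the helper ``keep_by_cutoff`` and the numbers are hypothetical and only illustrate the
effect, not the library's actual code.

```python
def keep_by_cutoff(probabilities, n_wanted):
    """Keep every sample whose probability reaches the n_wanted-th largest value."""
    cutoff = sorted(probabilities, reverse=True)[n_wanted - 1]
    return [i for i, p in enumerate(probabilities) if p >= cutoff]


# Made-up true-class probabilities for six samples of a single class.
probabilities = [0.9, 0.8, 0.8, 0.8, 0.5, 0.3]

# Ask for 3 samples; three samples tie at the cut-off of 0.8.
kept = keep_by_cutoff(probabilities, n_wanted=3)
print(kept)
```

Because three samples share the cut-off probability, four samples are kept although
only three were requested.
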
497- The figure below gives another examples on some toy data.
517+ The figure below shows examples of instance hardness undersampling on a toy dataset.
498518
499519.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_006.png
500520 :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html