Overestimation of OOB score, probable bug in resampling? #655

@SvenWarnke

Description


When I calculate the out-of-bag score, it comes out quite high even when there is no relationship between the features and the labels. I assume something goes wrong in keeping track of which samples are out of bag for each tree, so samples end up being evaluated on trees where they were in fact in the bag.
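To illustrate what correct bookkeeping looks like, here is a minimal sketch (my own illustration, not imblearn's actual code): each sample may only be scored by trees whose bootstrap draw did not contain it. The function name `oob_score` and the hand-rolled forest below are hypothetical, built with plain scikit-learn decision trees.

```python
# Sketch of correct OOB bookkeeping (illustrative, not imblearn's code):
# a sample may only be scored by trees whose bootstrap sample did NOT
# contain it. Mixing up the index mapping (e.g. after resampling) lets
# trees score samples they trained on, which inflates the score.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_score(trees, bootstrap_indices, X, y, n_classes=2):
    """Score each sample using only the trees that did not train on it."""
    n = X.shape[0]
    votes = np.zeros((n, n_classes))
    seen = np.zeros(n, dtype=bool)      # samples with at least one OOB vote
    for tree, idx in zip(trees, bootstrap_indices):
        in_bag = np.zeros(n, dtype=bool)
        in_bag[idx] = True
        oob = ~in_bag                   # samples this tree never saw
        votes[oob] += tree.predict_proba(X[oob])
        seen |= oob
    pred = votes[seen].argmax(axis=1)
    return (pred == y[seen]).mean()

# Tiny hand-rolled forest on pure-noise labels: with correct bookkeeping
# the OOB score should sit near chance (0.5).
rng = np.random.RandomState(0)
X = np.arange(1000).reshape(-1, 1)
y = rng.binomial(1, 0.5, size=1000)
trees, boots = [], []
for i in range(25):
    idx = rng.randint(0, 1000, 1000)    # one bootstrap draw per tree
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))
    boots.append(idx)
print(oob_score(trees, boots, X, y))
```

If the `in_bag` mask were built against the wrong (post-resampling) index space, `oob` would include in-bag samples and the score would climb well above 0.5, which matches the symptom reported here.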

Steps/Code to Reproduce

Example:

import numpy as np
from imblearn import ensemble

X = np.arange(1000).reshape(-1, 1)
y = np.random.binomial(1, 0.5, size=1000)

rf = ensemble.BalancedRandomForestClassifier(oob_score=True)
rf.fit(X, y)
rf.oob_score_

The output is 0.838 (the exact value varies from run to run, since no seed is set and the labels are random).

Expected Results

Since there is no relationship between X and y (the y values are independent coin flips), the OOB score should be around 0.5.
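As a sanity check supporting this expectation (my own comparison, with seeds added for repeatability, not part of the original report): feeding identically constructed noise data to scikit-learn's plain `RandomForestClassifier`, which tracks in-bag indices correctly, yields an `oob_score_` near chance.

```python
# Comparison sketch: scikit-learn's plain RandomForestClassifier on the
# same kind of label-free data. Its OOB bookkeeping is correct, so the
# OOB score hovers around 0.5.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)          # seed added for repeatability
X = np.arange(1000).reshape(-1, 1)
y = rng.binomial(1, 0.5, size=1000)     # independent coin flips

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # near 0.5, as expected for random labels
```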

Actual Results

Something in the range of 0.8, which on a sample size of 1000 is far too high to be a chance deviation from 0.5.

Versions

Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
NumPy 1.16.5
SciPy 1.3.1
Scikit-Learn 0.21.3
Imbalanced-Learn 0.5.0


Labels

Type: Bug
