-
-
Notifications
You must be signed in to change notification settings - Fork 260
Description
What happened:
Test failures when fitting sklearn estimators with dataframes. As of scikit-learn=1.0
, all estimators store feature_names_in_
when fitted on dataframes and column name consistency checks issue a FutureWarning
when column names are not consistent with the X
columns used to fit. dask-ml
's pytest configuration fails tests with sklearn warnings.
What you expected to happen:
Tests to pass with scikit-learn=1.0
and dask-ml
should be updated to hand dataframes contently with scikit-learn>=1.0
Minimal Complete Verifiable Example:
From tests/test_partial.py, one of the failing tests.
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
ddf = dd.from_pandas(df, npartitions=2)
with dask.config.set(scheduler="single-threaded"):
sgd = SGDClassifier(max_iter=5, tol=1e-3)
sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])
sol = sgd.predict(df[["x"]])
result = predict(sgd, ddf[["x"]])
Should result in the following
_________________________________________________________________________________________ test_dataframes _________________________________________________________________________________________
def test_dataframes():
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
ddf = dd.from_pandas(df, npartitions=2)
with dask.config.set(scheduler="single-threaded"):
sgd = SGDClassifier(max_iter=5, tol=1e-3)
sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])
sol = sgd.predict(df[["x"]])
> result = predict(sgd, ddf[["x"]])
tests/test_partial.py:103:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask_ml/_partial.py:183: in predict
dt = model.predict(xx).dtype
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:425: in predict
scores = self.decision_function(X)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:407: in decision_function
X = self._validate_data(X, accept_sparse="csr", reset=False)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/base.py:543: in _validate_data
self._check_feature_names(X, reset=reset)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = SGDClassifier(max_iter=5), X = array([[0]])
def _check_feature_names(self, X, *, reset):
"""Set or check the `feature_names_in_` attribute.
.. versionadded:: 1.0
Parameters
----------
X : {ndarray, dataframe} of shape (n_samples, n_features)
The input samples.
reset : bool
Whether to reset the `feature_names_in_` attribute.
If False, the input will be checked for consistency with
feature names of data provided when reset was last True.
.. note::
It is recommended to call `reset=True` in `fit` and in the first
call to `partial_fit`. All other methods that validate `X`
should set `reset=False`.
"""
if reset:
feature_names_in = _get_feature_names(X)
if feature_names_in is not None:
self.feature_names_in_ = feature_names_in
return
fitted_feature_names = getattr(self, "feature_names_in_", None)
X_feature_names = _get_feature_names(X)
if fitted_feature_names is None and X_feature_names is None:
# no feature names seen in fit and in X
return
if X_feature_names is not None and fitted_feature_names is None:
warnings.warn(
f"X has feature names, but {self.__class__.__name__} was fitted without"
" feature names"
)
return
if X_feature_names is None and fitted_feature_names is not None:
> warnings.warn(
"X does not have valid feature names, but"
f" {self.__class__.__name__} was fitted with feature names"
)
E UserWarning: X does not have valid feature names, but SGDClassifier was fitted with feature names
Anything else we need to know?:
In this case, the problem comes from the way dataframes are coerced to arrays in partial.predict. In scikit-learn=1.0
, the model
is expecting X
to be a dataframe.
Environment:
- Dask version: 2021.9.1
- Dask-ML version: latest from main
- Python version: 3.8.12
- Operating System: Ubuntu 21.04
- Install method (conda, pip, source): conda