Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0  #858

@mmccarty

What happened:
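Tests fail when fitting sklearn estimators with dataframes. As of scikit-learn=1.0, all estimators store feature_names_in_ when fitted on dataframes, and the feature-name consistency check issues a warning when the input passed to later methods is not consistent with the X columns used to fit: a FutureWarning when the column names differ, and a UserWarning when the input has no feature names at all, which is the case in the traceback below. dask-ml's pytest configuration fails any test that emits an sklearn warning.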
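The interaction between the warning and the test configuration can be sketched with the standard library alone. This is only an illustration under assumptions: `noisy_predict` is a hypothetical stand-in for an estimator method, and the `"error"` filter approximates a pytest setup that escalates warnings to errors (the exact dask-ml configuration is not shown in this issue).

```python
import warnings


def noisy_predict():
    # Hypothetical stand-in for an estimator method that emits the sklearn warning.
    # warnings.warn uses UserWarning when no category is given.
    warnings.warn(
        "X does not have valid feature names, but SGDClassifier was fitted "
        "with feature names"
    )
    return [0]


def run_with_warnings_as_errors(func):
    """Call func with every warning escalated to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # assumed equivalent of the test config
        try:
            return "passed", func()
        except UserWarning as exc:
            return "failed", str(exc)


status, detail = run_with_warnings_as_errors(noisy_predict)
print(status, "-", detail)
```

Under that filter the call never returns its prediction; the warning itself becomes the test failure.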

What you expected to happen:
Tests should pass with scikit-learn=1.0, and dask-ml should be updated to handle dataframes consistently with scikit-learn>=1.0.

Minimal Complete Verifiable Example:
From tests/test_partial.py, one of the failing tests.

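import dask
import dask.dataframe as dd
import pandas as pd
from sklearn.linear_model import SGDClassifier

from dask_ml._partial import fit, predict

df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
ddf = dd.from_pandas(df, npartitions=2)

with dask.config.set(scheduler="single-threaded"):
    sgd = SGDClassifier(max_iter=5, tol=1e-3)

    # Fitting on a dask dataframe records feature_names_in_ on the estimator.
    sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])

    sol = sgd.predict(df[["x"]])
    # predict() hands sklearn a plain array, which triggers the UserWarning.
    result = predict(sgd, ddf[["x"]])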

This results in the following failure:

_________________________________________________________________________________________ test_dataframes _________________________________________________________________________________________

    def test_dataframes():
        df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
        ddf = dd.from_pandas(df, npartitions=2)
    
        with dask.config.set(scheduler="single-threaded"):
            sgd = SGDClassifier(max_iter=5, tol=1e-3)
    
            sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])
    
            sol = sgd.predict(df[["x"]])
>           result = predict(sgd, ddf[["x"]])

tests/test_partial.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask_ml/_partial.py:183: in predict
    dt = model.predict(xx).dtype
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:425: in predict
    scores = self.decision_function(X)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:407: in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/base.py:543: in _validate_data
    self._check_feature_names(X, reset=reset)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = SGDClassifier(max_iter=5), X = array([[0]])

    def _check_feature_names(self, X, *, reset):
        """Set or check the `feature_names_in_` attribute.
    
        .. versionadded:: 1.0
    
        Parameters
        ----------
        X : {ndarray, dataframe} of shape (n_samples, n_features)
            The input samples.
    
        reset : bool
            Whether to reset the `feature_names_in_` attribute.
            If False, the input will be checked for consistency with
            feature names of data provided when reset was last True.
            .. note::
               It is recommended to call `reset=True` in `fit` and in the first
               call to `partial_fit`. All other methods that validate `X`
               should set `reset=False`.
        """
    
        if reset:
            feature_names_in = _get_feature_names(X)
            if feature_names_in is not None:
                self.feature_names_in_ = feature_names_in
            return
    
        fitted_feature_names = getattr(self, "feature_names_in_", None)
        X_feature_names = _get_feature_names(X)
    
        if fitted_feature_names is None and X_feature_names is None:
            # no feature names seen in fit and in X
            return
    
        if X_feature_names is not None and fitted_feature_names is None:
            warnings.warn(
                f"X has feature names, but {self.__class__.__name__} was fitted without"
                " feature names"
            )
            return
    
        if X_feature_names is None and fitted_feature_names is not None:
>           warnings.warn(
                "X does not have valid feature names, but"
                f" {self.__class__.__name__} was fitted with feature names"
            )
E           UserWarning: X does not have valid feature names, but SGDClassifier was fitted with feature names

Anything else we need to know?:
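In this case, the problem comes from the way dataframes are coerced to arrays in predict (dask_ml/_partial.py). The traceback shows X arriving as a plain array (X = array([[0]])), while in scikit-learn=1.0 a model fitted on a dataframe expects X to be a dataframe with the same column names.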

Environment:

  • Dask version: 2021.9.1
  • Dask-ML version: latest from main
  • Python version: 3.8.12
  • Operating System: Ubuntu 21.04
  • Install method (conda, pip, source): conda
