Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

**What happened**:
Tests fail when fitting sklearn estimators with dataframes. As of `scikit-learn=1.0`, all estimators store `feature_names_in_` when fitted on dataframes, and the [column name consistency checks](https://github.com/scikit-learn/scikit-learn/pull/18010) issue a `FutureWarning` when column names are not consistent with the `X` columns used to fit, and a `UserWarning` when `X` has no feature names at all, as in the traceback below. `dask-ml`'s pytest configuration treats sklearn warnings as test failures.

**What you expected to happen**:
Tests should pass with `scikit-learn=1.0`, and `dask-ml` should be updated to handle dataframes consistently with `scikit-learn>=1.0`.

**Minimal Complete Verifiable Example**:
From [tests/test_partial.py](https://github.com/dask/dask-ml/blob/main/tests/test_partial.py#L92), one of the failing tests.
```python
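import dask
import dask.dataframe as dd
import pandas as pd
from sklearn.linear_model import SGDClassifier

from dask_ml._partial import fit, predict

df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})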
ddf = dd.from_pandas(df, npartitions=2)

with dask.config.set(scheduler="single-threaded"):
    sgd = SGDClassifier(max_iter=5, tol=1e-3)

    sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])

    sol = sgd.predict(df[["x"]])
    result = predict(sgd, ddf[["x"]])
```

Running the test results in the following failure:
```
_________________________________________________________________________________________ test_dataframes _________________________________________________________________________________________

    def test_dataframes():
        df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
        ddf = dd.from_pandas(df, npartitions=2)
    
        with dask.config.set(scheduler="single-threaded"):
            sgd = SGDClassifier(max_iter=5, tol=1e-3)
    
            sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])
    
            sol = sgd.predict(df[["x"]])
>           result = predict(sgd, ddf[["x"]])

tests/test_partial.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask_ml/_partial.py:183: in predict
    dt = model.predict(xx).dtype
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:425: in predict
    scores = self.decision_function(X)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:407: in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/base.py:543: in _validate_data
    self._check_feature_names(X, reset=reset)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = SGDClassifier(max_iter=5), X = array([[0]])

    def _check_feature_names(self, X, *, reset):
        """Set or check the `feature_names_in_` attribute.
    
        .. versionadded:: 1.0
    
        Parameters
        ----------
        X : {ndarray, dataframe} of shape (n_samples, n_features)
            The input samples.
    
        reset : bool
            Whether to reset the `feature_names_in_` attribute.
            If False, the input will be checked for consistency with
            feature names of data provided when reset was last True.
            .. note::
               It is recommended to call `reset=True` in `fit` and in the first
               call to `partial_fit`. All other methods that validate `X`
               should set `reset=False`.
        """
    
        if reset:
            feature_names_in = _get_feature_names(X)
            if feature_names_in is not None:
                self.feature_names_in_ = feature_names_in
            return
    
        fitted_feature_names = getattr(self, "feature_names_in_", None)
        X_feature_names = _get_feature_names(X)
    
        if fitted_feature_names is None and X_feature_names is None:
            # no feature names seen in fit and in X
            return
    
        if X_feature_names is not None and fitted_feature_names is None:
            warnings.warn(
                f"X has feature names, but {self.__class__.__name__} was fitted without"
                " feature names"
            )
            return
    
        if X_feature_names is None and fitted_feature_names is not None:
>           warnings.warn(
                "X does not have valid feature names, but"
                f" {self.__class__.__name__} was fitted with feature names"
            )
E           UserWarning: X does not have valid feature names, but SGDClassifier was fitted with feature names
```
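
The traceback above ends in a `UserWarning` rather than an exception raised by sklearn itself. A rough sketch of how a warning can fail a test, assuming the pytest configuration behaves like Python's standard `error` warning filter (`noisy_predict` is a hypothetical stand-in, not a `dask-ml` or sklearn function):

```python
import warnings


def noisy_predict():
    # Hypothetical stand-in for an estimator method that emits the warning.
    warnings.warn(
        "X does not have valid feature names, but SGDClassifier was fitted "
        "with feature names",
        UserWarning,
    )
    return "prediction"


with warnings.catch_warnings():
    # With the "error" action, a matching warning is raised as an exception.
    warnings.simplefilter("error", UserWarning)
    try:
        noisy_predict()
        raised = False
    except UserWarning:
        raised = True
```

With the default filters the same call only prints the warning and returns normally.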

**Anything else we need to know?**:
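In this case, the problem comes from the way dataframes are coerced to arrays in [partial.predict](https://github.com/dask/dask-ml/blob/main/dask_ml/_partial.py#L182). Under `scikit-learn=1.0`, a `model` fitted on a dataframe expects `X` at predict time to be a dataframe with the same column names, but here it receives a plain array (`X = array([[0]])` in the traceback above).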

**Environment**:

- Dask version: 2021.9.1
- Dask-ML version: latest from main
- Python version: 3.8.12
- Operating System: Ubuntu 21.04
- Install method (conda, pip, source): conda

