Description
In #886, we removed all dependencies on Dask-ML in favor of scikit-learn, cuML, and our own classes (`ParallelPostFit` and `Incremental`). Previously, when creating an experiment, `experiment_class` was expected to be a path to a `dask_ml` class, but `sklearn` classes were also found to be compatible. However, I couldn't get it to work with `cuml`, such as with `cuml.model_selection.GridSearchCV`. For example:
```python
c.sql(
    """
    CREATE EXPERIMENT my_exp WITH (
        model_class = 'sklearn.ensemble.GradientBoostingClassifier',
        experiment_class = 'cuml.model_selection.GridSearchCV',
        tune_parameters = (n_estimators = ARRAY [16, 32, 2],
                           learning_rate = ARRAY [0.1, 0.01, 0.001],
                           max_depth = ARRAY [3, 4, 5, 10]),
        target_column = 'target'
    ) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)
```
errors with:

```
INFO:dask_sql.physical.rel.custom.create_experiment:{'n_estimators': [16, 32, 2], 'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 4, 5, 10]}
INFO:dask_sql.physical.rel.custom.create_experiment:{}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 c.sql(
2 """
3 CREATE EXPERIMENT my_exp WITH (
4 model_class = 'sklearn.ensemble.GradientBoostingClassifier',
5 experiment_class = 'cuml.model_selection.GridSearchCV',
6 tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
7 max_depth = ARRAY [3,4,5,10]),
8 target_column = 'target'
9 ) AS (
10 SELECT x, y, x*y > 0 AS target
11 FROM timeseries
12 LIMIT 100
13 )
14 """
15 )
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:501, in Context.sql(self, sql, return_futures, dataframes, gpu, config_options)
496 else:
497 raise RuntimeError(
498 f"Encountered unsupported `LogicalPlan` sql type: {type(sql)}"
499 )
--> 501 return self._compute_table_from_rel(rel, return_futures)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:830, in Context._compute_table_from_rel(self, rel, return_futures)
829 def _compute_table_from_rel(self, rel: "LogicalPlan", return_futures: bool = True):
--> 830 dc = RelConverter.convert(rel, context=self)
832 # Optimization might remove some alias projects. Make sure to keep them here.
833 select_names = [field for field in rel.getRowType().getFieldList()]
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/convert.py:61, in RelConverter.convert(cls, rel, context)
55 raise NotImplementedError(
56 f"No relational conversion for node type {node_type} available (yet)."
57 )
58 logger.debug(
59 f"Processing REL {rel} using {plugin_instance.__class__.__name__}..."
60 )
---> 61 df = plugin_instance.convert(rel, context=context)
62 logger.debug(f"Processed REL {rel} into {LoggableDataFrame(df)}")
63 return df
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/custom/create_experiment.py:169, in CreateExperimentPlugin.convert(self, rel, context)
167 search = ExperimentClass(model, {**parameters}, **experiment_kwargs)
168 logger.info(tune_fit_kwargs)
--> 169 search.fit(
170 X.to_dask_array(lengths=True),
171 y.to_dask_array(lengths=True),
172 **tune_fit_kwargs,
173 )
174 df = pd.DataFrame(search.cv_results_)
175 df["model_class"] = model_class
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_search.py:786, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
783 X, y, groups = indexable(X, y, groups)
784 fit_params = _check_fit_params(X, fit_params)
--> 786 cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
787 n_splits = cv_orig.get_n_splits(X, y, groups)
789 base_estimator = clone(self.estimator)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_split.py:2331, in check_cv(cv, y, classifier)
2326 cv = 5 if cv is None else cv
2327 if isinstance(cv, numbers.Integral):
2328 if (
2329 classifier
2330 and (y is not None)
-> 2331 and (type_of_target(y, input_name="y") in ("binary", "multiclass"))
2332 ):
2333 return StratifiedKFold(cv)
2334 else:
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:286, in type_of_target(y, input_name)
283 if sparse_pandas:
284 raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
--> 286 if is_multilabel(y):
287 return "multilabel-indicator"
289 # DeprecationWarning will be replaced by ValueError, see NEP 34
290 # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:152, in is_multilabel(y)
150 warnings.simplefilter("error", np.VisibleDeprecationWarning)
151 try:
--> 152 y = np.asarray(y)
153 except (np.VisibleDeprecationWarning, ValueError):
154 # dtype=object should be provided explicitly for ragged arrays,
155 # see NEP 34
156 y = np.array(y, dtype=object)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask/array/core.py:1704, in Array.__array__(self, dtype, **kwargs)
1702 x = x.astype(dtype)
1703 if not isinstance(x, np.ndarray):
-> 1704 x = np.array(x)
1705 return x
File cupy/_core/core.pyx:1473, in cupy._core.core._ndarray_base.__array__()
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
```
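For reference, the conversion failure reproduces outside of dask-sql. Here's a minimal sketch (assuming CuPy, Dask, and scikit-learn are installed and a GPU is available) of the same implicit `np.asarray` conversion that `sklearn.utils.multiclass.is_multilabel` attempts on the target array:

```python
import cupy as cp
import dask.array as da
import numpy as np

# A CuPy-backed Dask array, analogous to what y.to_dask_array(lengths=True)
# produces from a dask-cudf series in create_experiment.py:
y = da.from_array(cp.asarray([0, 1, 1, 0]), chunks=2)

try:
    # The same implicit conversion sklearn's is_multilabel() performs:
    np.asarray(y)
except TypeError as e:
    print(e)  # Implicit conversion to a NumPy array is not allowed. ...
```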
Using `model_class = 'xgboost.XGBClassifier'` or `model_class = 'xgboost.dask.XGBClassifier'` results in the same error as above.
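That's consistent with the root cause being the data rather than the model: the arrays handed to `search.fit` are CuPy-backed no matter which `model_class` is used. One possible stopgap (a hypothetical sketch, not something `create_experiment.py` does today) would be to move the chunks to host memory before fitting, e.g. with `cupy.asnumpy`:

```python
import cupy as cp

# X, y, search, and tune_fit_kwargs as in create_experiment.py; convert each
# CuPy chunk to a NumPy chunk so NumPy-only code paths in scikit-learn
# (check_cv, type_of_target, ...) can consume the arrays:
X_arr = X.to_dask_array(lengths=True).map_blocks(cp.asnumpy)
y_arr = y.to_dask_array(lengths=True).map_blocks(cp.asnumpy)
search.fit(X_arr, y_arr, **tune_fit_kwargs)
```

That defeats the purpose of keeping the data on GPU, though, so it's not a real fix.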
When I try it with a `model_class` from cuML, more errors arise. For example, if I try it with `model_class = 'cuml.dask.ensemble.RandomForestClassifier'` (cuML has no `GradientBoostingClassifier`), scikit-learn raises a `TypeError`:

```
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator <cuml.dask.ensemble.randomforestclassifier.RandomForestClassifier object at 0x7f0c5f692820> does not.
```
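Since sklearn only runs that check when no `scoring` is specified (it comes from `check_scoring`), explicitly passing a scorer might sidestep this particular error. A hedged, standalone sketch (I haven't verified this end to end; `cuml.dask` estimators also need a running Dask client):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from sklearn.model_selection import GridSearchCV
from cuml.dask.ensemble import RandomForestClassifier

client = Client(LocalCUDACluster())  # cuml.dask estimators require a client

# An explicit scoring= avoids the "should have a 'score' method" check,
# though the fit may still hit other sklearn/cuML incompatibilities:
search = GridSearchCV(
    RandomForestClassifier(client=client),
    {"max_depth": [3, 4, 5, 10]},
    scoring="accuracy",
)
```

Presumably this could be threaded through the `**experiment_kwargs` that `create_experiment.py` already forwards to the experiment class, though I haven't checked the SQL syntax for that.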
I tried a couple of different changes on the Dask-SQL side but have yet to find a solution. It's possible that this will require changes on the Dask and/or cuML side of things.