Closed
I'm trying to understand the behavior (and intent) of the handle_unknown option for OneHotEncoder (and, by extension, OrdinalEncoder). The docs imply that this option should control NaN handling, but the examples below seem to indicate otherwise (category_encoders==1.2.8).
In [2]: import pandas as pd
   ...: import numpy as np
   ...: from category_encoders import OneHotEncoder
   ...:

In [3]: X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
   ...:                   'b': ['qux', np.nan, 'foo']})
   ...: X
   ...:
Out[3]:
     a    b
0  foo  qux
1  bar  NaN
2  bar  foo
In [4]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore',
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...:
Out[4]:
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1
In [5]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='impute',
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...:
Out[5]:
   a_foo  a_bar  a_-1  b_qux  b_nan  b_foo  b_-1
0      1      0     0      1      0      0     0
1      0      1     0      0      1      0     0
2      0      1     0      0      0      1     0
In [6]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='error',
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...:
Out[6]:
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1
In [7]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore',
   ...:                         impute_missing=False, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...:
Out[7]:
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1
In particular, 'error' and 'ignore' give the same behavior: both treat missing observations as just another category. 'impute' adds constant zero-valued columns (a_-1, b_-1) but also treats missing observations as another category. Naively, I would have expected behavior similar to pd.get_dummies(X, dummy_na={True|False}), with handle_unknown='ignore' corresponding to dummy_na=False.
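For comparison, here is a minimal sketch of the pd.get_dummies behavior I had in mind, using the same frame as above. It only uses pandas and numpy; the variable names with_na and without_na are mine, dtype=int is passed just to get 0/1 output, and the exact column order may differ across pandas versions.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the session above.
X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
                  'b': ['qux', np.nan, 'foo']})

# dummy_na=True: NaN gets its own indicator column (for every encoded column).
with_na = pd.get_dummies(X, dummy_na=True, dtype=int)

# dummy_na=False: no NaN column; a missing value becomes an all-zero row
# within that feature's indicator columns.
without_na = pd.get_dummies(X, dummy_na=False, dtype=int)

print(with_na)
print(without_na)
```

With dummy_na=False, row 1 has zeros in every b_* column, which is what I expected handle_unknown='ignore' to produce for the missing value.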