Behavior of OneHotEncoder handle_unknown option

I'm trying to understand the behavior (and intent) of the `handle_unknown` option for `OneHotEncoder` (and, by extension, `OrdinalEncoder`). The docs imply that it should control NaN handling, but the examples below seem to indicate otherwise (category_encoders==1.2.8):

```python
In [2]: import pandas as pd
   ...: import numpy as np
   ...: from category_encoders import OneHotEncoder
   ...: 

In [3]: X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
   ...:                   'b': ['qux', np.nan, 'foo']})
   ...: X
   ...: 
Out[3]: 
     a    b
0  foo  qux
1  bar  NaN
2  bar  foo

In [4]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[4]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [5]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='impute', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[5]: 
   a_foo  a_bar  a_-1  b_qux  b_nan  b_foo  b_-1
0      1      0     0      1      0      0     0
1      0      1     0      0      1      0     0
2      0      1     0      0      0      1     0

In [6]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='error', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[6]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [7]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=False, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[7]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

```   
In particular, `'error'` and `'ignore'` give the same behavior: both treat missing observations as just another category. `'impute'` adds constant zero-valued `-1` columns but still treats missing observations as another category. Naively, I would have expected behavior similar to `pd.get_dummies(X, dummy_na={True|False})`, with `handle_unknown='ignore'` corresponding to `dummy_na=False`.
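
For comparison, here is a minimal sketch of the `pd.get_dummies` behavior referred to above, using the same `X`. It only illustrates pandas and says nothing about how category_encoders is implemented; the names `with_na` and `without_na` are just for illustration.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
                  'b': ['qux', np.nan, 'foo']})

# dummy_na=True: NaN gets its own indicator column for each encoded column
with_na = pd.get_dummies(X, dummy_na=True)

# dummy_na=False (the default): no NaN column; a row with a missing value
# is all zeros across that column's dummies
without_na = pd.get_dummies(X, dummy_na=False)

print(with_na.columns.tolist())
print(without_na.columns.tolist())
```

With `dummy_na=False`, row 1 has no active `b_*` indicator at all, which is what I expected `handle_unknown='ignore'` to produce.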
