Behavior of OneHotEncoder handle_unknown option #92

@multiloc

I'm trying to understand the behavior (and intent) of the handle_unknown option for OneHotEncoder (and, by extension, OrdinalEncoder). The docs imply that it should control NaN handling, but the examples below seem to indicate otherwise (category_encoders==1.2.8):

In [2]: import pandas as pd
   ...: import numpy as np
   ...: from category_encoders import OneHotEncoder
   ...: 

In [3]: X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
   ...:                   'b': ['qux', np.nan, 'foo']})
   ...: X
   ...: 
Out[3]: 
     a    b
0  foo  qux
1  bar  NaN
2  bar  foo

In [4]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[4]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [5]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='impute', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[5]: 
   a_foo  a_bar  a_-1  b_qux  b_nan  b_foo  b_-1
0      1      0     0      1      0      0     0
1      0      1     0      0      1      0     0
2      0      1     0      0      0      1     0

In [6]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='error', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[6]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [7]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=False, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[7]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In particular, 'error' and 'ignore' give the same behavior: both treat missing observations as just another category. 'impute' adds constant zero-valued columns (a_-1 and b_-1) but also treats missing observations as another category. Naively, I would have expected behavior similar to pd.get_dummies(X, dummy_na={True|False}), with handle_unknown='ignore' corresponding to dummy_na=False.
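
For comparison, here is a minimal sketch of the pd.get_dummies behavior referred to above, using the same X. It only illustrates the pandas side of the comparison; it does not say how category_encoders is meant to behave. The variable names without_nan and with_nan are made up for this example.

```python
import numpy as np
import pandas as pd

# Same frame as in the session above.
X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
                  'b': ['qux', np.nan, 'foo']})

# dummy_na=False: no indicator column for NaN; the row with the
# missing value gets all zeros across the 'b' indicator columns.
without_nan = pd.get_dummies(X, dummy_na=False)

# dummy_na=True: NaN gets its own indicator column.
with_nan = pd.get_dummies(X, dummy_na=True)

print(without_nan)
print(with_nan)
```

With dummy_na=False, row 1 has no active 'b' indicator at all, which is the behavior I would have expected from handle_unknown='ignore'.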
