Skip to content

Conversation

@smithto1
Copy link
Contributor

Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, .sum() and .count() would return NaN for the missing categories, but they are expected to return 0 for the missing categories. Both these bugs are fixed.

Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting sum() to return NaN; these have been updated so that the tests now expect to get 0 (which is the desired behaviour).

Pivot
The change to .sum() also impacts the df.pivot_table() if it is called with aggfunc=sum and is pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). One test on test_pivot.py was updated to reflect this change.

@smithto1
Copy link
Contributor Author

@jreback Apologies for the PR spam, but because my new work was on a new branch I couldn't push to the old PR.

Please have a reivew of this one.

@smithto1
Copy link
Contributor Author

GroupBy.apply(sum)
One issue to note is that GroupBy.agg(sum) and GroupBy.apply(sum) now produce different outputs. .agg(sum) produces the desired behaviour of returning 0 for missing categories, while .apply(sum) still has the old behaviour of returning NaN.

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using branch from this PR:
# .sum() returns zeros 
print( df_grp.sum() )

# agg returns zeros  
print( df_grp.agg(sum) )

# apply still returns NaN 
print( df_grp.apply(sum) )

Calling .apply(sum) never actually calls the GroupBy.sum() method, it just applies the passed in function. .agg(sum) does call the GroupBy.sum() method (happening here https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L253). Thes means that .agg(sum) does get the special behaviour to return zero for missing categories, while .apply(sum) misses out on the special behaviour and still returns NaN.

This change necessitated an update to groupby\test_categorical.py::test_seriesgroupby_observed_false_or_none.

@jreback , what are your thoughts on this? Is it fine to leave .apply(sum) as it is (returning NaN for missing categories).

I am not inclined to address it in this PR, but if we do want it fixed, I'll raise an issue for it.

@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 16, 2020
@jreback
Copy link
Contributor

jreback commented Jul 16, 2020

can u show the output of #35280 (comment) here (and code as welll)

@smithto1
Copy link
Contributor Author

smithto1 commented Jul 16, 2020

can u show the output of #35280 (comment) here (and code as welll)

@jreback

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using branch from this PR:
# .sum() returns zeros 
print( df_grp.sum() )

#              value
# cat_1 cat_2
# A     1        0.2
#       2        0.0
# B     1        0.2
#       2        0.0
# C     1        0.0
#       2        0.0

# agg returns zeros  
print( df_grp.agg(sum) )

#              value
# cat_1 cat_2
# A     1        0.2
#       2        0.0
# B     1        0.2
#       2        0.0
# C     1        0.0
#       2        0.0

# apply still returns NaN 
print( df_grp.apply(sum) )

#              value
# cat_1 cat_2
# A     1        0.2
#       2        NaN
# B     1        0.2
#       2        NaN
# C     1        NaN
#       2        NaN

@smithto1 smithto1 requested a review from jreback July 16, 2020 20:08
@smithto1
Copy link
Contributor Author

@jreback I think all issues are addressed here. Can you do a second review?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. doc updates.

@jreback jreback requested a review from TomAugspurger August 4, 2020 00:22
@jreback jreback added this to the 1.2 milestone Aug 4, 2020
@jreback
Copy link
Contributor

jreback commented Aug 6, 2020

lgtm. pls merge master and ping on green.

@jreback
Copy link
Contributor

jreback commented Aug 7, 2020

cc @TomAugspurger if you can give a glance.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smithto1. Looks good.

@jreback jreback merged commit a21ce87 into pandas-dev:master Aug 7, 2020
@jreback
Copy link
Contributor

jreback commented Aug 7, 2020

thanks @smithto1 very nice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: df.groupby().count() returns NaN instead of Zero Inconsistent behavior when groupby pandas Categorical variables

3 participants