-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: GroupBy.count() and GroupBy.sum() incorreclty return NaN instead of 0 for missing categories #35280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jreback Apologies for the PR spam, but because my new work was on a new branch I couldn't push to the old PR. Please have a reivew of this one. |
|
GroupBy.apply(sum) Calling This change necessitated an update to @jreback , what are your thoughts on this? Is it fine to leave I am not inclined to address it in this PR, but if we do want it fixed, I'll raise an issue for it. |
|
can u show the output of #35280 (comment) here (and code as welll) |
|
|
@jreback I think all issues are addressed here. Can you do a second review? |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. doc updates.
|
lgtm. pls merge master and ping on green. |
|
cc @TomAugspurger if you can give a glance. |
TomAugspurger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smithto1. Looks good.
|
thanks @smithto1 very nice |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffBehavioural Changes
Fixing two related bugs: when grouping on multiple categoricals,
.sum()and.count()would returnNaNfor the missing categories, but they are expected to return0for the missing categories. Both these bugs are fixed.Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an
xfail. For this PR thexfailsare removed and the tests are passing normally. As well, a few other existing tests were expectingsum()to returnNaN; these have been updated so that the tests now expect to get0(which is the desired behaviour).Pivot
The change to
.sum()also impacts thedf.pivot_table()if it is called withaggfunc=sumand is pivoted on a Categorical column withobserved=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, notNaN). One test on test_pivot.py was updated to reflect this change.