BUG: GroupBy.count() and GroupBy.sum() incorrectly return NaN instead of 0 for missing categories (Version 2) #35241
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Behavioural Changes
This PR fixes two related bugs: when grouping on multiple categoricals, .sum() and .count() returned NaN for the missing categories, whereas they are expected to return 0.
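As a rough illustration of the behaviour described above, the following sketch groups on two categorical columns with observed=False. The frame and the column names (cat_1, cat_2, value) are made up for this example and are not taken from the PR or its tests; the zeros assume a pandas version that includes this fix.

```python
import pandas as pd

# Hypothetical data: category "c" and the pair ("b", "y") never occur.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "cat_2": pd.Categorical(["x", "y", "x"], categories=["x", "y"]),
        "value": [1, 2, 3],
    }
)

# observed=False keeps every combination of categories in the result,
# including combinations that have no rows in the data.
grouped = df.groupby(["cat_1", "cat_2"], observed=False)["value"]

sums = grouped.sum()      # missing combinations are expected to be 0, not NaN
counts = grouped.count()  # missing combinations are expected to be 0, not NaN

print(sums)
print(counts)
```

With three categories in cat_1 and two in cat_2, both results have six rows, one per combination.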
Tests
Tests were added in PR #35022 when these bugs were discovered, and they were marked with an xfail. For this PR the xfails are removed and the tests pass normally. In addition, a few other existing tests expected sum() to return NaN; these have been updated so that they now expect 0 (which is the desired behaviour).
One new test is added to ensure that the exception handling of the new try-except-finally block behaves as expected.
df.pivot_table
The changes to .sum() and .count() also affect df.pivot_table() when it is called with aggfunc=sum/count and pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). Two tests in test_pivot.py were updated to reflect this change.
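The effect on pivot_table can be sketched in the same way. This is a minimal example with invented data and column names (row_cat, col_cat, value), not one of the updated tests from test_pivot.py, and it assumes a pandas version in which the fix is present.

```python
import pandas as pd

# Hypothetical data: row category "c" has no rows at all,
# and the pair ("b", "y") is never observed.
data = pd.DataFrame(
    {
        "row_cat": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "col_cat": pd.Categorical(["x", "y", "x"], categories=["x", "y"]),
        "value": [1, 2, 3],
    }
)

# Pivot on the categorical columns and keep unobserved categories.
table = data.pivot_table(
    index="row_cat",
    columns="col_cat",
    values="value",
    aggfunc="sum",
    observed=False,
)

# Cells for unobserved combinations are expected to hold 0 rather than NaN.
print(table)
```

The same call with aggfunc="count" is expected to follow the same pattern.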