Skip to content
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -890,6 +890,7 @@ Groupby/Resample/Rolling
- Bug in :func:`DataFrame.groupby` passing the `on=` kwarg, and subsequently using ``.apply()`` (:issue:`17813`)
- Bug in :func:`DataFrame.resample().aggregate` not raising a ``KeyError`` when aggregating a non-existent column (:issue:`16766`, :issue:`19566`)
- Fixed a performance regression for ``GroupBy.nth`` and ``GroupBy.last`` with some object columns (:issue:`19283`)
- Bug in :func:`DataFrame.groupby.cumsum` and :func:`DataFrame.groupby.cumprod` where skipna is not an option
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be

:meth:`DataFrameGroupBy.cumsum`

likewise for cumprod.

And skipna should probably be in double backticks.

And could you add a :issue: for the issue you're closing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added those in


Sparse
^^^^^^
Expand Down
18 changes: 16 additions & 2 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -139,7 +139,8 @@ def group_median_float64(ndarray[float64_t, ndim=2] out,
def group_cumprod_float64(float64_t[:, :] out,
float64_t[:, :] values,
int64_t[:] labels,
bint is_datetimelike):
bint is_datetimelike,
bint skipna=True):
"""
Only transforms on axis=0
"""
Expand All @@ -163,14 +164,21 @@ def group_cumprod_float64(float64_t[:, :] out,
if val == val:
accum[lab, j] *= val
out[i, j] = accum[lab, j]
if val != val:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes more sense. I changed that for both cumsum and cumprod.

if skipna:
out[i, j] = NaN
else:
accum[lab, j] = NaN
out[i, j] = accum[lab, j]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you break early here? If you ever hit a NaN in a group, final output will be NaN so no need to continue adding to NaN.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true - I added a break there.



@cython.boundscheck(False)
@cython.wraparound(False)
def group_cumsum(numeric[:, :] out,
numeric[:, :] values,
int64_t[:] labels,
is_datetimelike):
is_datetimelike,
bint skipna=True):
"""
Only transforms on axis=0
"""
Expand All @@ -196,6 +204,12 @@ def group_cumsum(numeric[:, :] out,
if val == val:
accum[lab, j] += val
out[i, j] = accum[lab, j]
if val != val:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments about if -> else and early breaking.

if skipna:
out[i, j] = NaN
else:
accum[lab, j] = NaN
out[i, j] = accum[lab, j]
else:
accum[lab, j] += val
out[i, j] = accum[lab, j]
Expand Down
6 changes: 4 additions & 2 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1839,7 +1839,8 @@ def rank(self, method='average', ascending=True, na_option='keep',
@Appender(_doc_template)
def cumprod(self, axis=0, *args, **kwargs):
"""Cumulative product for each group"""
nv.validate_groupby_func('cumprod', args, kwargs, ['numeric_only'])
nv.validate_groupby_func('cumprod', args, kwargs,
['numeric_only', 'skipna'])
if axis != 0:
return self.apply(lambda x: x.cumprod(axis=axis, **kwargs))

Expand All @@ -1849,7 +1850,8 @@ def cumprod(self, axis=0, *args, **kwargs):
@Appender(_doc_template)
def cumsum(self, axis=0, *args, **kwargs):
"""Cumulative sum for each group"""
nv.validate_groupby_func('cumsum', args, kwargs, ['numeric_only'])
nv.validate_groupby_func('cumsum', args, kwargs,
['numeric_only', 'skipna'])
if axis != 0:
return self.apply(lambda x: x.cumsum(axis=axis, **kwargs))

Expand Down
20 changes: 20 additions & 0 deletions pandas/tests/groupby/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -2521,6 +2521,26 @@ def test_groupby_cumprod(self):
expected.name = 'value'
tm.assert_series_equal(actual, expected)

# make sure that skipna works
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split this into a new test? Make sure to add a reference to the Github issue as a comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd find this a bit easier to read as

df = pd.DataFrame({"x": [...], "gp": ['a'] * 6 + ['b'] * 6})

instead of concat.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe have a group with all-NA to ensure that is handled properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new test (test_groupby_cum_skipna) for all of the skipna-related cases. And added a groupby+cumsum/prod on all-NA values.

df = pd.concat(
[pd.DataFrame({'x': [1.0, 2.0, np.nan, np.nan, 3.0, 2.0],
'gp': 'a'}),
pd.DataFrame({'x': [2.0, 5.0, 6.0, 1.0, np.nan, 1.0],
'gp': 'b'})])
result = df.groupby('gp')['x'].cumprod(skipna=False)
expected = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, np.nan,
2.0, 10.0, 60.0, 60.0, np.nan, np.nan],
name='x', index=(0, 1, 2, 3, 4, 5,
0, 1, 2, 3, 4, 5))
tm.assert_series_equal(result, expected)

result = df.groupby('gp')['x'].cumprod()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think you need this test. skipna=False should be adequately tested as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removed it.

expected = pd.Series([1.0, 2.0, np.nan, np.nan, 6.0, 12.0,
2.0, 10.0, 60.0, 60.0, np.nan, 60.0],
name='x', index=(0, 1, 2, 3, 4, 5,
0, 1, 2, 3, 4, 5))
tm.assert_series_equal(result, expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for cumsum with skipna=False`?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be covered by test_groupby_cum_skipna now.

def test_ops_general(self):
ops = [('mean', np.mean),
('median', np.median),
Expand Down