BUG: Preserve Series/DataFrame subclasses through groupby operations #33884
I updated my branch with additional tests and a revised entry in the whatsnew rst file. Please let me know if there are any additional recommendations for how to proceed with this PR.
Thanks for the mypy advice -- I managed to refactor that section a bit to only call The one additional change made here is to make sure |
@jreback -- please let me know if there are any additional changes to this PR that you think are necessary.
jreback
left a comment
small comment, ping on green.
@jreback I made the recommended changes:
simonjayhawkins
left a comment
@JBGreisman generally lgtm. one question.
@simonjayhawkins Thanks for the comment -- I agree that line should have used |
pandas/core/groupby/groupby.py (outdated)
    """

    @property
    def _constructor(self) -> Type["Series"]:
hmm this should be FrameOrSeries I think.
why is there an else here?
I had thought this should be a Series because if self.obj is a DataFrame, it returns self.obj._constructor_sliced. Do you think I should name this property _series_constructor or something comparable to make that behavior more apparent?
The else was there because mypy was complaining about a missing return statement. I can restructure this with an assertion to avoid an else statement and keep mypy from complaining:
    @property
    def _series_constructor(self) -> Type["Series"]:
        # GH28330 preserve subclassed Series/DataFrames
        if isinstance(self.obj, DataFrame):
            return self.obj._constructor_sliced
        assert isinstance(self.obj, Series)
        return self.obj._constructor

Please let me know if you have a better way to structure this.
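The assertion-based structure proposed above can be tried in isolation. The following is a minimal sketch only: the `Series`, `DataFrame`, and `GroupBy` classes are hypothetical stand-ins written for this illustration, not the pandas classes, and `_sliced` is an invented attribute. It shows how the `assert isinstance(...)` narrows the type so that no trailing `else` is needed to avoid a "missing return statement" complaint.

```python
from typing import Type, Union


class Series:
    """Hypothetical stand-in for a 1-D container (not pandas.Series)."""

    @property
    def _constructor(self) -> Type["Series"]:
        return type(self)


class DataFrame:
    """Hypothetical stand-in for a 2-D container (not pandas.DataFrame)."""

    # Invented attribute: class used when a 1-D slice is taken.
    _sliced: Type[Series] = Series

    @property
    def _constructor_sliced(self) -> Type[Series]:
        return self._sliced


class GroupBy:
    def __init__(self, obj: Union[Series, DataFrame]) -> None:
        self.obj = obj

    @property
    def _series_constructor(self) -> Type[Series]:
        if isinstance(self.obj, DataFrame):
            return self.obj._constructor_sliced
        # The assert narrows the Union for the type checker, so the
        # final return needs no else branch.
        assert isinstance(self.obj, Series)
        return self.obj._constructor


class MySeries(Series):
    pass


class MyFrame(DataFrame):
    _sliced = MySeries
```

In this sketch, a `GroupBy` built from either a `MySeries` or a `MyFrame` resolves to `MySeries`, which mirrors the intent of always returning a 1-D constructor while keeping the subclass.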
yes pls use an assertion
maybe @simonjayhawkins or @WillAyd can help with the annotation itself
thanks -- I changed _constructor to use the assertion as above. I think it would also make sense to change the name of the property to _series_constructor in order to clarify the return type, but I'll hold off for additional comments.
I think the annotation is correct given the way it is implemented, though I am not sure about the implementation. Why do we need to dispatch to constructor_sliced for DataFrames? Seems slightly unnatural to have to force that
This ends up being used in GroupBy.ngroup(), GroupBy.cumcount(), and GroupBy.size(). Prior to these changes, these methods had returned Series, regardless of whether they were called from a SeriesGroupBy or DataFrameGroupBy object. I had updated this to still return a Series-type while preserving the subclasses -- to avoid things reverting back to pd.Series if they were called from a subclassed DataFrame/Series.
As such, the motivation for dispatching to constructor_sliced for DataFrames was to avoid changing the return type for these different GroupBy methods from their prior behavior. Do you think it would make sense to make a larger change here that alters these functions to have different return types if called from SeriesGroupBy vs. DataFrameGroupBy?
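The behavior described for `ngroup()`, `cumcount()`, and `size()` can be checked with a small script. This is a hedged sketch, assuming a pandas release that includes this change; `MySeries` and `MyFrame` are hypothetical subclasses defined here using the `_constructor` / `_constructor_sliced` / `_constructor_expanddim` hooks from the pandas subclassing documentation.

```python
import pandas as pd


class MySeries(pd.Series):
    """Hypothetical Series subclass used only for this illustration."""

    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        return MyFrame


class MyFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass used only for this illustration."""

    @property
    def _constructor(self):
        return MyFrame

    @property
    def _constructor_sliced(self):
        return MySeries


df = MyFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
grouped = df.groupby("key")

# These methods return a 1-D result even for a DataFrameGroupBy, so the
# subclass to preserve is the one given by _constructor_sliced.
results = {
    "ngroup": grouped.ngroup(),
    "cumcount": grouped.cumcount(),
    "size": grouped.size(),
}
for name, result in results.items():
    print(name, type(result).__name__)
```

If the subclass is preserved as discussed, each printed type name is the Series subclass rather than plain `Series`.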
Hmm that's unfortunate... Can you rename this property to _obj_constructor instead? I find the current name a little confusing
Or maybe even _obj_1d_constructor to be even more explicit. This is definitely for special cases so want to signal as such
yeah -- I think _obj_1d_constructor is most clear. I'll make the changes.
As per the discussion above, the GroupBy property has been renamed to |
thanks @JBGreisman very nice, thanks for sticking with it and being responsive!
no worries -- thanks for the help/suggestions!
This pull request fixes a bug that caused subclassed Series/DataFrames to revert to pd.Series and pd.DataFrame after groupby() operations. This is a follow-up on a previously abandoned pull request (#28573).