-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: GroupBy.get_group doesnt work with TimeGrouper #6914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this assume each group is contiguous / sorted?
I have a feeling there is a more efficient way to do this get this out, @jreback ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is what _groupby_indices does (a cython routine). also make an example that has an unsorted bins for testing as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that binlabels and bins are always sorted before passed to BinGrouper. Is it incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. in any event, does not _groupby_indices work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following is the _groupby_indices result. It returns correct index even if the input is not sorted?
>>> import pandas.algos as _algos
>>> import pandas.core.common as com
>>> values = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([0, 1, 2, 3, 4]), 'C': array([7]), 'B': array([5, 6])}
>>> values = ['B', 'B', 'C', 'A', 'A', 'A', 'A', 'A']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([3, 4, 5, 6, 7]), 'C': array([2]), 'B': array([0, 1])}
But BinGrouper doesn't know actual data index by itself, so I'm not sure what results are actually correct. Should it return the same indices regardless of bins order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, different. BinGrouper.indices must be a dict which key is timestamp. The logic cannot be replaced with _groupby_indices.
>>> b.indices
defaultdict(<type 'list'>, {Timestamp('2013-12-31 00:00:00', offset='M'): [6, 7], Timestamp('2013-10-31 00:00:00', offset='M'): [2, 3, 4, 5], Timestamp('2013-01-31 00:00:00', offset='M'): [0, 1]})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use _get_indicies_dict; don't reinvent the wheel here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using _get_indices_dict is not easy, because BinGrouper doesn't have information which can be passed to the function as it is.
>>> import pandas as pd
>>> import datetime
>>> from pandas.core.groupby import GroupBy, _get_indices_dict
>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
'Quantity': [1,3,5,1,8,1,9,3],
'Date' : [
datetime.datetime(2013,1,1,13,0), datetime.datetime(2013,1,1,13,5),
datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
datetime.datetime(2013,12,2,12,0), datetime.datetime(2013,12,2,14,0),]})
>>> grouped = df.groupby(pd.Grouper(freq='1M',key='Date'))
>>> grouped.grouper.bins
[2 2 2 2 2 2 2 2 2 6 6 8]
>>> grouped.grouper.binlabels
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-31, ..., 2013-12-31]
Length: 12, Freq: M, Timezone: None
Thus, I have to convert it using the similar logic as current implementation.
>>> indices = []
>>> i = 0
>>> for j, bin in enumerate(grouped.grouper.bins):
>>> if i < bin:
>>> indices.extend([j] * (bin - i))
>>> i = bin
>>> indices = np.array(indices)
>>> indices
[ 0 0 9 9 9 9 11 11]
And _get_indices_dict returns keys as tuple, further conversion required.
>>> _get_indices_dict([indices], [grouped.grouper.binlabels])
{(numpy.datetime64('2013-10-31T09:00:00.000000000+0900'),): array([2, 3, 4, 5]), (numpy.datetime64('2013-12-31T09:00:00.000000000+0900'),): array([6, 7]), (numpy.datetime64('2013-01-31T09:00:00.000000000+0900'),): array([0, 1])}
Do you have better logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not do something like (?):
{g.grouper.levels[k] for k, v in pd.core.groupby._groupby_indices(b.bins).iteritems()}
may be faster...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but failed in test. I think simply passing bins to existing method shouldn't work, because bins are corresponding to frequencies to be split, not index. Thus its length can differ from index.
Using dataframe in above example, returnes values are different from expected.
>>> list(pd.core.groupby._groupby_indices(grouped.grouper.bins).iteritems())
[(8, array([11])), (2, array([0, 1, 2, 3, 4, 5, 6, 7, 8])), (6, array([ 9, 10]))]
|
Thanks, this is definitely a bug and tests look good! I reckon there's a more efficient way to get indices though. |
|
@sinhrks as a side note, pls make sure that your examples are easily copy-pasted (e.g. put the |
|
put this release note next to or with the one for #5267 |
|
@sinhrks ping when you update |
BUG: GroupBy.get_group doesnt work with TimeGrouper
|
ok...this is fine, if you think of someway to reuse the current groupby functions for that thanks! |
get_groupraisesAttributeErrorwhen the group is created byTimeGrouper.