@topper-123 topper-123 commented May 14, 2020

An unneeded NumPy array is created for each group when calling groupby.first and groupby.last on ExtensionArrays. This PR avoids that.

>>> cat = pd.Categorical(["a"] * 1_000_000 + ["b"] * 1_000_000)
>>> ser = pd.Series(cat)
>>> %timeit ser.groupby(cat).first()
210 ms ± 3.03 ms per loop  # master
78.4 ms ± 766 µs per loop  # this PR

The same speedup is achieved for groupby.last. The above is about 2.7x faster than on master because there are two groups, so we save creating two arrays. With more groups or larger arrays, the improvement would be even larger.
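For anyone who wants to check the behaviour outside IPython, here is a minimal sketch of the same comparison using the standard-library `timeit` module. The smaller array size, the explicit `observed=False` argument, and the variable names are my own choices for illustration and are not part of this PR. Absolute timings will depend on the machine and pandas version.

```python
import timeit

import pandas as pd

# Smaller than the 1_000_000 used above so the sketch runs quickly (assumption).
n = 100_000
cat = pd.Categorical(["a"] * n + ["b"] * n)
ser = pd.Series(cat)

# observed=False is passed explicitly only to keep the grouping behaviour
# unambiguous across pandas versions; it is not required by this PR.
first = ser.groupby(cat, observed=False).first()
last = ser.groupby(cat, observed=False).last()

# Each group contains a single repeated value, so first and last agree.
print(first.tolist(), last.tolist())

# Rough timing; compare the number between master and this branch.
elapsed = timeit.timeit(
    lambda: ser.groupby(cat, observed=False).first(), number=3
)
print(f"groupby.first: {elapsed / 3 * 1000:.1f} ms per call")
```

Swapping `.first()` for `.last()` in the timed lambda gives the corresponding number for groupby.last.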

Also adds some type hints to clarify which parameters these functions accept.

@topper-123 topper-123 force-pushed the groupby_first_last branch from a457506 to 0389edc Compare May 14, 2020 18:37
@jreback jreback added the Groupby and Performance (memory or execution speed performance) labels May 15, 2020
@jreback jreback added this to the 1.1 milestone May 15, 2020
@jreback jreback merged commit 1f1735e into pandas-dev:master May 15, 2020
jreback commented May 15, 2020

thanks @topper-123 very nice

@topper-123 topper-123 deleted the groupby_first_last branch May 24, 2020 16:56