-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: support median function for custom BaseIndexer rolling windows #33626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Place of implementationI think touching if (end - start).max() == 0:
output[:] = NaN
return output
win = (end - start).max()
sl = skiplist_init(<int>win)This would set UPDATE: After discussing with @mroeschke , we decided to implement the bugfix directly in roll_median_c. |
|
@mroeschke @jreback I'm interested in what you think on where we should implement the bugfix. For now, I added the code to I checked both options, and they both work. Because rolling median calculation itself is O(N * log(window_size)) , the performance hit from the extra All the details are above. UPDATE: After discussing with @mroeschke , we decided to implement the bugfix directly in roll_median_c. |
|
I think I would prefer the solution described here: #33626 (comment) And we could make due with a comment in I'd prefer to call |
|
@mroeschke Because we are now ignoring the |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor comment, ping on green.
| tm.assert_equal(result, expected) | ||
| expected2 = constructor(rolling.apply(lambda x: np_func(x, **np_kwargs))) | ||
| tm.assert_equal(result, expected2) | ||
| # Check that the function works without min_periods supplied |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a blank line before each case
and add comments for lines 169, 173.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Done, all green.
|
thanks @AlexKirko |
|
This PR caused a big performance regression for rolling median (see eg https://pandas.pydata.org/speed/pandas/#rolling.ForwardWindowMethods.time_rolling?p-constructor='DataFrame'&p-window_size=1000&p-dtype='float'&p-method='median'&commits=3a801661). |
|
@jorisvandenbossche pls read the perf section above |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffScope of PR
This PR makes sure that when we calculate the median in
roll_median_c, we override the maximum window width with the correct value, and the function doesn't shortcut by returning all NaNs.Details
The
medianfunction eventually callsroll_median_cwhich accepts awinparameter (maximum window width) to correctly allocate memory and initialize the skiplist data structure which is the backbone of the rolling median algorithm. Currently,winis determined by the_get_windowfunction which returnsmin_periods or 0for customBaseIndexersubclasses:Thus,
roll_median_ceither shortcuts or initializes to incorrect depth:... if win == 0 or (end - start).max() == 0: output[:] = NaN return output win = (end - start).max() sl = skiplist_init(<int>win) ...I propose we determine max window length directly in themedianfunction. This means thatstartandendarrays get calculated twice: here and in_apply. However, I belive this is better than injecting a median-specific crutch into_applyor messing with the shortcut inroll_median_c(we could attempt to overridewinif(end - start).max() > 0. This other option is explored below.After discussing with @mroeschke , we decided to implement the bugfix directly in
roll_median_c. Details below.Please say if you think another approach would be preferable.
Background on the wider issue
We currently don't support several rolling window functions when building a rolling window object using a custom class descended from
pandas.api.indexers.Baseindexer. The implementations were written with backward-looking windows in mind, and this led to these functions breaking.Currently, using these functions returns a
NotImplementederror thanks to #33057, but ideally we want to update the implementations, so that they will work without a performance hit. This is what I aim to do over a series of PRs.Perf notes
The function currently shortcuts because of the bug or initializes the main datastructure incorrectly. For this reason, benchmarks are meaningless.
Ran benchmarks vs. the shortcut anyway: