-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
ENH: Add numba engine to groupby.transform #32854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-04-16 04:34:45 UTC |
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable
pandas/core/util/numba_.py
Outdated
| return numba_func | ||
|
|
||
|
|
||
| def split_for_numba(arg: FrameOrSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you annotate return types here?
pandas/core/util/numba_.py
Outdated
| return arg.to_numpy(), arg.index.to_numpy(), columns_as_array | ||
|
|
||
|
|
||
| def validate_udf(func: Callable, include_columns: bool = False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment
|
@mroeschke this is orthogonal to the .agg one? IOW does ordering of merge matter? |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good.
I would introduce a Dispatcher concept here, with a Cython and a Numba Dispatcher.
this way we can move all of the messy logic to that class and just call generically.
| for name, group in self: | ||
| object.__setattr__(group, "name", name) | ||
| res = func(group, *args, **kwargs) | ||
| if engine == "numba": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like to see this as
def _evaluate_udf
| def check_kwargs_and_nopython( | ||
| kwargs: Optional[Dict] = None, nopython: Optional[bool] = None | ||
| ): | ||
| ) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string
|
|
||
| def split_for_numba(arg: FrameOrSeries) -> Tuple[np.ndarray, np.ndarray, np.ndarray]: | ||
| """ | ||
| Split pandas object into its components as numpy arrays for numba functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add Parameters / Returns section
|
|
||
|
|
||
| def validate_udf(func: Callable, include_columns: bool = False) -> None: | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
|
for timings you can use this code |
|
Here the timings with the above benchmark. I'll use a modified version of it for the ASV |
|
It's also not immediately obvious how to include a Dispatcher concept here, especially since the "cython" path has a lot of fallback behavior. I can look into introducing a dispatcher concept in a followup |
|
Also @jreback, is it really useful that the if we are doing a DataFrame object transform that the column name is passed to the udf? I can see it as a potential pitfall as column names are usually strings and numba functions do not support objects? EDIT: Made a decision to not pass in the dataframe column as an array into the UDF before calling transform |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffIn the same spirit of #31845, adding
engineandengine_kwargsarguments togroupby.transform(which was easier to tackle first thangroupby.apply). This signature is the same as what was added torolling.apply.Constraints:
def f(values, index, ...), explicitly those names, as we will pass in the the values and the pandas index (as a numpy array) into the udf