-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
Optimize array_equivalent for NDFrame.equals #35328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize array_equivalent for NDFrame.equals #35328
Conversation
|
restarting failed travis job |
| if left.shape != right.shape: | ||
| return False | ||
|
|
||
| if dtype_equal: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not sure i understand why you actually need this
the dtype check is cheap compared to the actual check no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- They add up, especially when doing things columnwise for small, wide dataframes
- We get to skip a few other checks, like swapping
np.isnanforpd.isnafor float, and we can skip the empty check at https://github.com/pandas-dev/pandas/pull/35328/files#diff-ff8364cee9a3e1ef3a3825cb2cdd26d8L431.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine
pandas/core/dtypes/missing.py
Outdated
| return np.array_equal(left, right) | ||
|
|
||
|
|
||
| def array_equivalent_float(left, right): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make all of these helpers private functions
|
|
||
|
|
||
| def _array_equivalent_float(left, right): | ||
| return ((left == right) | (np.isnan(left) & np.isnan(right))).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about use_inf_as_na?
Can this be re-used on L426? (also the prod(shape) check on L424-425 i think is redundant with earlier shape check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like 1.0.5 doesn't care about that
In [15]: df1 = pd.DataFrame({"A": np.array([np.nan, 1, np.inf])})
In [16]: df2 = pd.DataFrame({"A": np.array([np.nan, 1, np.nan])})
In [17]: with pd.option_context('mode.use_inf_as_na', True):
...: print(df1.equals(df2))
...:
...:
FalseThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for checking
|
Travis finished. I'm going to selfishly ignore #35328 (comment) so that I can get the RC process rolling, but happy to engage more on followups if I missed something. |
Closes #35249
And proof of this being better for short-circuiting:
Discussion: the key thing here is that
BlockManager.equalsrequires that the dtypes ofleftandrightmatch. If they don't then we've already returned False. That lets us do a few thingsis_foo_dtypechecks.np.isnannp.array([1, 2])equivalent tonp.array([1.0, 2.0]).