-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
ENH: .equals for Extension Arrays #30652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
thanks for your contribution! @dwhu in general here, if it is Good luck! and hope this helps |
@charlesdong1991 - Sounds good. Thanks for the feedback. Wanted to make sure I wasn't missing something or breaking a behavior contract by adding this behavior. |
TomAugspurger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think most of the __eq__ implementations you've added can be removed.
@jschendel is in the middle of a PR fixing IntervalArray.__eq__, so ignore that for now if it's causing failures.
We'll need some base tests in pandas/tests/extensions/base/ops.py ensuring that __eq__ and __ne__ are implemented correctly. We'll want a few tests in tests/extensions/base/methods.py ensuring that .equals is tested. These tests are a bit tricky to write, so let us know if you have issues.
pandas/core/arrays/base.py
Outdated
| bool | ||
| """ | ||
|
|
||
| raise AbstractMethodError(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be return ~(self == other)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/base.py
Outdated
| Whether the arrays are equivalent. | ||
| """ | ||
| return ((self == other) | (self.isna() == other.isna())).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this also needs to ensure that the type of other matches the type of self. We don't want pd.array([1, 2]).equals(np.array([1, 2])) to be True (I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/boolean.py
Outdated
| if not isinstance(other, BooleanArray): | ||
| return NotImplemented | ||
| return ( | ||
| hasattr(other, "_data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BooleanArray should already have a __eq__ method (see create_comparison_method Was this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed. Missed that in my WIP PR.
Adding __eq__ to ExtensionArray Abstract method doc string. Adding ne implementation to EA base class. Also removing other implementations. Updating EA equals method and adding tests. pandas-devGH-27081
pandas/core/arrays/base.py
Outdated
| for i in range(len(self)): | ||
| yield self[i] | ||
|
|
||
| def __eq__(self, other: ABCExtensionArray) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of other should be Any I think. Passing a list / etc here is valid, it just returns False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to update the type in the docstring too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/base.py
Outdated
|
|
||
| raise AbstractMethodError(self) | ||
|
|
||
| def __ne__(self, other: ABCExtensionArray) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about the type on other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/base.py
Outdated
| bool | ||
| """ | ||
|
|
||
| raise AbstractMethodError(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, does this need to be abstract? Or can we define it as something like
# check that the types, length match, etc.
# check that NA values in self match NA values in other...
return np.all(np.array(self) == np.array(other))Then subclasses can override with implementations that may be faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. I like that much better.
pandas/core/arrays/base.py
Outdated
| Whether the arrays are equivalent. | ||
| """ | ||
| return isinstance(other, self.__class__) and ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps type(self) == type(other)? It's not clear to me whether a subclass should compare .equals to a parent class. Thoughts @jreback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will want some length checking here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added type(self) == type(other) and length checks.
pandas/core/arrays/integer.py
Outdated
| def __eq__(self, other): | ||
| return ( | ||
| isinstance(other, IntegerArray) | ||
| and hasattr(other, "_data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have IntegerArray.__eq__ on master. This shouldn't be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
| ser = pd.Series(cls._from_sequence(data, dtype=data.dtype)) | ||
| na_ser = pd.Series(cls._from_sequence([na_value], dtype=data.dtype)) | ||
|
|
||
| assert data.equals(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make all these assert <condition> is True, to ensure that True or False is actually returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into a funny issue doing this: numpy/numpy#5942
I'm going to keep this as is now to keep the test consistent within itself.
.equals() returns a np.bool_ type. As result when you compare it to bool type True using is (i.e. some_np_bool_True_val is True == False
Coercing np.bool_ to bool using the bool() constructor did not work either. == True causes PEP8 errors asking me to use assert <condition> or assert <condition> is True.
Open to suggestions if you have any.
| assert not na_ser.equals(ser) | ||
| assert not ser.equals(na_ser) | ||
| assert not ser.equals(0) | ||
| assert not na_ser.equals(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test for other inputs like an ndarray.
Add a test for unequal length things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| other = pd.Series([data[0]] * len(data)) | ||
| self._compare_other(s, data, op_name, other) | ||
|
|
||
| def test_direct_arith_with_series_returns_not_implemented(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undo this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the other comment. The whole PR is intended to break this test since we implement __eq__ and __ne__.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the other comment. The whole PR is intended to break this test since we implement eq and ne.
The reason this needs to be kept is that, although __eq__ is a required method on an ExtensionArray now, we still require it to return NotImplemented when other is a Series or Index.
|
@dwhu looks like CI is red - can you take a look and fix up the merge conflict? |
|
Closing as I think this has gone stale but ping if you'd like to pick back up. Thanks! |
|
@dwhu I am going to look into further updating this, if that's OK with you |
|
OK, I updated this with the following things:
Should be ready for a new review now @TomAugspurger @jbrockmendel |
TomAugspurger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice. Can you add a release note?
|
|
||
| def __eq__(self, other: Any) -> Union[np.ndarray, "ExtensionArray"]: | ||
| """ | ||
| Return for `self == other` (element-wise equality). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have some guidance here that if other is a Series then you should return NotImplemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use ops.common.unpack_zerodim_and_defer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use ops.common.unpack_zerodim_and_defer
This method is to be implemented by EA authors, so those can't use that helper (unless we expose somewhere a public version of this).
(we could of course use that for our own EAs, but this PR is not changing any existing __eq__ implementation at the moment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have some guidance here that if other is a Series then you should return NotImplemented.
Added a comment about that
| if not type(self) == type(other): | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should verify that this is the behavior we want. Namely
- other array-likes are not equivalent, even if they are all equal.
- subclasses are not equivalent, even if they are all equal.
The first seems fine. Not sure about the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning to open an issue about this after this PR (and keep it strict here), because this is right now a bit inconsistent within pandas, and might require a more general discussion / clean-up (eg Series.equals is more strict (requires same dtype) than Index.equals ...)
But we can certainly also have the discussion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to be strict for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it even more strict (same dtype, not just same class), and added a test for that.
Will open an issue for the general discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened an issue for this at #33940
In the end, it seems mainly to come to whether the dtype should exactly be equal or not.
Since for EAs, the dtype is right now tied to the array class, using equal dtype for now also implies the same class (no additional check to allow sublcasses).
| else: | ||
| data.repeat(repeats, **kwargs) | ||
|
|
||
| def test_equals(self, data, na_value, as_series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also include DataFrame in this parameterization, or do it somewhere else? Just to ensure coverage of the changes to blocks.py.
pandas/core/arrays/base.py
Outdated
| """ | ||
| raise AbstractMethodError(self) | ||
|
|
||
| def __ne__(self, other: Any) -> Union[np.ndarray, "ExtensionArray"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should allow us to remove the equivalent IntervalArray definition here:
pandas/pandas/core/arrays/interval.py
Lines 609 to 610 in d149f41
| def __ne__(self, other): | |
| return ~self.__eq__(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union[np.ndarray, "ExtensionArray"] -> ArrayLike
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jschendel done!
| # boolean array with NA -> fill with False | ||
| equal_values = equal_values.fillna(False) | ||
| equal_na = self.isna() & other.isna() | ||
| return (equal_values | equal_na).all().item() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the .item() necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the .item() necessary?
It's to have a python bool, instead of a numpy bool, as result.
I added a comment to the tests to make it explicit this is the reason we are asserting with is True/False (and later on we don't inadvertedly "clean" that up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for clarifying. i was recently reminded that np.array(np.timedelta64(1234, "ns")).item() gives an int instead of timedelta64, so im now cautious around .item()
|
@simonjayhawkins how do I solve this typing error? So our annotation of |
|
Any more comments on this PR? |
|
will look soon |
pandas/core/arrays/base.py
Outdated
| yield self[i] | ||
|
|
||
| def __eq__(self, other: Any) -> ArrayLike: | ||
| def __eq__(self, other: Any) -> ArrayLike: # type: ignore[override] # NOQA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not familiar with the ignore[overrride] # NOQA pattern. do we use that elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I posted the mypy issue, but apparently forgot.
So the problem here is that mypy expects a "bool" return for __eq__ since the base object is typed that way in the stubs. In python/mypy#2783, the recommended way to solve this is the above with # type: ignore[override].
But flake8 doesn't like that, hence also the #NOQA (PyCQA/pyflakes#475).
This was already done in another PR as well:
Line 5062 in 7aa710a
| def sort_values( # type: ignore[override] # NOQA # issue 27237 |
|
@jreback you wanted to take a look here? |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. a couple of minor comment.s
doc/source/whatsnew/v1.1.0.rst
Outdated
| such as ``dict`` and ``list``, mirroring the behavior of :meth:`DataFrame.update` (:issue:`33215`) | ||
| - :meth:`~pandas.core.groupby.GroupBy.transform` and :meth:`~pandas.core.groupby.GroupBy.aggregate` has gained ``engine`` and ``engine_kwargs`` arguments that supports executing functions with ``Numba`` (:issue:`32854`, :issue:`33388`) | ||
| - :meth:`~pandas.core.resample.Resampler.interpolate` now supports SciPy interpolation method :class:`scipy.interpolate.CubicSpline` as method ``cubicspline`` (:issue:`33670`) | ||
| - The ``ExtensionArray`` class has now an ``equals`` method, similarly to ``Series.equals()`` (:issue:`27081`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add the doc reference here
pandas/core/internals/blocks.py
Outdated
| return [self.make_block_same_class(result, placement=self.mgr_locs)] | ||
|
|
||
| def equals(self, other) -> bool: | ||
| if self.dtype != other.dtype or self.shape != other.shape: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this check here? (its already done on the .values no?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good question. I just copied that from the base Block.equals method (so it's done there as well). There, array_equivalent is used. Maybe that is less strict and the extra check is needed, which is not needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least from the docstring of array_equivalent (not sure how up to date that is), using that method indeed requires a more strict check in advance of calling it:
pandas/pandas/core/dtypes/missing.py
Lines 360 to 363 in 3ed7dff
| in corresponding locations. False otherwise. It is assumed that left and | |
| right are NumPy arrays of the same dtype. The behavior of this function | |
| (particularly with respect to NaNs) is not defined if the dtypes are | |
| different. |
But so, for EA.equals that is not needed, so will remove that check here.
|
Travis failures are unrelated (network issues, S3) |
Co-authored-by: Joris Van den Bossche <[email protected]>
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diff--- Jan 3 10:00 PST
Hey team,
Looking for a bit of feedback on this API design and appropriate testing for this PR before I get too deep into it. Could you provide some guidance regarding how to appropriately handle updating this test?
pandas/pandas/tests/extension/base/ops.py
Line 162 in 6f03e76
Appreciate the help,
Dave