-
Notifications
You must be signed in to change notification settings - Fork 234
**Breaking**: data_kind: Now 'matrix' represents a 2-D numpy array and unrecognized data types fall back to 'vectors' #3351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ea90891 to
c0a5bdb
Compare
c0a5bdb to
78066a4
Compare
f160dd6 to
1c517e6
Compare
1c517e6 to
21f581a
Compare
52b711b to
713d0a2
Compare
pygmt/helpers/utils.py
Outdated
| return "geojson" | ||
|
|
||
| # A 2-D numpy.ndarray | ||
| if isinstance(data, np.ndarray) and data.ndim == 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably can change this line to:
| if isinstance(data, np.ndarray) and data.ndim == 2: | |
| if hasattr(data, "__array_interface__") and len(data.__array_interface__.shape) == 2: |
to support more array-like objects that implements the __array_interface__ protocol (https://numpy.org/doc/stable/reference/arrays.interface.html#object.__array_interface__), but it can be done in a future PR.
| case "file": | ||
| file_contexts.append(contextlib.nullcontext(track)) | ||
| case "matrix": | ||
| case "vectors": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.DataFrame now is "vectors" kind.
pygmt/helpers/utils.py
Outdated
| case np.ndarray() if data.ndim == 2: # A 2-D numpy.ndarray object. | ||
| kind = "matrix" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- Should we extend
kind="matrix"to 3-D numpy.ndarray objects? - I'm hesitant to match based on
np.ndarray()class only, because while it's not widely advertised, checking fordata is not Nonewould have caught other array types implementing the__array_function__protocol (see NEP18), but checking forisinstance(data, np.ndarray)would exclude those. That said, I'm wondering if the fallback tokind="vectors"might just work, but need to test this out more extensively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Should we extend
kind="matrix"to 3-D numpy.ndarray objects?
It's unclear if we can pass a 3-D numpy array yet. Even if we can, it means more work, since in virtualfile_from_vectors/virtualfile_from_matrix, we explicitly check if the ndim=1 or 2. So, better to revisit the 3-D numpy array support in a separate PR if necessary.
- I'm hesitant to match based on
np.ndarray()class only, because while it's not widely advertised, checking fordata is not Nonewould have caught other array types implementing the__array_function__protocol (see NEP18), but checking forisinstance(data, np.ndarray)would exclude those. That said, I'm wondering if the fallback tokind="vectors"might just work, but need to test this out more extensively.
Yes, I was thinking about checking __array_interface__ in #3351 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4a4f192, although other array-like objects are not tested.
| if data.dtype.kind not in "iuf": | ||
| _virtualfile_from = self.virtualfile_from_vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for these lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a test for it. Before adding more tests, I'm wondering if we should split the big "test_clib_virtualfiles.py" file (with more than 500 lines) into separate smaller test files, i.e., one test file for each Session method.
test_clib_virtualfile_intest_clib_virtualfile_from_vectorstest_clib_virtualfile_from_matrixtest_clib_open_virtualfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've split it before in #2784, so yes, ok to split it again 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a test in cfa32ed to cover this line.
Co-authored-by: Wei Ji <[email protected]>
| z=data[:, 2], | ||
| data=data, | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is added to address #3351 (comment).
This new test passes a string dtype numpy array into GMT C API, which contains longitude/latitude strings. The data kind is "matrix", but since the data dtype is not in iuf, PyGMT will use virtualfile_from_vectors rather than virtualfile_from_matrix to pass the data into GMT C API. Ideally, we should check that the virtualfile_from_vectors is called once and virtualfile_from_matrix is not called, but I find it technically complicated with unittest.mock, so I'll leave it untested.
The function is marked as xfail with GMT 6.5 due to a newly found upstream bug at GenericMappingTools/gmt#8600.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, and thanks for adding the test!
6b7a578 to
423e5dc
Compare
Need to wait for #3481.Description of proposed changes
As can be seen in the doctests added in #3480, currently,
matrixandvectorskinds are defined like below:vectorskind meansdata=Noneandrequired=True. It means that the input is given via a series of vectors (e.g., x/y/z).matrixrepresents any data types not recognized as arg/file/stringio/grid/image/geojson. The most common data types includepd.DataFrame,pd.Series,xr.Dataset,np.ndarray, and sequence of sequences.So,
matrixhere is an inclusive concept. However, in both Python and GMT,matrixusually means a homogenous 2-D array. When passing a "matrix" data to GMT, we usually treat the "matrix" as a series of vectors. The only exception is when the "matrix" data is a strict matrix (a 2-D numpy array) withdata.dtype.kind in "iuf".pygmt/pygmt/clib/session.py
Lines 1742 to 1755 in d7560fa
I think it makes more sense to use a more strict definition for
matrixand let all other data types bevectors. I propose to change the definitions of data kinds to:matrix: a 2-D homogenous numpy.ndarrayvectors: fallbacks to this kind for any unrecognized data types, including a dictionary,pd.DataFrame,pd.Series,xr.Dataset, nested lists/tuples, or any other unrecognized data types.The refactoring is mainly done in 808755d.
Of course,
data=None, required=Trueis still recognized as "vectors". I also think we should give this case another special kind like"none", which makes more sense and can also simplify the codes. [This is done in a separate PR at #3482, but I'm OK to merge #3482 into this PR first before merging this PR into main].Other data kinds are unchanged.
This is a breaking change since some previous "matrix" data are now recognized "vectors" and previously "vectors" is "none". But I guess very few users are using this function, so the breaking change should have minimal effects.