Skip to content

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Aug 3, 2022

This is motivated by https://docs.rapids.ai/api/kvikio/stable/api.html#kvikio.zarr.GDSStore which on read loads the data directly into GPU memory.

Currently we rely on np.asarray to convert a BackendArray wrapped with a number of lazy indexing classes to a real array but this breaks for GDSStore because the underlying array is a cupy array, so using np.asarray raises an error.

np.asarray will raise if a non-numpy array is returned so we need to use something else.

Here I added get_array which like np.array recurses down until it receives a duck array.

Quite a few things are broken I think , but I'd like feedback on the approach.

I considered np.asanyarray(..., like=...) but that would require the lazy indexing classes to know what they're wrapping which doesn't seem right.

Ref: xarray-contrib/cupy-xarray#10 which adds a kvikio backend entrypoint

dcherian and others added 6 commits August 3, 2022 09:08
This returns the underlying array type instead of always casting
to np.array. This is necessary for Zarr stores where the
Zarr Array wraps a cupy array (for example kvikio.zarr.GDSStoree).
In that case, we cannot call np.asarray because __array__ is
expected to always return a numpy array.

We use get_array in Variable.data to make sure we don't load arrays
from such GDSStores.
instead of always casting to np.asarray
@shoyer
Copy link
Member

shoyer commented Aug 3, 2022

As I understand it, the main purpose here is to remove Xarray lazy indexing class.

Maybe call this get_duck_array(), just to be a little more descriptive?

@dcherian dcherian changed the title Avoid calling np.asarray on backend arrays Avoid calling np.asarray on lazy indexing classes Aug 3, 2022
@dcherian dcherian marked this pull request as ready for review August 3, 2022 20:57
@dcherian dcherian requested a review from shoyer August 3, 2022 20:58
# so we need the explicit check for ExplicitlyIndexed
if isinstance(array, ExplicitlyIndexed):
array = array.get_duck_array()
return _wrap_numpy_scalars(array)
Copy link
Contributor Author

@dcherian dcherian Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding _wrap_numpy_scalars allows us to handle scalars being returned by the backend. This seems OK to me in that we place fewer restrictions on the backend (and is backward compatible).

def _wrap_numpy_scalars(array):
"""Wrap NumPy scalars in 0d arrays."""
if np.isscalar(array):
return np.array(array)
else:
return array

But now the issue is that we should pass an appropriate like argument to np.array but I don't see how to that from a scalar array

Good news is that backends can avoid this complication by returning arrays, so we could just ignore this ugly bit for now.

dcherian and others added 2 commits February 16, 2023 12:51
Co-authored-by: Illviljan <[email protected]>
Co-authored-by: Stephan Hoyer <[email protected]>
@dcherian
Copy link
Contributor Author

@Illviljan feel free to push any typing changes to this PR. I think that would really help clarify the interface. I tried adding a DuckArray type but that didn't go to far.

@Illviljan
Copy link
Contributor

I don't have a better idea than to do DuckArray = Any # ndarray/cupy/sparse etc. and add that as output, but that wont change anything mypy-wise besides making it easier for us to read the code.

@dcherian
Copy link
Contributor Author

T_ExplicitlyIndexed may be a different thing to add

dcherian and others added 9 commits March 3, 2023 16:49
* main: (40 commits)
  Faq pull request (According to pull request pydata#7604 & issue pydata#1285 (pydata#7638)
  add timeouts for tests (pydata#7657)
  Pull Request Labeler - Undo workaround sync-labels bug (pydata#7667)
  [pre-commit.ci] pre-commit autoupdate (pydata#7651)
  Allow all integer dtypes in `polyval` (pydata#7619)
  [skip-ci] dev whats-new (pydata#7660)
  Redo whats-new for 2023.03.0 (pydata#7659)
  Set copy=False when calling pd.Series (pydata#7642)
  Pin pandas < 2 (pydata#7650)
  Whats-new for release 2023.03.0 (pydata#7643)
  Bump pypa/gh-action-pypi-publish from 1.7.1 to 1.8.1 (pydata#7648)
  Use more descriptive link texts (pydata#7625)
  Fix missing 'dim' argument in _get_nan_block_lengths (pydata#7598)
  Fix `pcolormesh` with str coords (pydata#7612)
  [skip-ci] Fix groupby binary ops benchmarks (pydata#7603)
  Remove incomplete sentence in IO docs (pydata#7631)
  Allow indexing unindexed dimensions using dask arrays (pydata#5873)
  Bump pypa/gh-action-pypi-publish from 1.6.4 to 1.7.1 (pydata#7618)
  [pre-commit.ci] pre-commit autoupdate (pydata#7620)
  add a test for scatter colorbar extend (pydata#7616)
  ...
@dcherian dcherian added the plan to merge Final call for comments label Mar 26, 2023
@dcherian
Copy link
Contributor Author

I'd like to merge this at the end of next week.

It now has tests and should be backwards compatible with external backends.

A good next step would be to finish up #7020

@Illviljan Illviljan mentioned this pull request Mar 31, 2023
4 tasks
@dcherian dcherian merged commit aa4361d into pydata:main Mar 31, 2023
@dcherian dcherian deleted the kvikio branch March 31, 2023 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io plan to merge Final call for comments topic-backends topic-indexing topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants