Handle Dask arrays in some utilities #2696
Conversation
Co-authored-by: Isaac Virshup <[email protected]>
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2696      +/-   ##
==========================================
+ Coverage   72.87%   72.92%   +0.04%
==========================================
  Files         110      111       +1
  Lines       12100    12133      +33
==========================================
+ Hits         8818     8848      +30
- Misses       3282     3285       +3
```
I find that an implementation (using the master branch) that looks like:

```python
X_dask_f32.map_blocks(
    check_nonnegative_integers,
    dtype=bool,
    drop_axis=(0, 1),
).compute()
```

has a few strong advantages over this approach:

- It's about 4x faster for the case `X = da.array(rng.poisson(size=(20_000, 10_000)), chunks=(1000, 10_000))` (similar speed to in-memory).
- It works when each chunk of the dask array is a sparse matrix.

As mentioned in the last PR, this is also the approach that the docs for `dask.array` seem to suggest. I think we should go with this approach here.
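For orientation, the check being mapped over the blocks answers "is every value a non-negative whole number?". A simplified stand-in for `check_nonnegative_integers` (a sketch, not scanpy's actual implementation) might look like:

```python
import numpy as np
from scipy import sparse

def check_nonnegative_integers_sketch(X) -> bool:
    # Simplified stand-in: True iff every stored value is a
    # non-negative whole number (handles dense and scipy.sparse input).
    data = X.data if sparse.issparse(X) else np.asarray(X)
    if (data < 0).any():
        return False
    # Integral check: the fractional part of every value is zero.
    return not np.mod(data, 1).any()
```

Because it accepts a plain block and returns a scalar, a function of this shape is what `map_blocks` applies to each chunk above.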
Code for benchmarking

Setup

```python
import numpy as np, anndata as ad, h5py
from scipy import sparse

rng = np.random.default_rng()

X = rng.poisson(size=(20_000, 10_000))
X_dense_f32 = X.astype(np.float32)
X_sparse_f32 = sparse.csr_matrix(X_dense_f32)

with h5py.File("arrays.h5", "w") as f:
    g = f["/"]
    ad.experimental.write_elem(g, "X_dense_f32", X_dense_f32)
    ad.experimental.write_elem(g, "X_sparse_f32", X_sparse_f32)
```

Benchmarking
```python
import scanpy as sc
import anndata as ad
import h5py
from scanpy._utils import check_nonnegative_integers
from scipy import sparse
import dask.array as da

with h5py.File("arrays.h5") as f:
    X_dense_f32 = ad.experimental.read_elem(f["X_dense_f32"])
    X_sparse_f32 = ad.experimental.read_elem(f["X_sparse_f32"])

X_dask_f32 = da.from_array(X_dense_f32, chunks=(1000, 10_000))
```

```python
%timeit X_dask_f32.map_blocks(check_nonnegative_integers, dtype=bool, drop_axis=(0, 1)).compute()
```

Testing that it works for sparse arrays:
```python
(
    X_dask_f32
    .map_blocks(sparse.csr_matrix)
    .map_blocks(check_nonnegative_integers, dtype=bool, drop_axis=(0, 1))
    .compute()
)
```

I would note that neither case seems to spend that much time doing computation in parallel, which is a little curious.
The docs for the `map_blocks` function also recommend using `da.reduction` here, though I believe that would take more rewriting and I haven't checked it yet.
Makes sense. I think we need a good sparse story before we think about supporting sparse-in-dask.
I went over all the places where we use the For unfinished features, it's great. Everywhere we can't say "we fully support this" and gradually build in support, we should use it. It has its disadvantages:

I therefore propose that we use

and fixtures for everything where there are ~3 or more test functions using the same list of parameter values.
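As one concrete shape for that proposal, a parametrized fixture shared across tests could look something like this (the names and parameter values are illustrative, not from the PR):

```python
import numpy as np
import pytest

# Illustrative list of "array type" constructors shared by several tests.
ARRAY_TYPES = [np.asarray, lambda x: np.asarray(x, dtype=np.float32)]

@pytest.fixture(params=ARRAY_TYPES)
def array_type(request):
    # Each test requesting this fixture runs once per entry in ARRAY_TYPES.
    return request.param

def test_values_nonnegative(array_type):
    X = array_type(np.ones((3, 3)))
    assert (X >= 0).all()
```

Defining the parameter list once on a fixture keeps the tests in sync when a new array type is added.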
Does doing something like:
ivirshup left a comment
I think the code changes look pretty good (a few comments).
The `MAP_ARRAY_TYPES` may be overkill, but I'm fine with it for now.
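For readers outside the PR, a dispatch table in the spirit of `MAP_ARRAY_TYPES` might look roughly like this (a hypothetical sketch, not the actual scanpy code):

```python
import numpy as np

def _check_dense(X):
    # True iff every value is a non-negative whole number.
    return bool((X >= 0).all() and not np.mod(X, 1).any())

# Hypothetical mapping from array type to handler function; real code
# would add entries for sparse and dask types alongside the dense one.
MAP_ARRAY_TYPES = {
    np.ndarray: _check_dense,
}

def check(X):
    # Dispatch on the concrete array type.
    for typ, fn in MAP_ARRAY_TYPES.items():
        if isinstance(X, typ):
            return fn(X)
    raise NotImplementedError(f"unsupported array type: {type(X)}")
```

The upside of the table is that adding support for a new array type is one new entry; the downside, as the review notes, is extra indirection for a small number of cases.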
See #2621