
Conversation

@flying-sheep (Member)

See #2621

@flying-sheep flying-sheep requested a review from ivirshup October 17, 2023 08:52
@flying-sheep flying-sheep added this to the 1.10.0 milestone Oct 17, 2023
@flying-sheep flying-sheep self-assigned this Oct 17, 2023
@codecov

codecov bot commented Oct 17, 2023

Codecov Report

Merging #2696 (7f122f2) into master (abbee76) will increase coverage by 0.04%.
The diff coverage is 95.89%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2696      +/-   ##
==========================================
+ Coverage   72.87%   72.92%   +0.04%     
==========================================
  Files         110      111       +1     
  Lines       12100    12133      +33     
==========================================
+ Hits         8818     8848      +30     
- Misses       3282     3285       +3     
Files                                              Coverage Δ
scanpy/experimental/pp/_highly_variable_genes.py   63.69% <100.00%> (ø)
scanpy/preprocessing/_utils.py                     45.16% <100.00%> (+1.82%) ⬆️
scanpy/testing/_pytest/fixtures/__init__.py        94.11% <ø> (-2.44%) ⬇️
scanpy/tools/_rank_genes_groups.py                 92.77% <100.00%> (-0.03%) ⬇️
scanpy/testing/_pytest/params.py                   94.73% <94.73%> (ø)
scanpy/_utils/__init__.py                          65.32% <94.11%> (+2.05%) ⬆️

@ivirshup (Member) left a comment

I find that an implementation (using the master branch) that looks like:

X_dask_f32.map_blocks(
    check_nonnegative_integers,
    dtype=bool,
    drop_axis=(0, 1)
).compute()

has a few strong advantages over this approach:

  • It's about 4x faster for the case X = da.array(rng.poisson(size=(20_000, 10_000)), chunks=(1000, 10_000)) (similar speed to the in-memory case)
  • It works when each chunk of the dask array is a sparse matrix

As mentioned in the last PR, this is also the approach that the docs for dask.array seem to suggest. I think we should go with this approach here.
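
For context, a simplified sketch of the kind of per-block check being mapped (my own approximation, not scanpy's actual check_nonnegative_integers): each block collapses to a single boolean, which is what the drop_axis=(0, 1) above reflects.

import numpy as np
from scipy import sparse

def naive_check_nonnegative_integers(x) -> bool:
    # Hypothetical simplified check: every stored value must be >= 0 and integer-valued.
    data = x.data if sparse.issparse(x) else np.ravel(x)
    if np.issubdtype(data.dtype, np.integer):
        return bool((data >= 0).all())
    return bool(((data >= 0) & (data == np.floor(data))).all())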

Code for benchmarking

Setup

import numpy as np, anndata as ad, h5py
from scipy import sparse

rng = np.random.default_rng()

# Poisson-distributed counts: non-negative integer values stored as float32 / CSR below
X = rng.poisson(size=(20_000, 10_000))
X_dense_f32 = X.astype(np.float32)
X_sparse_f32 = sparse.csr_matrix(X_dense_f32)


with h5py.File("arrays.h5", "w") as f:
    g = f["/"]
    ad.experimental.write_elem(g, "X_dense_f32", X_dense_f32)
    ad.experimental.write_elem(g, "X_sparse_f32", X_sparse_f32)

Benchmarking

import scanpy as sc
import anndata as ad
import h5py
from scanpy._utils import check_nonnegative_integers
from scipy import sparse
import dask.array as da

with h5py.File("arrays.h5") as f:
    X_dense_f32 = ad.experimental.read_elem(f["X_dense_f32"])
    X_sparse_f32 = ad.experimental.read_elem(f["X_sparse_f32"])


X_dask_f32 = da.from_array(X_dense_f32, chunks=(1000, 10_000))
%timeit X_dask_f32.map_blocks(check_nonnegative_integers, dtype=bool, drop_axis=(0, 1)).compute()

Testing that it works for sparse arrays:

(
    X_dask_f32
    .map_blocks(sparse.csr_matrix)
    .map_blocks(check_nonnegative_integers, dtype=bool, drop_axis=(0, 1))
    .compute()
)

I would note that neither case seems to spend that much time doing computation in parallel, which is a little curious.

The docs for the map_blocks function also recommend using da.reduction here, though I believe that would take more rewriting, and I haven't checked it yet.
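
For reference, a rough sketch of what the da.reduction variant could look like (my own reading of the dask API, untested against this PR; the helper names are made up):

import numpy as np
import dask.array as da
from scanpy._utils import check_nonnegative_integers

def _chunk(block, axis=None, keepdims=True):
    # One keepdims-shaped boolean per block, so dask can stack the partial results.
    return np.array(check_nonnegative_integers(block), dtype=bool, ndmin=block.ndim)

def _aggregate(partials, axis=None, keepdims=False):
    # The whole array passes only if every block passed.
    return partials.all(axis=axis, keepdims=keepdims)

result = da.reduction(
    X_dask_f32, chunk=_chunk, aggregate=_aggregate, dtype=bool
).compute()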

@flying-sheep (Member, Author)

Makes sense. I think we need a good sparse story before we think about supporting sparse-in-dask.

@flying-sheep (Member, Author)

flying-sheep commented Oct 26, 2023

I went over all the places where we use the array_type fixture and thought about your idea to use @pytest.mark.parametrize, and I came around to it for this case:

For unfinished features, it’s great. Everywhere we can’t say “we fully support this” and are gradually building in support, we should use it.

It has its disadvantages:

  • @pytest.mark.parametrize("array_type", ARRAY_TYPES) is so long that in practice, it’s hard to see the difference from something like this: @pytest.mark.parametrize("array_type", ARRAY_TYPES_XYZ)

    E.g. I don’t like seeing

     @pytest.mark.parametrize("array_type", ARRAY_TYPES)
     @pytest.mark.parametrize("dtype", ["float32", "int64"])

    four times in test_normalize_total. If the third test had a different list of values in one of the params, it would be nearly impossible to spot.

  • Fixtures can depend on other fixtures, but they can’t easily express a parameter matrix on their own (pytest.fixture(params=...) only accepts a single flat list of parameters; we’d have to build the matrix manually with product — see the sketch after this comment).

    That’s why I kept the fixture in test_pca.py.

I therefore propose that we use @pytest.mark.parametrize for

  • things that aren’t heavily reused
  • things we don’t fully support

and fixtures everywhere ~3 or more test functions use the same list of parameter values.
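
A minimal, self-contained sketch of the parameter-matrix point above (the names are stand-ins, not scanpy’s actual fixtures): pytest.fixture(params=...) takes one flat list, so a matrix has to be built by hand, e.g. with itertools.product.

from itertools import product

import numpy as np
import pytest

ARRAY_TYPES = [np.asarray]     # stand-in for the shared list of array types
DTYPES = ["float32", "int64"]  # stand-in dtype axis

@pytest.fixture(params=list(product(ARRAY_TYPES, DTYPES)))
def array_type_and_dtype(request):
    # Each test using this fixture runs once per (array_type, dtype) combination.
    return request.param

def test_nonnegative(array_type_and_dtype):
    array_type, dtype = array_type_and_dtype
    x = array_type(np.zeros((2, 2), dtype=dtype))
    assert bool((np.asarray(x) >= 0).all())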

@flying-sheep flying-sheep requested a review from ivirshup October 27, 2023 11:05
@ivirshup (Member)

@pytest.mark.parametrize("array_type", ARRAY_TYPES) is so long that in practice, it’s hard to see the difference to something like this: @pytest.mark.parametrize("array_type", ARRAY_TYPES_XYZ)

Does doing something like ARRAY_TYPES + [DaskArray] help here? I think this is nice and explicit, plus it’s super obvious how to modify.
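
A hypothetical, self-contained illustration of that pattern (ARRAY_TYPES and the dask converter are stand-ins, not the objects from this PR): keep one shared list and extend it inline where a single test needs more, so the deviation from the base parametrization stays visible at the test itself.

import numpy as np
import pytest
import dask.array as da

ARRAY_TYPES = [np.asarray]  # stand-in for the shared list

def as_dask_array(x):
    # Stand-in for the "DaskArray" entry suggested above.
    return da.from_array(np.asarray(x), chunks=2)

@pytest.mark.parametrize("array_type", ARRAY_TYPES + [as_dask_array])
def test_something(array_type):
    x = array_type([[0, 1], [2, 3]])
    assert bool((np.asarray(x) >= 0).all())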

@ivirshup (Member) left a comment

I think the code changes look pretty good (a few comments).

The MAP_ARRAY_TYPES may be overkill, but I’m fine with it for now.

flying-sheep and others added 2 commits October 30, 2023 17:41
Co-authored-by: Isaac Virshup <[email protected]>
@flying-sheep flying-sheep requested a review from ivirshup October 30, 2023 16:44
@flying-sheep flying-sheep merged commit 52926cc into master Nov 7, 2023
@flying-sheep flying-sheep deleted the dask-utils branch November 7, 2023 13:12