
Conversation

@flying-sheep (Member)

See #2621

@flying-sheep flying-sheep requested a review from ivirshup October 17, 2023 08:52
@flying-sheep flying-sheep added this to the 1.10.0 milestone Oct 17, 2023
@flying-sheep flying-sheep self-assigned this Oct 17, 2023
@codecov

codecov bot commented Oct 17, 2023

Codecov Report

Merging #2696 (7f122f2) into master (abbee76) will increase coverage by 0.04%.
The diff coverage is 95.89%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2696      +/-   ##
==========================================
+ Coverage   72.87%   72.92%   +0.04%     
==========================================
  Files         110      111       +1     
  Lines       12100    12133      +33     
==========================================
+ Hits         8818     8848      +30     
- Misses       3282     3285       +3     
Files                                              Coverage Δ
scanpy/experimental/pp/_highly_variable_genes.py   63.69% <100.00%> (ø)
scanpy/preprocessing/_utils.py                     45.16% <100.00%> (+1.82%) ⬆️
scanpy/testing/_pytest/fixtures/__init__.py        94.11% <ø> (-2.44%) ⬇️
scanpy/tools/_rank_genes_groups.py                 92.77% <100.00%> (-0.03%) ⬇️
scanpy/testing/_pytest/params.py                   94.73% <94.73%> (ø)
scanpy/_utils/__init__.py                          65.32% <94.11%> (+2.05%) ⬆️

@ivirshup (Member) left a comment

I find that an implementation (using the master branch) that looks like:

X_dask_f32.map_blocks(
    check_nonnegative_integers,
    dtype=bool,
    drop_axis=(0, 1)
).compute()

has a few strong advantages over this approach:

  • It's about 4x faster for the case X = da.array(rng.poisson(size=(20_000, 10_000)), chunks=(1000, 10_000)) (similar speed to the in-memory case)
  • It works when each chunk of the dask array is a sparse matrix

As mentioned in the last PR, this is also the approach that the docs for dask.array seem to suggest. I think we should go with this approach here.
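
For context, a simplified sketch of the kind of per-block check being mapped (my own approximation, not scanpy's actual check_nonnegative_integers): each block collapses to a single boolean, which is what the drop_axis=(0, 1) above reflects.

import numpy as np
from scipy import sparse

def naive_check_nonnegative_integers(x) -> bool:
    # Hypothetical simplified check: every stored value must be >= 0 and integer-valued.
    data = x.data if sparse.issparse(x) else np.ravel(x)
    if np.issubdtype(data.dtype, np.integer):
        return bool((data >= 0).all())
    return bool(((data >= 0) & (data == np.floor(data))).all())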

Code for benchmarking

Setup

import numpy as np, anndata as ad, h5py
from scipy import sparse

rng = np.random.default_rng()

# Poisson-distributed counts: non-negative integer values stored as float32 / CSR below
X = rng.poisson(size=(20_000, 10_000))
X_dense_f32 = X.astype(np.float32)
X_sparse_f32 = sparse.csr_matrix(X_dense_f32)


with h5py.File("arrays.h5", "w") as f:
    g = f["/"]
    ad.experimental.write_elem(g, "X_dense_f32", X_dense_f32)
    ad.experimental.write_elem(g, "X_sparse_f32", X_sparse_f32)

Benchmarking

import scanpy as sc
import anndata as ad
import h5py
from scanpy._utils import check_nonnegative_integers
from scipy import sparse
import dask.array as da

with h5py.File("arrays.h5") as f:
    X_dense_f32 = ad.experimental.read_elem(f["X_dense_f32"])
    X_sparse_f32 = ad.experimental.read_elem(f["X_sparse_f32"])


X_dask_f32 = da.from_array(X_dense_f32, chunks=(1000, 10_000))
%timeit X_dask_f32.map_blocks(check_nonnegative_integers, dtype=bool, drop_axis=(0, 1)).compute()

Testing that it works for sparse arrays:

(
    X_dask_f32
    .map_blocks(sparse.csr_matrix)
    .map_blocks(check_nonnegative_integers, dtype=bool, drop_axis=(0, 1))
    .compute()
)

I would note that neither case seems to spend that much time doing computation in parallel, which is a little curious.

The docs for the map_blocks function also recommend using da.reduction here, though I believe that would take more rewriting, and I haven't checked it yet.
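
For reference, a rough sketch of what the da.reduction variant could look like (my own reading of the dask API, untested against this PR; the helper names are made up):

import numpy as np
import dask.array as da
from scanpy._utils import check_nonnegative_integers

def _chunk(block, axis=None, keepdims=True):
    # One keepdims-shaped boolean per block, so dask can stack the partial results.
    return np.array(check_nonnegative_integers(block), dtype=bool, ndmin=block.ndim)

def _aggregate(partials, axis=None, keepdims=False):
    # The whole array passes only if every block passed.
    return partials.all(axis=axis, keepdims=keepdims)

result = da.reduction(
    X_dask_f32, chunk=_chunk, aggregate=_aggregate, dtype=bool
).compute()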

@flying-sheep (Member, Author)

Makes sense. I think we need a good sparse story before we think about supporting sparse-in-dask.

@flying-sheep (Member, Author)

flying-sheep commented Oct 26, 2023

I went over all the places where we use the array_type fixture and thought about your idea to use @pytest.mark.parametrize, and I came around to it for this case:

For unfinished features, it’s great. Everywhere we can’t say “we fully support this” and are gradually building in support, we should use it.

It has its disadvantages:

  • @pytest.mark.parametrize("array_type", ARRAY_TYPES) is so long that in practice, it’s hard to see the difference from something like this: @pytest.mark.parametrize("array_type", ARRAY_TYPES_XYZ)

    E.g. I don’t like seeing

     @pytest.mark.parametrize("array_type", ARRAY_TYPES)
     @pytest.mark.parametrize("dtype", ["float32", "int64"])

    four times in test_normalize_total. If the third test had a different list of values in one of the params, it would be nearly impossible to spot.

  • Fixtures can depend on other fixtures, but they can’t easily express a parameter matrix on their own (pytest.fixture(params=...) only accepts a single flat list of parameters; we’d have to build the matrix manually with product — see the sketch after this comment).

    That’s why I kept the fixture in test_pca.py.

I therefore propose that we use @pytest.mark.parametrize for

  • things that aren’t heavily reused
  • things we don’t fully support

and fixtures everywhere ~3 or more test functions use the same list of parameter values.
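
A minimal, self-contained sketch of the parameter-matrix point above (the names are stand-ins, not scanpy’s actual fixtures): pytest.fixture(params=...) takes one flat list, so a matrix has to be built by hand, e.g. with itertools.product.

from itertools import product

import numpy as np
import pytest

ARRAY_TYPES = [np.asarray]     # stand-in for the shared list of array types
DTYPES = ["float32", "int64"]  # stand-in dtype axis

@pytest.fixture(params=list(product(ARRAY_TYPES, DTYPES)))
def array_type_and_dtype(request):
    # Each test using this fixture runs once per (array_type, dtype) combination.
    return request.param

def test_nonnegative(array_type_and_dtype):
    array_type, dtype = array_type_and_dtype
    x = array_type(np.zeros((2, 2), dtype=dtype))
    assert bool((np.asarray(x) >= 0).all())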

@flying-sheep flying-sheep requested a review from ivirshup October 27, 2023 11:05
@ivirshup (Member)

@pytest.mark.parametrize("array_type", ARRAY_TYPES) is so long that in practice, it’s hard to see the difference to something like this: @pytest.mark.parametrize("array_type", ARRAY_TYPES_XYZ)

Does doing something like ARRAY_TYPES + [DaskArray] help here? I think this is nice and explicit, plus it’s super obvious how to modify.
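
A hypothetical, self-contained illustration of that pattern (ARRAY_TYPES and the dask converter are stand-ins, not the objects from this PR): keep one shared list and extend it inline where a single test needs more, so the deviation from the base parametrization stays visible at the test itself.

import numpy as np
import pytest
import dask.array as da

ARRAY_TYPES = [np.asarray]  # stand-in for the shared list

def as_dask_array(x):
    # Stand-in for the "DaskArray" entry suggested above.
    return da.from_array(np.asarray(x), chunks=2)

@pytest.mark.parametrize("array_type", ARRAY_TYPES + [as_dask_array])
def test_something(array_type):
    x = array_type([[0, 1], [2, 3]])
    assert bool((np.asarray(x) >= 0).all())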

@ivirshup (Member) left a comment

I think the code changes look pretty good (a few comments).

The MAP_ARRAY_TYPES may be overkill, but I’m fine with it for now.

flying-sheep and others added 2 commits October 30, 2023 17:41
Co-authored-by: Isaac Virshup <[email protected]>
@flying-sheep flying-sheep requested a review from ivirshup October 30, 2023 16:44
@flying-sheep flying-sheep merged commit 52926cc into master Nov 7, 2023
@flying-sheep flying-sheep deleted the dask-utils branch November 7, 2023 13:12