
Conversation

@flying-sheep
Member

@flying-sheep flying-sheep commented Aug 17, 2023

This PR

  • makes _utils.check_nonnegative_integers work with Dask arrays, and with that also rank_genes_groups
  • adds type hints I wrote while re-familiarizing myself with some APIs
  • adds a pair of functions lazy_{and,or} that make it possible to delay checks with dask (sketched further below):
>>> import numpy as np
>>> import dask.array as da
>>> def fail():
...     raise AssertionError()
>>> # rhs can be a function or dask array
>>> lazy_and(False, fail)
False
>>> lazy_and(False, da.array(True).map_blocks(lambda _: fail(), meta=np.bool_(True)))
False
>>> # lhs can be a function or dask array for nested use
>>> # when not nested, a lhs function will be called eagerly like in `a() and b`
>>> lazy_and(False, lazy_and(fail, _))
False
>>> # will not create a recursive dask array
>>> lazy_and(da.array(True), da.array(False)).compute()
False
>>> # will complain on invalid use
>>> lazy_and(True, lambda: da.array(...))
'AssertionError: Use lazy_*(_, da.array(...)) instead of lazy_*(_, lambda: da.array(...)).'
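
For illustration, a minimal sketch of the plain-bool-lhs case only (the nested and dask-lhs behaviours from the example above are left out); this is an assumption about the shape of such a helper, not the code in the new scanpy/_utils/_dask.py:

import dask.array as da

def lazy_and(lhs: bool, rhs):
    # Short-circuit on a falsy plain-bool lhs: rhs is never called or computed.
    if not lhs:
        return False
    if callable(rhs):
        rv = rhs()
        if isinstance(rv, da.Array):
            # a lambda returning a dask array defeats the laziness; pass the array directly
            raise AssertionError(
                "Use lazy_*(_, da.array(...)) instead of lazy_*(_, lambda: da.array(...))."
            )
        return rv
    return rhs  # a dask array rhs stays lazy; callers compute it when needed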

@flying-sheep flying-sheep mentioned this pull request Aug 17, 2023
@codecov

codecov bot commented Aug 17, 2023

Codecov Report

Merging #2621 (8a12a1b) into master (555b668) will increase coverage by 0.02%.
The diff coverage is 91.80%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2621      +/-   ##
==========================================
+ Coverage   71.98%   72.01%   +0.02%     
==========================================
  Files         108      109       +1     
  Lines       11905    11934      +29     
==========================================
+ Hits         8570     8594      +24     
- Misses       3335     3340       +5     
Files Coverage Δ
scanpy/_utils/__init__.py 65.74% <100.00%> (ø)
scanpy/experimental/pp/_highly_variable_genes.py 63.69% <100.00%> (ø)
scanpy/preprocessing/_utils.py 44.26% <100.00%> (+0.92%) ⬆️
scanpy/testing/_pytest/fixtures/__init__.py 95.45% <100.00%> (+0.45%) ⬆️
scanpy/tools/_rank_genes_groups.py 92.77% <100.00%> (-0.03%) ⬇️
scanpy/_utils/_dask.py 81.48% <81.48%> (ø)

@flying-sheep flying-sheep marked this pull request as ready for review August 17, 2023 15:32
@flying-sheep flying-sheep added this to the 1.10.0 milestone Aug 17, 2023
@flying-sheep flying-sheep self-assigned this Aug 28, 2023
Member

@ivirshup ivirshup left a comment


What is the goal of this PR?

To me, I think of dask support as:

  • Making a computation function work all the way through with dask arrays, returning a lazy result
  • Allowing us to scale out a computation, e.g. showing that this works on larger data.

I'm wondering if "returns a result for dask arrays" is a level of support we want (for non-plotting functions). WDYT?

@flying-sheep
Member Author

flying-sheep commented Aug 28, 2023

I don’t think that’s possible without changing what kinds of checks we do. E.g. check_nonnegative_integers checks if every single value in the data is a whole number.

But sure, it’s more useful to make sure no n_obs × ? array ever gets converted into an in-memory array if dask is used. The question is “make everything work with dask now, make sure everything is fast later” or “make a few functions work and go fast now, handle other functions later”.
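
For context, a hypothetical sketch of how the check can stay lazy on dense dask input; the function name and the exact whole-number test are my assumptions, but the signbit/any/invert shape matches the graph posted further down:

import numpy as np
import dask.array as da

def check_nonnegative_integers_lazy(X: da.Array) -> da.Array:
    # Both conditions are 0-d boolean dask arrays; nothing is computed here.
    has_negative = X.map_blocks(np.signbit, meta=np.array([], dtype=bool)).any()
    is_whole = (X % 1 == 0).all()
    return ~has_negative & is_whole  # caller decides when (or whether) to .compute()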

@ivirshup
Member

The question is “make everything work with dask now, make sure everything is fast later” or “make a few functions work and go fast now, handle other functions later”

My worry with this is that if we ever do make the computation delayed, that's a behavior change for dask arrays. E.g. the result changes from immediate to delayed, and functions may then have to call .compute to get a result.

@flying-sheep
Member Author

OK, makes sense. I’ll keep working on this then.

@ivirshup
Member

ivirshup commented Aug 28, 2023

FWIW, I do think this is one of the more difficult cases to make work well with dask.

I suspect we'd end up wanting something like https://flox.readthedocs.io/en/latest/ here.

Then I'd do a groupby on the groups, calculate the sum and squared sum for each, do another level of aggregation for the "rest" categories, then derive mean and variance from there.
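
For concreteness, a sketch of that last derivation step: given per-group sums and sums of squares (obtainable e.g. via flox.groupby_reduce(..., func="sum")) plus group sizes, the "rest" statistics fall out without another pass over the data. Names and shapes here are assumptions:

import numpy as np

def rest_mean_var(s1: np.ndarray, s2: np.ndarray, n: np.ndarray):
    # s1, s2: per-group sum and sum of squares, shape (n_groups, n_vars)
    # n: per-group observation counts, shape (n_groups,)
    s1_rest = s1.sum(axis=0, keepdims=True) - s1
    s2_rest = s2.sum(axis=0, keepdims=True) - s2
    n_rest = (n.sum() - n)[:, None]
    mean_rest = s1_rest / n_rest
    var_rest = s2_rest / n_rest - mean_rest**2  # ddof=0; rescale for ddof=1 if needed
    return mean_rest, var_rest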

@flying-sheep flying-sheep changed the title Handle Dask arrays in rank_genes_groups Handle Dask arrays in some utilities Aug 29, 2023
@flying-sheep flying-sheep removed the request for review from ivirshup September 1, 2023 12:18
@flying-sheep
Member Author

flying-sheep commented Oct 9, 2023

OK, so here is the graph from test_check_nonnegative_integers, generated via

import dask
# `rv`: the lazy result being visualized; `request` is pytest's request fixture
dask.visualize(rv, engine="cytoscape", filename=request.node.name)

flowchart LR

4342680003182880388["(0, 0)"]
4363265869429493506((signbit))
4342680003182880388 --> 4363265869429493506

550394025789815978["(0, 1)"]
5167392774578066548((signbit))
550394025789815978 --> 5167392774578066548

3764517430824237394["(1, 0)"]
5336730508589856979((signbit))
3764517430824237394 --> 5336730508589856979

2743123325277761031["(1, 1)"]
2513425685193572888((signbit))
2743123325277761031 --> 2513425685193572888

284808496767994156["(0, 0)"]
4363265869429493506 --> 284808496767994156

6263727941369393084((any))
284808496767994156 --> 6263727941369393084

6222832259334004269["(0, 1)"]
5167392774578066548 --> 6222832259334004269

7256567839680908872((any))
6222832259334004269 --> 7256567839680908872

8881403918513157720["(1, 0)"]
5336730508589856979 --> 8881403918513157720

5898621639535744825((any))
8881403918513157720 --> 5898621639535744825

2373763162411159295["(1, 1)"]
2513425685193572888 --> 2373763162411159295

1659302467096852217((any))
2373763162411159295 --> 1659302467096852217

7195453449900658805["(0, 0)"]
6263727941369393084 --> 7195453449900658805

7976077601232067203((any-\naggregate))
7195453449900658805 --> 7976077601232067203

687812693798660380["(0, 1)"]
7256567839680908872 --> 687812693798660380
687812693798660380 --> 7976077601232067203

3901936098833081796["(1, 0)"]
5898621639535744825 --> 3901936098833081796
3901936098833081796 --> 7976077601232067203

8795010127805778162["(1, 1)"]
1659302467096852217 --> 8795010127805778162
8795010127805778162 --> 7976077601232067203

1203378416021505679["()"]
7976077601232067203 --> 1203378416021505679
9179805111332178500((invert))

1203378416021505679 --> 9179805111332178500
5169565091578776769["()"]
9179805111332178500 --> 5169565091578776769
814146044537405006((and))
5169565091578776769 --> 814146044537405006

1050532709569538834["()"]
814146044537405006 --> 1050532709569538834

I am of course using map_blocks. If we really wanted, I assume we could still replace sequences of two operations like

flowchart LR

step0["(0, 0)"] --> op0((signbit)) --> step1["(0, 0)"] --> op1((any)) --> step2["(0, 0)"]

with a single fused operation each, but I’m not sure that’s worth the cost in code readability.

Smells of premature optimization.
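
For what it’s worth, such a fusion could be sketched with da.reduction, so each chunk runs one combined signbit-plus-any task instead of two; the names and the exact call below are assumptions, not a proposal for this PR:

import numpy as np
import dask.array as da

def _chunk_has_negative(block, axis=None, keepdims=False):
    # one task per chunk: signbit and the chunk-level any, fused
    return np.signbit(block).any(axis=axis, keepdims=keepdims)

X = da.random.random((4, 4), chunks=2)
has_negative = da.reduction(X, chunk=_chunk_has_negative, aggregate=np.any, dtype=bool)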

mean_var graph
flowchart LR

step000["(0, 0)"] --> op000((mean_\nchunk)) --> step001["(0, 0)"] --> op00((mean_agg-\naggregate)) --> step00["0"]
step100["(1, 0)"] --> op100((mean_\nchunk)) --> step101["(1, 0)"] --> op00

step010["(0, 1)"] --> op010((mean_\nchunk)) --> step011["(0, 1)"] --> op10((mean_agg-\naggregate)) --> step10["1"]
step110["(1, 1)"] --> op110((mean_\nchunk)) --> step111["(1, 1)"] --> op10
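
The same chunk→aggregate shape falls out of plain dask reductions as well; as a sketch (not necessarily what scanpy’s _get_mean_var does), both moments can be taken lazily and combined:

import dask.array as da

def mean_var(X: da.Array, axis: int = 0):
    # E[x] and E[x**2] as lazy reductions; var = E[x**2] - E[x]**2 (ddof=0)
    mean = X.mean(axis=axis)
    mean_sq = (X * X).mean(axis=axis)
    return mean, mean_sq - mean**2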
