
Conversation

@flying-sheep
Member

@flying-sheep flying-sheep commented Aug 17, 2023

This PR

  • makes _utils.check_nonnegative_integers work with Dask arrays, and with that also rank_genes_groups
  • adds type hints I wrote while re-familiarizing myself with some APIs
  • adds a pair of functions lazy_{and,or} that make it possible to delay checks with dask (sketched further below):
>>> import numpy as np
>>> import dask.array as da
>>> def fail():
...     raise AssertionError()
>>> # rhs can be a function or dask array
>>> lazy_and(False, fail)
False
>>> lazy_and(False, da.array(True).map_blocks(lambda _: fail(), meta=np.bool_(True)))
False
>>> # lhs can be a function or dask array for nested use
>>> # when not nested, a lhs function will be called eagerly like in `a() and b`
>>> lazy_and(False, lazy_and(fail, _))
False
>>> # will not create a recursive dask array
>>> lazy_and(da.array(True), da.array(False)).compute()
False
>>> # will complain on invalid use
>>> lazy_and(True, lambda: da.array(...))
'AssertionError: Use lazy_*(_, da.array(...)) instead of lazy_*(_, lambda: da.array(...)).'
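
For illustration, a minimal sketch of the plain-bool-lhs case only (the nested and dask-lhs behaviours from the example above are left out); this is an assumption about the shape of such a helper, not the code in the new scanpy/_utils/_dask.py:

import dask.array as da

def lazy_and(lhs: bool, rhs):
    # Short-circuit on a falsy plain-bool lhs: rhs is never called or computed.
    if not lhs:
        return False
    if callable(rhs):
        rv = rhs()
        if isinstance(rv, da.Array):
            # a lambda returning a dask array defeats the laziness; pass the array directly
            raise AssertionError(
                "Use lazy_*(_, da.array(...)) instead of lazy_*(_, lambda: da.array(...))."
            )
        return rv
    return rhs  # a dask array rhs stays lazy; callers compute it when needed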

@flying-sheep flying-sheep mentioned this pull request Aug 17, 2023
@codecov

codecov bot commented Aug 17, 2023

Codecov Report

Merging #2621 (8a12a1b) into master (555b668) will increase coverage by 0.02%.
The diff coverage is 91.80%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2621      +/-   ##
==========================================
+ Coverage   71.98%   72.01%   +0.02%     
==========================================
  Files         108      109       +1     
  Lines       11905    11934      +29     
==========================================
+ Hits         8570     8594      +24     
- Misses       3335     3340       +5     
Files Coverage Δ
scanpy/_utils/__init__.py 65.74% <100.00%> (ø)
scanpy/experimental/pp/_highly_variable_genes.py 63.69% <100.00%> (ø)
scanpy/preprocessing/_utils.py 44.26% <100.00%> (+0.92%) ⬆️
scanpy/testing/_pytest/fixtures/__init__.py 95.45% <100.00%> (+0.45%) ⬆️
scanpy/tools/_rank_genes_groups.py 92.77% <100.00%> (-0.03%) ⬇️
scanpy/_utils/_dask.py 81.48% <81.48%> (ø)

@flying-sheep flying-sheep marked this pull request as ready for review August 17, 2023 15:32
@flying-sheep flying-sheep added this to the 1.10.0 milestone Aug 17, 2023
@flying-sheep flying-sheep self-assigned this Aug 28, 2023
Member

@ivirshup ivirshup left a comment


What is the goal of this PR?

To me, I think of dask support as:

  • Making a computation function work all the way through with dask arrays, returning a lazy result
  • Allowing us to scale out a computation, e.g. showing that this works on larger data.

I'm wondering if "returns a result for dask arrays" is a level of support we want (for non-plotting functions). WDYT?

@flying-sheep
Member Author

flying-sheep commented Aug 28, 2023

I don’t think that’s possible without changing what kinds of checks we do. E.g. check_nonnegative_integers checks if every single value in the data is a whole number.

But sure, it’s more useful to make sure no n_obs × ? array ever gets converted into an in-memory array if dask is used. The question is “make everything work with dask now, make sure everything is fast later” or “make a few functions work and go fast now, handle other functions later”.
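
For context, a hypothetical sketch of how the check can stay lazy on dense dask input; the function name and the exact whole-number test are my assumptions, but the signbit/any/invert shape matches the graph posted further down:

import numpy as np
import dask.array as da

def check_nonnegative_integers_lazy(X: da.Array) -> da.Array:
    # Both conditions are 0-d boolean dask arrays; nothing is computed here.
    has_negative = X.map_blocks(np.signbit, meta=np.array([], dtype=bool)).any()
    is_whole = (X % 1 == 0).all()
    return ~has_negative & is_whole  # caller decides when (or whether) to .compute()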

@ivirshup
Member

The question is “make everything work with dask now, make sure everything is fast later” or “make a few functions work and go fast now, handle other functions later”

My worry with this is that if we ever do make the computation delayed, that's a behavior change for dask arrays. E.g. the result changes from immediate to delayed, and functions may then have to call .compute to get a result.

@flying-sheep
Member Author

OK, makes sense. I’ll keep working on this then.

@ivirshup
Member

ivirshup commented Aug 28, 2023

FWIW, I do think this is one of the more difficult cases to make work well with dask.

I suspect we'd end up wanting something like https://flox.readthedocs.io/en/latest/ here.

Then I'd do a groupby on the groups, calculate the sum and squared sum for each, do another level of aggregation for the "rest" categories, then derive mean and variance from there.
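
For concreteness, a sketch of that last derivation step: given per-group sums and sums of squares (obtainable e.g. via flox.groupby_reduce(..., func="sum")) plus group sizes, the "rest" statistics fall out without another pass over the data. Names and shapes here are assumptions:

import numpy as np

def rest_mean_var(s1: np.ndarray, s2: np.ndarray, n: np.ndarray):
    # s1, s2: per-group sum and sum of squares, shape (n_groups, n_vars)
    # n: per-group observation counts, shape (n_groups,)
    s1_rest = s1.sum(axis=0, keepdims=True) - s1
    s2_rest = s2.sum(axis=0, keepdims=True) - s2
    n_rest = (n.sum() - n)[:, None]
    mean_rest = s1_rest / n_rest
    var_rest = s2_rest / n_rest - mean_rest**2  # ddof=0; rescale for ddof=1 if needed
    return mean_rest, var_rest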

@flying-sheep flying-sheep changed the title Handle Dask arrays in rank_genes_groups Handle Dask arrays in some utilities Aug 29, 2023
@flying-sheep flying-sheep removed the request for review from ivirshup September 1, 2023 12:18
@flying-sheep
Member Author

flying-sheep commented Oct 9, 2023

OK, so here is the graph from test_check_nonnegative_integers, generated via

import dask
# `rv`: the lazy result being visualized; `request` is pytest's request fixture
dask.visualize(rv, engine="cytoscape", filename=request.node.name)

flowchart LR

4342680003182880388["(0, 0)"]
4363265869429493506((signbit))
4342680003182880388 --> 4363265869429493506

550394025789815978["(0, 1)"]
5167392774578066548((signbit))
550394025789815978 --> 5167392774578066548

3764517430824237394["(1, 0)"]
5336730508589856979((signbit))
3764517430824237394 --> 5336730508589856979

2743123325277761031["(1, 1)"]
2513425685193572888((signbit))
2743123325277761031 --> 2513425685193572888

284808496767994156["(0, 0)"]
4363265869429493506 --> 284808496767994156

6263727941369393084((any))
284808496767994156 --> 6263727941369393084

6222832259334004269["(0, 1)"]
5167392774578066548 --> 6222832259334004269

7256567839680908872((any))
6222832259334004269 --> 7256567839680908872

8881403918513157720["(1, 0)"]
5336730508589856979 --> 8881403918513157720

5898621639535744825((any))
8881403918513157720 --> 5898621639535744825

2373763162411159295["(1, 1)"]
2513425685193572888 --> 2373763162411159295

1659302467096852217((any))
2373763162411159295 --> 1659302467096852217

7195453449900658805["(0, 0)"]
6263727941369393084 --> 7195453449900658805

7976077601232067203((any-\naggregate))
7195453449900658805 --> 7976077601232067203

687812693798660380["(0, 1)"]
7256567839680908872 --> 687812693798660380
687812693798660380 --> 7976077601232067203

3901936098833081796["(1, 0)"]
5898621639535744825 --> 3901936098833081796
3901936098833081796 --> 7976077601232067203

8795010127805778162["(1, 1)"]
1659302467096852217 --> 8795010127805778162
8795010127805778162 --> 7976077601232067203

1203378416021505679["()"]
7976077601232067203 --> 1203378416021505679
9179805111332178500((invert))

1203378416021505679 --> 9179805111332178500
5169565091578776769["()"]
9179805111332178500 --> 5169565091578776769
814146044537405006((and))
5169565091578776769 --> 814146044537405006

1050532709569538834["()"]
814146044537405006 --> 1050532709569538834

I am of course using map_blocks. If we really wanted, I assume we could still replace sequences of two operations like

flowchart LR

step0["(0, 0)"] --> op0((signbit)) --> step1["(0, 0)"] --> op1((any)) --> step2["(0, 0)"]

with a single fused operation each, but I’m not sure that’s worth the cost in code readability.

Smells of premature optimization.
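
For what it’s worth, such a fusion could be sketched with da.reduction, so each chunk runs one combined signbit-plus-any task instead of two; the names and the exact call below are assumptions, not a proposal for this PR:

import numpy as np
import dask.array as da

def _chunk_has_negative(block, axis=None, keepdims=False):
    # one task per chunk: signbit and the chunk-level any, fused
    return np.signbit(block).any(axis=axis, keepdims=keepdims)

X = da.random.random((4, 4), chunks=2)
has_negative = da.reduction(X, chunk=_chunk_has_negative, aggregate=np.any, dtype=bool)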

mean_var graph
flowchart LR

step000["(0, 0)"] --> op000((mean_\nchunk)) --> step001["(0, 0)"] --> op00((mean_agg-\naggregate)) --> step00["0"]
step100["(1, 0)"] --> op100((mean_\nchunk)) --> step101["(1, 0)"] --> op00

step010["(0, 1)"] --> op010((mean_\nchunk)) --> step011["(0, 1)"] --> op10((mean_agg-\naggregate)) --> step10["1"]
step110["(1, 1)"] --> op110((mean_\nchunk)) --> step111["(1, 1)"] --> op10
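
The same chunk→aggregate shape falls out of plain dask reductions as well; as a sketch (not necessarily what scanpy’s _get_mean_var does), both moments can be taken lazily and combined:

import dask.array as da

def mean_var(X: da.Array, axis: int = 0):
    # E[x] and E[x**2] as lazy reductions; var = E[x**2] - E[x]**2 (ddof=0)
    mean = X.mean(axis=axis)
    mean_sq = (X * X).mean(axis=axis)
    return mean, mean_sq - mean**2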
