Handle Dask arrays in some utilities #2621
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2621      +/-   ##
==========================================
+ Coverage   71.98%   72.01%   +0.02%
==========================================
  Files         108      109       +1
  Lines       11905    11934      +29
==========================================
+ Hits         8570     8594      +24
- Misses       3335     3340       +5
```
ivirshup
left a comment
What is the goal of this PR?

To me, dask support means:
- Making a computation function work all the way through with dask arrays, returning a lazy result
- Allowing us to scale out a computation, e.g. showing that this works on larger data

I'm wondering if "returns a result for dask arrays" is a level of support we want (for non-plotting functions). WDYT?
I don’t think that’s possible without changing what kinds of checks we do. But sure, it’s more useful to make sure no computation is triggered.
My worry with this is that if we do ever make the computation delayed, it's a behavior change for dask arrays: the result changes from immediate to delayed, and now callers may have to call `.compute()` on it.
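A minimal pure-Python sketch of the worry above: a delayed result modeled as a thunk (standing in for a dask object) is itself truthy, so code written against the old eager behavior silently takes the wrong branch instead of materializing the value first. The names here are illustrative, not scanpy's API.

```python
# A "delayed" boolean result, modeled as a plain thunk instead of a
# dask object: the actual value is only produced when it is called.
def delayed_check():
    return False

taken_eager_branch = False

# Caller written against the old eager behavior: the thunk object is
# truthy even though the value it would produce is False.
if delayed_check:
    taken_eager_branch = True

# A correct caller must now materialize the result first, analogous
# to calling .compute() on a dask array/delayed object.
value = delayed_check()
```

This is exactly the kind of silent breakage that makes switching an existing function from immediate to delayed results an API change rather than an internal detail.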
OK, makes sense. I’ll keep working on this then.
FWIW, I do think this is one of the more difficult cases to make work well with dask. I would suspect we'd end up wanting something like https://flox.readthedocs.io/en/latest/ here. Then I'd do a groupby on the groups, calculate the sum and squared sum for each, and do another level of aggregation on top of that.
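A sketch of the two-level aggregation described above, in plain NumPy rather than flox or dask: each "chunk" pass accumulates sufficient statistics per group (count, sum, sum of squares), and a second level combines the partials and derives mean and variance. Function names here are illustrative.

```python
import numpy as np

def partial_stats(values, groups, n_groups):
    # First level, per chunk: sufficient statistics for each group.
    count = np.bincount(groups, minlength=n_groups)
    s = np.bincount(groups, weights=values, minlength=n_groups)
    ss = np.bincount(groups, weights=values ** 2, minlength=n_groups)
    return count, s, ss

def combine(parts):
    # Second level: partials from all chunks simply add up, so this
    # step is cheap and associative (which is what dask needs).
    count = sum(p[0] for p in parts)
    s = sum(p[1] for p in parts)
    ss = sum(p[2] for p in parts)
    mean = s / count
    var = ss / count - mean ** 2  # population variance
    return mean, var

rng = np.random.default_rng(0)
x = rng.normal(size=100)
g = rng.integers(0, 3, size=100)

# Split the data into two "chunks" and aggregate the way a chunked
# groupby-reduction would.
parts = [partial_stats(x[:50], g[:50], 3), partial_stats(x[50:], g[50:], 3)]
mean, var = combine(parts)
```

Because the per-chunk partials are associative, this structure maps directly onto a chunk → combine → aggregate task graph.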
OK, so the graph in

```python
import dask

dask.visualize(rv, engine="cytoscape", filename=request.node.name)
```

is:

```mermaid
flowchart LR
4342680003182880388["(0, 0)"]
4363265869429493506((signbit))
4342680003182880388 --> 4363265869429493506
550394025789815978["(0, 1)"]
5167392774578066548((signbit))
550394025789815978 --> 5167392774578066548
3764517430824237394["(1, 0)"]
5336730508589856979((signbit))
3764517430824237394 --> 5336730508589856979
2743123325277761031["(1, 1)"]
2513425685193572888((signbit))
2743123325277761031 --> 2513425685193572888
284808496767994156["(0, 0)"]
4363265869429493506 --> 284808496767994156
6263727941369393084((any))
284808496767994156 --> 6263727941369393084
6222832259334004269["(0, 1)"]
5167392774578066548 --> 6222832259334004269
7256567839680908872((any))
6222832259334004269 --> 7256567839680908872
8881403918513157720["(1, 0)"]
5336730508589856979 --> 8881403918513157720
5898621639535744825((any))
8881403918513157720 --> 5898621639535744825
2373763162411159295["(1, 1)"]
2513425685193572888 --> 2373763162411159295
1659302467096852217((any))
2373763162411159295 --> 1659302467096852217
7195453449900658805["(0, 0)"]
6263727941369393084 --> 7195453449900658805
7976077601232067203((any-\naggregate))
7195453449900658805 --> 7976077601232067203
687812693798660380["(0, 1)"]
7256567839680908872 --> 687812693798660380
687812693798660380 --> 7976077601232067203
3901936098833081796["(1, 0)"]
5898621639535744825 --> 3901936098833081796
3901936098833081796 --> 7976077601232067203
8795010127805778162["(1, 1)"]
1659302467096852217 --> 8795010127805778162
8795010127805778162 --> 7976077601232067203
1203378416021505679["()"]
7976077601232067203 --> 1203378416021505679
9179805111332178500((invert))
1203378416021505679 --> 9179805111332178500
5169565091578776769["()"]
9179805111332178500 --> 5169565091578776769
814146044537405006((and))
5169565091578776769 --> 814146044537405006
1050532709569538834["()"]
814146044537405006 --> 1050532709569538834
```
I am of course using

```mermaid
flowchart LR
step0["(0, 0)"] --> op0((signbit)) --> step1["(0, 0)"] --> op1((any)) --> step2["(0, 0)"]
```

with individual operations, but I’m not sure if that’s worth the code readability problems. Smells of premature optimization.

mean_var graph:

```mermaid
flowchart LR
step000["(0, 0)"] --> op000((mean_\nchunk)) --> step001["(0, 0)"] --> op00((mean_agg-\naggregate)) --> step00["0"]
step100["(1, 0)"] --> op100((mean_\nchunk)) --> step101["(1, 0)"] --> op00
step010["(0, 1)"] --> op010((mean_\nchunk)) --> step011["(0, 1)"] --> op10((mean_agg-\naggregate)) --> step10["1"]
step110["(1, 1)"] --> op110((mean_\nchunk)) --> step111["(1, 1)"] --> op10
```
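The mean_var graph above has the classic chunk/aggregate shape: each block emits a partial result (`mean_chunk`), and one aggregate node per column combines them (`mean_agg-aggregate`). A small NumPy stand-in for that structure, with illustrative function names:

```python
import numpy as np

def mean_chunk(block):
    # Per-chunk partial: column sums plus the row count, so partials
    # from differently sized chunks can still be combined exactly.
    return block.sum(axis=0), block.shape[0]

def mean_agg(partials):
    # Aggregate node: combine the partials from every chunk.
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    return total / n

x = np.arange(8.0).reshape(4, 2)
chunks = [x[:2], x[2:]]  # two row-chunks, as in the graph above

col_means = mean_agg([mean_chunk(c) for c in chunks])
```

A variance graph adds a squared-sum partial to `mean_chunk` but keeps the same chunk-then-aggregate topology.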
This reverts commit 02a0f7b.
Co-authored-by: Isaac Virshup <[email protected]>
This PR
- makes `_utils.check_nonnegative_integers` work with Dask arrays, and with that also `rank_genes_groups`
- adds `lazy_{and,or}` that make it possible to delay checks with dask:
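A hedged sketch (not the PR's actual implementation) of what `lazy_and`/`lazy_or`-style helpers can look like: they combine boolean conditions without forcing evaluation, so a dask-backed check stays lazy until it is explicitly computed. Here the lazy values are plain thunks standing in for dask objects.

```python
def lazy_and(*conds):
    # Returns a thunk; nothing is evaluated until it is called, and
    # all() short-circuits on the first falsy condition.
    return lambda: all(c() for c in conds)

def lazy_or(*conds):
    # Same idea with any(): stops at the first truthy condition.
    return lambda: any(c() for c in conds)

check = lazy_and(lambda: True, lambda: 1 + 1 == 2)
result = check()  # only now is anything evaluated
```

With dask, the same idea would combine delayed booleans into a single delayed expression instead of calling plain thunks, so the combined check joins the task graph rather than triggering computation.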