I'm getting unexpectedly high memory usage with flox. Here's what I've been doing:
```python
import dask.distributed as dd
import dask.array as da
import numpy as np
import flox

cluster = dd.LocalCluster(n_workers=3)
client = dd.Client(cluster)

# ~160 GB of float64 data in total, in 100 chunks of (10_000, 20_000)
M, N = 1_000_000, 20_000
X = da.random.normal(size=(M, N), chunks=(10_000, N))

# 5,000 groups along the length-M axis
by = np.random.choice(5_000, size=M)

res, codes = flox.groupby_reduce(
    X.T,
    by,
    func="sum",
    fill_value=0,
    method="map-reduce",
    reindex=True,
)
res_comp = res.compute()
```
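For context, here is a back-of-the-envelope estimate of the sizes involved (my own arithmetic, assuming float64 and assuming that with `reindex=True` each chunk's partial result is reindexed to the full set of groups before the combine step):

```python
# Rough memory arithmetic for the reproducer above (float64 = 8 bytes).
M, N = 1_000_000, 20_000
n_groups = 5_000
chunk_rows = 10_000
bytes_per = 8

# One input chunk of X: (10_000, 20_000) float64
chunk_bytes = chunk_rows * N * bytes_per

# Assumption: with reindex=True, each chunk's partial sum is expanded to
# shape (N, n_groups) before combining, i.e. (20_000, 5_000) float64.
intermediate_bytes = N * n_groups * bytes_per

print(f"input chunk:       {chunk_bytes / 1e9:.1f} GB")
print(f"per-chunk partial: {intermediate_bytes / 1e9:.1f} GB")
```

If that assumption is right, each of the 100 chunks produces a 0.8 GB dense intermediate on top of the 1.6 GB input chunk, so holding even a handful of these in flight per worker would plausibly exhaust memory.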
This always warns about memory usage and then fails on my dev machine, which has 64 GB of memory. However, I can run plenty of other operations on an array this size (e.g. PCA, simple reductions), so I would expect a tree reduction here to handle it comfortably.
Is this just me and my compute being odd, or is my expectation incorrect?
cc: @ilan-gold