I'm getting unexpectedly high memory usage with flox. Here's what I've been doing:
```python
import dask.distributed as dd
import dask.array as da
import numpy as np
import flox

cluster = dd.LocalCluster(n_workers=3)
client = dd.Client(cluster)

# ~160 GB of float64 data in total, in 100 chunks of (10_000, 20_000)
M, N = 1_000_000, 20_000
X = da.random.normal(size=(M, N), chunks=(10_000, N))

# 5,000 groups along the length-M axis
by = np.random.choice(5_000, size=M)

res, codes = flox.groupby_reduce(
    X.T,
    by,
    func="sum",
    fill_value=0,
    method="map-reduce",
    reindex=True,
)
res_comp = res.compute()
```
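For context, here is a back-of-the-envelope estimate of the sizes involved (my own arithmetic, assuming float64 and assuming that with `reindex=True` each chunk's partial result is reindexed to the full set of groups before the combine step):

```python
# Rough memory arithmetic for the reproducer above (float64 = 8 bytes).
M, N = 1_000_000, 20_000
n_groups = 5_000
chunk_rows = 10_000
bytes_per = 8

# One input chunk of X: (10_000, 20_000) float64
chunk_bytes = chunk_rows * N * bytes_per

# Assumption: with reindex=True, each chunk's partial sum is expanded to
# shape (N, n_groups) before combining, i.e. (20_000, 5_000) float64.
intermediate_bytes = N * n_groups * bytes_per

print(f"input chunk:       {chunk_bytes / 1e9:.1f} GB")
print(f"per-chunk partial: {intermediate_bytes / 1e9:.1f} GB")
```

If that assumption is right, each of the 100 chunks produces a 0.8 GB dense intermediate on top of the 1.6 GB input chunk, so holding even a handful of these in flight per worker would plausibly exhaust memory.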
This always warns about memory usage and then fails on my dev machine, which has 64 GB of memory. However, I can run plenty of other operations on an array this size (e.g. PCA, simple reductions), so I would expect a tree reduction here to handle it comfortably.
Is this just me and my compute being odd, or is my expectation incorrect?
cc: @ilan-gold