Problem
Whenever we reuse an intermediate result across a pipeline breaker (such as a shuffle, join, reduction, or groupby operation), we are forced to materialize the entire intermediate result. This breaks the pipelining that we could otherwise exploit for reuse, for example between multiple element-wise operations.
This materialization puts a hard limit on our ability to scale, as I have observed in multiple TPC-H benchmark queries.
To illustrate this, run these two snippets on a cluster of your choice:
With full intermediate result materialization
from dask_expr.datasets import timeseries
from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # To compute mean, we have to fully materialize df, and we won't free its data
        # until we have reused the chunks to compute a partial of the sum.
        mean = df["x"].mean()
        df[df["x"] > mean].sum().compute()
Without full intermediate result materialization
from dask_expr.datasets import timeseries
from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # Compute the mean beforehand so that we don't have to keep all of `df` in memory
        mean = df["x"].mean().compute()
        df[df["x"] > mean].sum().compute()
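The trade-off between the two snippets can be modeled in plain Python, without Dask. This is only a rough sketch of the idea, not a description of how the scheduler actually behaves: `ChunkSource`, `reuse_materialized`, and `recompute` are hypothetical names invented for illustration, and "chunks held" is a crude proxy for memory use.

```python
from typing import Iterator, List, Tuple

CHUNK_SIZE = 4
N_CHUNKS = 5


class ChunkSource:
    """Hypothetical stand-in for a partitioned collection; counts chunk creations."""

    def __init__(self) -> None:
        self.created = 0

    def chunks(self) -> Iterator[List[float]]:
        for i in range(N_CHUNKS):
            self.created += 1
            # Deterministic data, so both strategies see identical values.
            yield [float(i * CHUNK_SIZE + j) for j in range(CHUNK_SIZE)]


def reuse_materialized(source: ChunkSource) -> Tuple[float, int]:
    # The mean is a reduction (a pipeline breaker): it must finish before the
    # filter can start, so every chunk stays alive until then.
    held = list(source.chunks())
    peak_chunks_held = len(held)
    mean = sum(sum(c) for c in held) / sum(len(c) for c in held)
    result = sum(x for c in held for x in c if x > mean)
    return result, peak_chunks_held


def recompute(source: ChunkSource) -> Tuple[float, int]:
    # First pass: stream the chunks to get the mean, one chunk at a time.
    total, count = 0.0, 0
    for c in source.chunks():
        total += sum(c)
        count += len(c)
    mean = total / count
    # Second pass: regenerate the chunks instead of reusing them.
    result = 0.0
    for c in source.chunks():
        result += sum(x for x in c if x > mean)
    return result, 1  # only one chunk is alive at any time


reused_source = ChunkSource()
print(reuse_materialized(reused_source), reused_source.created)  # (145.0, 5) 5

recomputed_source = ChunkSource()
print(recompute(recomputed_source), recomputed_source.created)   # (145.0, 1) 10
```

Both strategies return the same result here, but reuse holds all chunks at once while recomputation holds one chunk at a time and creates every chunk twice.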
Possible solution
The easiest approach would be to never reuse any intermediate results. This has a few downsides:
- Non-deterministic functions will lead to unexpected results, because each recomputation may produce different values
- We waste a lot of computational resources on recomputations
...but it will allow us to scale.
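The first downside can be made concrete with a small plain-Python sketch. It assumes a hypothetical source whose values change on every evaluation (`make_unstable_source` is invented for illustration and uses a call counter rather than real randomness, so that the outcome is reproducible); it is not Dask code.

```python
from itertools import count
from typing import Callable, List


def make_unstable_source() -> Callable[[], List[float]]:
    """Hypothetical source that returns different values on every evaluation."""
    calls = count()

    def source() -> List[float]:
        offset = 100.0 * next(calls)
        return [offset + 1.0, offset + 2.0, offset + 3.0]

    return source


def filtered_sum_with_reuse(source: Callable[[], List[float]]) -> float:
    data = source()  # evaluated once; the same values feed mean and filter
    mean = sum(data) / len(data)
    return sum(x for x in data if x > mean)


def filtered_sum_with_recompute(source: Callable[[], List[float]]) -> float:
    first = source()
    mean = sum(first) / len(first)
    second = source()  # re-evaluated: the filter sees different values
    return sum(x for x in second if x > mean)


print(filtered_sum_with_reuse(make_unstable_source()))      # 3.0
print(filtered_sum_with_recompute(make_unstable_source()))  # 306.0
```

With reuse, the mean and the filter refer to the same data. With recomputation, the mean from the first evaluation is compared against values from a second, different evaluation, so the result no longer corresponds to any single version of the input.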
We can certainly get smarter about intermediate result materialization, but how much effort this requires depends on how smart we want to be. There is a body of ongoing research, along with existing implementations, in the database world that we could draw from.