Reusing intermediate results causes memory issues

**Problem**

Whenever we reuse intermediate results and there is a pipeline breaker (such as shuffles, joins, reductions, or groupby operations), it forces us to materialize the entire intermediate result (thus, it is breaking the pipelining that we could utilize for reuse for example between multiple element-wise operations).

This result materialization puts a hard limit on our ability to scale as I have observed in multiple TPC-H benchmark queries.

To illustrate this, run these two snippets on a cluster of your choice

**With full intermediate result materialization**
```python
from dask_expr.datasets import timeseries

from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # To compute mean, we have to fully materialize df, and we won't free its data
        # until we have reused the chunks to compute a partial of the sum. 
        mean = df["x"].mean()
        df[df["x"] > mean].sum().compute()
```

**Without full intermediate result materialization**
```python
from dask_expr.datasets import timeseries

from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # Compute the mean beforehand so that we don't have to keep all of `df` in memory
        mean = df["x"].mean().compute()
        df[df["x"] > mean].sum().compute()
```

**Possible solution**

The easiest approach would be to _never_ reuse any intermediate results. This has a few downsides:
* Non-deterministic functions will lead to unexpected results
* We waste _a lot_ of computational resources on recomputations

...but it will allow us to scale. 

We can certainly get smarter about intermediate result materialization, but this will require some effort depending on how smart we want to be. (There's a body of (ongoing) research and implementations in the database world we could draw from.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Reusing intermediate results causes memory issues #854

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Reusing intermediate results causes memory issues #854

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions