Skip to content

Reusing intermediate results causes memory issues #854

@hendrikmakait

Description

@hendrikmakait

Problem

Whenever we reuse intermediate results and there is a pipeline breaker (such as shuffles, joins, reductions, or groupby operations), it forces us to materialize the entire intermediate result (thus, it is breaking the pipelining that we could utilize for reuse for example between multiple element-wise operations).

This result materialization puts a hard limit on our ability to scale as I have observed in multiple TPC-H benchmark queries.

To illustrate this, run these two snippets on a cluster of your choice

With full intermediate result materialization

from dask_expr.datasets import timeseries

from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # To compute mean, we have to fully materialize df, and we won't free its data
        # until we have reused the chunks to compute a partial of the sum. 
        mean = df["x"].mean()
        df[df["x"] > mean].sum().compute()

Without full intermediate result materialization

from dask_expr.datasets import timeseries

from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # Compute the mean beforehand so that we don't have to keep all of `df` in memory
        mean = df["x"].mean().compute()
        df[df["x"] > mean].sum().compute()

Possible solution

The easiest approach would be to never reuse any intermediate results. This has a few downsides:

  • Non-deterministic functions will lead to unexpected results
  • We waste a lot of computational resources on recomputations

...but it will allow us to scale.

We can certainly get smarter about intermediate result materialization, but this will require some effort depending on how smart we want to be. (There's a body of (ongoing) research and implementations in the database world we could draw from.)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions