frankmcsherry
Member

Operator shutdown was previously fairly loose: the shutdown test ran only in response to operator activation. However, the conditions for shutdown can change without prompting an activation, e.g. if a frontier becomes empty or a final capability is dropped. As a result, operators that should have been shut down would instead linger until the dataflow itself was shut down.

This PR adds the shutdown test at the point where progress information is pushed to operators, so that operators are cleaned up mid-dataflow rather than lingering.

NB: Failing to shut down an operator should not have resulted in non-termination, unless operators were relying on dropping their state to signal something of consequence outward. All progress information would still be correct, and all downstream operators would receive correct frontiers.
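To make the change described above concrete, here is a minimal toy model in Rust. It is a sketch under stated assumptions, not timely's actual code: `OperatorState`, `Scheduler`, and all of their methods are hypothetical names invented for illustration, and the shutdown condition is assumed to be "all input frontiers empty and no capabilities held". The only name taken from this discussion is `maybe_shutdown`. The point it illustrates is that progress updates record an operator as a shutdown candidate, and candidates are tested without waiting for an activation.

```rust
// Toy model only: these types are hypothetical and do not mirror timely's internals.
use std::collections::BTreeSet;

/// Hypothetical per-operator progress state.
struct OperatorState {
    /// One frontier per input; an empty frontier means no more data can arrive there.
    input_frontiers: Vec<Vec<u64>>,
    /// Number of capabilities the operator still holds.
    held_capabilities: usize,
    /// Whether the operator has already been shut down.
    shut_down: bool,
}

impl OperatorState {
    /// Assumed shutdown condition: every input frontier is empty and
    /// the final capability has been dropped.
    fn can_shut_down(&self) -> bool {
        self.held_capabilities == 0 && self.input_frontiers.iter().all(|f| f.is_empty())
    }
}

/// Hypothetical scheduler that tracks shutdown candidates.
struct Scheduler {
    operators: Vec<OperatorState>,
    /// Operators whose progress state changed since the last check.
    maybe_shutdown: BTreeSet<usize>,
}

impl Scheduler {
    /// Push a new frontier to one input of an operator and record the
    /// operator as a shutdown candidate, whether or not it gets activated.
    fn push_frontier(&mut self, op: usize, input: usize, frontier: Vec<u64>) {
        self.operators[op].input_frontiers[input] = frontier;
        self.maybe_shutdown.insert(op);
    }

    /// Record that an operator dropped one capability; also a candidate.
    fn drop_capability(&mut self, op: usize) {
        let state = &mut self.operators[op];
        state.held_capabilities = state.held_capabilities.saturating_sub(1);
        self.maybe_shutdown.insert(op);
    }

    /// Test every candidate and shut down those that qualify.
    /// Returns the indices of the operators shut down by this call.
    fn shut_down_candidates(&mut self) -> Vec<usize> {
        let candidates = std::mem::take(&mut self.maybe_shutdown);
        let mut stopped = Vec::new();
        for op in candidates {
            let state = &mut self.operators[op];
            if !state.shut_down && state.can_shut_down() {
                state.shut_down = true;
                stopped.push(op);
            }
        }
        stopped
    }
}
```

In this model, an operator whose only input frontier becomes empty stays alive while it still holds a capability, and is shut down on the first check after that capability is dropped, with no activation involved.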

@lluki

lluki commented Nov 28, 2022

Unfortunately, it doesn't fix the operator leak seen when running TPC-H loadgen plus a materialized view with query 14. This is the situation after running drop materialized view q14; and waiting 30s:

[Image: pr488]

and this is mz_dataflow_operator_dataflows:

materialize=> set database to tpch;
SET
materialize=> drop materialized view q14;
DROP MATERIALIZED VIEW
materialize=> select * from mz_internal.mz_dataflow_operator_dataflows;
 id  |                   name                   | worker_id | dataflow_id |   dataflow_name   
-----+------------------------------------------+-----------+-------------+-------------------
 333 | Map                                      | 0         | 188         | Dataflow: 2.6.q14
 329 | FlatMap                                  | 0         | 188         | Dataflow: 2.6.q14
 331 | Exchange                                 | 0         | 188         | Dataflow: 2.6.q14
 326 | InspectBatch                             | 0         | 188         | Dataflow: 2.6.q14
 338 | InspectBatch                             | 0         | 188         | Dataflow: 2.6.q14
 345 | Dataflow: 2.6.q14                        | 0         | 188         | Dataflow: 2.6.q14
 188 | Dataflow: 2.6.q14                        | 0         | 188         | Dataflow: 2.6.q14
 328 | persist_sink u11 write_batches           | 0         | 188         | Dataflow: 2.6.q14
 340 | persist_sink u11 append_batches          | 0         | 188         | Dataflow: 2.6.q14
 323 | persist_sink u11 mint_batch_descriptions | 0         | 188         | Dataflow: 2.6.q14
(10 rows)

materialize=> show materialized views;
 name | cluster 
------+---------
(0 rows)

materialize=> 

@frankmcsherry
Member Author

Ah, yes, this wasn't necessarily meant to fix that. It does fix a leak for e.g. simple.rs, and I'm happy to crack open the same example in MZ (the one based on generate_series(0, large)).

@lluki

lluki commented Nov 28, 2022

List of operators during various stages of the TPC-H run: ops.txt

@frankmcsherry
Member Author

Great, that appears to have the desired outcome, as we think that the remaining operators are the ones that haven't shut down, for whatever reason.

Member

@antiguru antiguru left a comment

This seems fine, as explained in the office hours!

One thing to keep in mind is that we could restrict the nodes added to maybe_shutdown to operators that either have no inputs or have the notify bit turned off.
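The suggested restriction could be sketched as a simple predicate. This is illustrative only: `OperatorInfo` and `needs_shutdown_check` are hypothetical names, not timely's API. The reasoning behind it is also an assumption on my part: operators with inputs and the notify bit on are presumably activated when their frontiers change, so the existing activation-time test would already cover them.

```rust
// Hypothetical sketch; these names are not part of timely's API.

/// Minimal description of an operator for the purpose of this check.
struct OperatorInfo {
    /// Number of inputs the operator has.
    inputs: usize,
    /// Whether the operator asks to be notified (activated) on frontier changes.
    notify: bool,
}

/// Returns true if the operator should be added to `maybe_shutdown`:
/// either it has no inputs, or its notify bit is turned off.
fn needs_shutdown_check(op: &OperatorInfo) -> bool {
    op.inputs == 0 || !op.notify
}
```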

@frankmcsherry frankmcsherry merged commit 1cf7b4a into TimelyDataflow:master Jan 9, 2023
@github-actions github-actions bot mentioned this pull request Oct 29, 2024