Skip to content

Local validation of progress updates #107

@frankmcsherry

Description

@frankmcsherry

Ideally the progress API exposed by Operate trait, by which operators report input messages consumed, capabilities dropped and held, and output messages produced, would have some pleasant invariants about batches of updates by which we could increase our confidence in their global properties.

One intended (but not satisfied) invariant is "each newly held capability must be greater or equal to some consumed input message or held capability" and "each send output must have time greater or equal to some consumed input message or held capability".

This invariant is not currently satisfied because subgraphs eagerly report information about the future that are known to be true, but aren't justifiable yet.

For example, consider a subgraph with two inputs, managing an operator that consumes input and may hold capabilities. Imagine this graph is executed across multiple workers. One subgraph instance can receive a report from its managed operator that it has consumed a message and now holds a capability. The subgraph currently chooses to report that information upwards (it now holds a capability) but it is not yet in a position to indicate which of the two inputs the message came in through (that information is perhaps with the other worker, who performed the ingestion and sent a progress update relating this, but it has not yet been received).

Although this information is not incorrect, nor does it lead to errors in the protocol (known errors, at least), it is nonetheless confusing from the point of view of invariants maintained. The subgraph appears to be claiming a capability without consuming any input messages. There is the intent to do so in the future, and in this case we know that the only way the report of a consumed message can arrive is through the subgraph.

Perhaps the subgraph should delay this information, even though it knows it will happen, until the progress update "makes sense". It could wait until it hears about the message that justified the capability, and only then report claiming the capability; as long as the system only needs to know about the frontier, this would be the first moment it could advance to the frontier because until this point the unacknowledged message blocked the frontier.

This would allowed us to get closer to imposing invariants on each batch of updates, which should increase confidence in the protocol, at the expense of complicating the responsibility of the subgraph. On the positive side, this buffering would result in less movement of progress updates, communicating changes less frequently.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions