Timed out cluster state publication is applied in an empty context #53751

@DaveCTurner

Today the elected master waits for all other nodes to acknowledge a cluster state publication before applying it locally, although it will time out if the other nodes are not all fast enough. The timeout is performed by a delayed action scheduled with ThreadPool#schedule at the start of the publication.

However, ThreadPool#schedule does not preserve the context of the caller, so when the timeout fires the cluster state is applied with an empty context rather than in the system context. Any cluster state applier that uses the context of the application (e.g. captures it for future use, or tries to send transport messages) will therefore not work correctly if security is enabled.
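
The effect can be reproduced outside Elasticsearch with a toy model. The sketch below is not Elasticsearch code: it uses a plain `ThreadLocal` as a stand-in for the thread context and a JDK `ScheduledExecutorService` as a stand-in for `ThreadPool#schedule`. The names `ContextDemo`, `CONTEXT`, `preserveContext` and `contextSeenByScheduledTask` are hypothetical and chosen only for illustration. It shows that a delayed task sees the default ("empty") context unless the caller's context is captured when the task is scheduled and restored when it runs, which is one possible way to avoid this class of bug and is not necessarily how it was fixed.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class ContextDemo {
    // Toy stand-in for a per-thread context (an assumption, not Elasticsearch's ThreadContext).
    // "empty" plays the role of the default context of a fresh thread.
    static final ThreadLocal<String> CONTEXT = ThreadLocal.withInitial(() -> "empty");

    // Captures the caller's context now and restores it around the task when it eventually runs.
    static Runnable preserveContext(Runnable task) {
        final String captured = CONTEXT.get();
        return () -> {
            final String previous = CONTEXT.get();
            CONTEXT.set(captured);
            try {
                task.run();
            } finally {
                CONTEXT.set(previous);
            }
        };
    }

    // Schedules a delayed task from a thread whose context is "system" and
    // returns the context that the task observed when it ran.
    static String contextSeenByScheduledTask(boolean preserve) {
        final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        final AtomicReference<String> seen = new AtomicReference<>();
        final String callerContext = CONTEXT.get();
        CONTEXT.set("system");
        try {
            final Runnable task = () -> seen.set(CONTEXT.get());
            scheduler.schedule(preserve ? preserveContext(task) : task, 10, TimeUnit.MILLISECONDS)
                .get(); // wait for the delayed task to finish
            return seen.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            CONTEXT.set(callerContext);
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // The plain scheduled task runs on the scheduler thread and sees the default context.
        System.out.println(contextSeenByScheduledTask(false)); // prints "empty"
        // The wrapped task sees the context of the thread that scheduled it.
        System.out.println(contextSeenByScheduledTask(true));  // prints "system"
    }
}
```

In this model the timeout handler corresponds to the unwrapped task: anything it does, including creating objects that capture the current context for later use, happens in the "empty" context.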

One such case was introduced in #48430: retention lease syncs now run in the context in which the IndexService was created, which happens during cluster state application. Thus if the elected master is also a data node, and the cluster state publication that assigns a shard to it times out after committing, then the retention lease syncs will fail.

If affected, the workaround is to restart the elected master.

This exposes a gap in the CoordinatorTests framework, which does not properly simulate how thread contexts behave.

Metadata

    Labels

    :Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection.)
    >bug
    Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.)
