Skip to content

Conversation

@jaymode
Copy link
Member

@jaymode jaymode commented Jul 20, 2020

This change fixes two possible race conditions in SLM related to
how local master changes and cluster state events are observed. When
implementing the LocalNodeMasterListener interface, it is only
recommended to execute on a separate threadpool if the operations are
heavy and would block the cluster state thread. SLM specified that the
listeners should run in the Snapshot thread pool, but the operations
in the listener were lightweight. This had the side effect of causing
master changes to be delayed if the Snapshot threads were all busy and
could also potentially cause the onMaster and offMaster calls to
race if both were queued and then executed concurrently. Additionally,
the SnapshotLifecycleService is also a ClusterStateListener and
there is currently no order of operations guarantee between
LocalNodeMasterListeners and ClusterStateListeners so this could
lead to incorrect behavior.

The resolution for these two issues is that the
SnapshotRetentionService now specifies the SAME executor for its
implementation of the LocalNodeMasterListener interface. The
SnapshotLifecycleService is no longer a LocalNodeMasterListener and
instead tracks local master changes in its ClusterStateListner.

Backport of #59801

This change fixes two possible race conditions in SLM related to
how local master changes and cluster state events are observed. When
implementing the `LocalNodeMasterListener` interface, it is only
recommended to execute on a separate threadpool if the operations are
heavy and would block the cluster state thread. SLM specified that the
listeners should run in the Snapshot thread pool, but the operations
in the listener were lightweight. This had the side effect of causing
master changes to be delayed if the Snapshot threads were all busy and
could also potentially cause the `onMaster` and `offMaster` calls to
race if both were queued and then executed concurrently. Additionally,
the `SnapshotLifecycleService` is also a `ClusterStateListener` and
there is currently no order of operations guarantee between
`LocalNodeMasterListeners` and `ClusterStateListeners` so this could
lead to incorrect behavior.

The resolution for these two issues is that the
SnapshotRetentionService now specifies the `SAME` executor for its
implementation of the `LocalNodeMasterListener` interface. The
`SnapshotLifecycleService` is no longer a `LocalNodeMasterListener` and
instead tracks local master changes in its `ClusterStateListner`.
@jaymode jaymode merged commit f2120df into elastic:7.9 Jul 20, 2020
@jaymode jaymode deleted the slm_listeners_79 branch July 20, 2020 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant