Fix race in SLM master/cluster state listeners #59897
Merged
This change fixes two possible race conditions in SLM related to
how local master changes and cluster state events are observed. When
implementing the `LocalNodeMasterListener` interface, it is only
recommended to execute on a separate threadpool if the operations are
heavy and would block the cluster state thread. SLM specified that the
listeners should run in the Snapshot thread pool, but the operations
in the listener were lightweight. This had the side effect of causing
master changes to be delayed if the Snapshot threads were all busy and
could also potentially cause the `onMaster` and `offMaster` calls to
race if both were queued and then executed concurrently. Additionally,
the `SnapshotLifecycleService` is also a `ClusterStateListener` and
there is currently no order of operations guarantee between
`LocalNodeMasterListeners` and `ClusterStateListeners` so this could
lead to incorrect behavior.
The resolution for these two issues is that the
`SnapshotRetentionService` now specifies the `SAME` executor for its
implementation of the `LocalNodeMasterListener` interface. The
`SnapshotLifecycleService` is no longer a `LocalNodeMasterListener` and
instead tracks local master changes in its `ClusterStateListener`.

Backport of #59801
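
For illustration, the idea of tracking local master changes inside a cluster state listener can be sketched roughly as below. This is a simplified model and not the code from this PR: `MasterTrackingListener`, its nested `ClusterEvent` class, and the "schedule jobs" / "cancel jobs" action names are hypothetical stand-ins, and the sketch assumes events are delivered one at a time on a single thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a listener that derives master
// transitions from cluster state events instead of from separate
// onMaster/offMaster callbacks. Not the Elasticsearch classes.
final class MasterTrackingListener {

    // Hypothetical stand-in for a cluster state event; it only carries
    // whether the local node is the elected master in the new state.
    static final class ClusterEvent {
        final boolean localNodeMaster;

        ClusterEvent(boolean localNodeMaster) {
            this.localNodeMaster = localNodeMaster;
        }
    }

    // Last observed master status. Assumed to be touched only by the
    // single thread that delivers cluster state events.
    private boolean isMaster = false;

    // Records the actions taken, so the behavior is easy to inspect.
    private final List<String> actions = new ArrayList<>();

    void clusterChanged(ClusterEvent event) {
        boolean wasMaster = isMaster;
        isMaster = event.localNodeMaster;

        if (!wasMaster && isMaster) {
            // Transition: this node just became master.
            actions.add("schedule jobs");
        } else if (wasMaster && !isMaster) {
            // Transition: this node just stopped being master.
            actions.add("cancel jobs");
        }
        // No transition: nothing master-related to do for this event.
    }

    boolean isMaster() {
        return isMaster;
    }

    List<String> actions() {
        return actions;
    }
}
```

Because the master status is updated in the same callback that processes the cluster state, there is a single ordered stream of events, so the "became master" and "stopped being master" handling cannot run concurrently or out of order relative to the state they react to.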