Fix race in SLM master/cluster state listeners #59801

jaymode · 2020-07-17T16:27:58Z

This change fixes two possible race conditions in SLM related to
how local master changes and cluster state events are observed. When
implementing the LocalNodeMasterListener interface, it is only
recommended to execute on a separate threadpool if the operations are
heavy and would block the cluster state thread. SLM specified that the
listeners should run in the Snapshot thread pool, but the operations
in the listener were lightweight. This had the side effect of causing
master changes to be delayed if the Snapshot threads were all busy and
could also potentially cause the onMaster and offMaster calls to
race if both were queued and then executed concurrently. Additionally,
the SnapshotLifecycleService is also a ClusterStateListener and
there is currently no order of operations guarantee between
LocalNodeMasterListeners and ClusterStateListeners so this could
lead to incorrect behavior.

The resolution for these two issues is that the
SnapshotRetentionService now specifies the SAME executor for its
implementation of the LocalNodeMasterListener interface. The
SnapshotLifecycleService is no longer a LocalNodeMasterListener and
instead tracks local master changes in its ClusterStateListener.
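The idea of tracking local master changes inside a single cluster state callback can be sketched as follows. This is a minimal illustration, not the code from this PR: `StateEvent`, `Listener`, and the `log` list are hypothetical stand-ins with no Elasticsearch dependency, and the sketch assumes the callback is always invoked sequentially on one applier thread.

```java
import java.util.ArrayList;
import java.util.List;

final class MasterTrackingDemo {

    // Hypothetical stand-in for a cluster state event (not the Elasticsearch class).
    static final class StateEvent {
        final boolean localNodeMaster;

        StateEvent(boolean localNodeMaster) {
            this.localNodeMaster = localNodeMaster;
        }
    }

    // A single listener that derives master transitions from consecutive events,
    // so no ordering between two separate listener types is needed.
    static final class Listener {
        private boolean isMaster = false;
        final List<String> log = new ArrayList<>();

        // Assumed to be called sequentially on one thread, never concurrently.
        void clusterChanged(StateEvent event) {
            boolean wasMaster = isMaster;
            isMaster = event.localNodeMaster;
            if (!wasMaster && isMaster) {
                log.add("onMaster");   // this node was just elected master
            } else if (wasMaster && !isMaster) {
                log.add("offMaster");  // this node just lost mastership
            }
            // Any further handling of the event sees the up-to-date isMaster flag.
        }

        boolean isMaster() {
            return isMaster;
        }
    }

    public static void main(String[] args) {
        Listener listener = new Listener();
        for (boolean master : new boolean[] {false, true, true, false}) {
            listener.clusterChanged(new StateEvent(master));
        }
        System.out.println(listener.log); // prints [onMaster, offMaster]
    }
}
```

Because the transition check and the rest of the event handling happen in the same callback, the master flag cannot lag behind the cluster state the listener is currently processing.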


elasticmachine · 2020-07-17T16:28:15Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

dakrone

LGTM, thanks Jay

jaymode · 2020-07-17T17:51:58Z

@elasticmachine update branch



This commit continues the work in elastic#59801 and makes other implementors of the `LocalNodeMasterListener` interface thread safe: their callbacks can no longer run on different threads and race each other. This also helps address other issues where these events could be queued to wait for execution while the service keeps moving forward thinking it is the master even when that is not the case. To accomplish this, `LocalNodeMasterListener` now provides a default implementation of `executorName()`, and the javadocs have been updated to indicate the dangers of using an executor that could execute the listeners concurrently. Each use was inspected and, if the class was also a `ClusterStateListener`, the implementation of `LocalNodeMasterListener` was removed in favor of a single listener that combines the logic. A single listener is used because there is currently no guarantee on execution order between `ClusterStateListener`s and `LocalNodeMasterListener`s, so a future change there could cause undesired consequences. For other classes, the implementations of the callbacks were inspected and, if the operations were lightweight, the overridden `executorName` method was removed to use the default, which runs on the same thread.
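A default `executorName()` that selects same-thread execution might look roughly like the sketch below. It is illustrative only: the `MasterListener` interface, the `dispatch` helper, and the `SAME` constant are simplified stand-ins written for this example, not the Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

final class DefaultExecutorDemo {

    // Stand-in for the name of the "run on the calling thread" executor.
    static final String SAME = "same";

    // Simplified listener interface with a same-thread default.
    interface MasterListener {
        void onMaster();

        void offMaster();

        // Implementors only override this if their callbacks are heavy.
        default String executorName() {
            return SAME;
        }
    }

    // Hypothetical dispatcher: runs inline for SAME, otherwise hands off to the pool.
    static void dispatch(MasterListener listener, boolean nowMaster, Executor pool) {
        Runnable callback = nowMaster ? listener::onMaster : listener::offMaster;
        if (SAME.equals(listener.executorName())) {
            callback.run();         // ordered: finishes before dispatch returns
        } else {
            pool.execute(callback); // may be delayed or run concurrently
        }
    }

    // Lightweight listener that relies on the default executor.
    static final class RecordingListener implements MasterListener {
        final List<String> calls = new ArrayList<>();

        @Override
        public void onMaster() {
            calls.add("onMaster");
        }

        @Override
        public void offMaster() {
            calls.add("offMaster");
        }
    }

    public static void main(String[] args) {
        RecordingListener listener = new RecordingListener();
        // The pool must never be touched when the default executor is in effect.
        Executor pool = task -> {
            throw new IllegalStateException("pool should not be used");
        };
        dispatch(listener, true, pool);
        dispatch(listener, false, pool);
        System.out.println(listener.calls); // prints [onMaster, offMaster]
    }
}
```

With the same-thread default, each callback completes before the next one is dispatched, so `onMaster` and `offMaster` cannot be queued together and then run concurrently.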

This commit continues the work in #59801 and makes other implementors of the LocalNodeMasterListener interface thread safe: their callbacks can no longer run on different threads and race each other. This also helps address other issues where these events could be queued to wait for execution while the service keeps moving forward thinking it is the master even when that is not the case. To accomplish this, the LocalNodeMasterListener no longer has the executorName() method, which prevents future uses that could encounter this surprising behavior. Each use was inspected and, if the class was also a ClusterStateListener, the implementation of LocalNodeMasterListener was removed in favor of a single listener that combines the logic. A single listener is used because there is currently no guarantee on execution order between ClusterStateListeners and LocalNodeMasterListeners, so a future change there could cause undesired consequences. For other classes, the implementations of the callbacks were inspected and, if the operations were lightweight, the overridden executorName method was removed to use the default, which runs on the same thread.


jaymode added the >bug, :Data Management/ILM+SLM, Team:Data Management, v8.0.0, v7.10.0, and v7.9.1 labels Jul 17, 2020

jaymode requested a review from dakrone July 17, 2020 16:27

dakrone approved these changes Jul 17, 2020


Merge branch 'master' into slm_listeners

dc56364

jaymode merged commit c41ac5f into elastic:master Jul 20, 2020

jaymode deleted the slm_listeners branch July 20, 2020 15:09

jaymode added the backport pending label Jul 20, 2020

jaymode mentioned this pull request Jul 20, 2020

Fix race in SLM master/cluster state listeners #59896

Merged

jaymode mentioned this pull request Jul 20, 2020

Fix race in SLM master/cluster state listeners #59897

Merged

jaymode removed the backport pending label Jul 20, 2020

jaymode mentioned this pull request Jul 20, 2020

Thread safe clean up of LocalNodeModeListeners #59932

Merged

jaymode mentioned this pull request Jul 21, 2020

Thread safe clean up of LocalNodeModeListeners #60007

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
