Conversation

@masseyke
Member

This fixes some possible race conditions in the cluster formation polling of the stable master code. It also prevents the list of tasks from growing indefinitely.

Contributor

@andreidan left a comment

Thanks for fixing this, Keith.

Left a few suggestions.

beginPollingClusterFormationInfo();
} else {
cancelPollingClusterFormationInfo();
if (disableAutoPollingForTestMode == false) {
Contributor

I think the need for this flag arises from some invariants in the integration test that don't hold anymore, e.g. we stabilise the cluster and still expect the polling to run.

I think we have 3 options here:

  1. Since CoordinationDiagnosticsServiceTests is an integration test, we should modify it to work based on "real life" invariants. One of these invariants is that if the cluster has a stable master, the polling track will be cancelled. Maybe we shouldn't stabilise the cluster, and should run some of the assertions in an assertBusy block while the polling track is expected to run.
  2. Add the ability to register a listener that gets called whenever we start and cancel the polling track. This would allow the test to register a custom listener and make assertions when the polling track is cancelled (i.e. the cluster is stable).
  3. Add unit tests and control what events get sent to clusterChanged (a rough sketch follows this list).
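
For option 3, here's a rough sketch of driving clusterChanged directly. The types are real Elasticsearch classes, but the service wiring and state-building are illustrative only, not the actual test:

DiscoveryNode localNode = new DiscoveryNode("node1", buildNewFakeTransportAddress(), Version.CURRENT);
ClusterState withMaster = ClusterState.builder(new ClusterName("test"))
    .nodes(DiscoveryNodes.builder().add(localNode).localNodeId(localNode.getId()).masterNodeId(localNode.getId()))
    .build();
ClusterState noMaster = ClusterState.builder(withMaster)
    .nodes(DiscoveryNodes.builder(withMaster.nodes()).masterNodeId(null))
    .build();
// losing the master should start the polling track:
coordinationDiagnosticsService.clusterChanged(new ClusterChangedEvent("test", noMaster, withMaster));
// regaining it should cancel the track; assert on clusterFormationInfoTasks at each step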

IMO we probably want both 1 and 3.

What do you think?

Member Author

I think it's trickier than you would imagine. :) This is a little more than a unit test, but it's a long way from being an integration test. Using AbstractCoordinatorTestCase$Cluster, you unfortunately don't get realistic PeerFinder behavior. That is why the test currently passes in the list of master-eligible nodes to query (and why it's important to disable auto-polling: the next time polling begins, it will fetch the list of master-eligible nodes from PeerFinder). Previously this worked fine because we were effectively passing in our own clusterFormationInfoTasks, and it didn't matter that the object-level one was being overwritten, because the code we were exercising didn't use it. Now I'm explicitly disabling the automatic call to begin polling because the code being tested actually uses the object-level clusterFormationInfoTasks.
It's also not possible with this code to write an integration test, because there is no externally-visible functionality -- it's just populating a Map in memory. I've got an integration test that exercises this map in the branch that adds the master stability logic.
I can delete this slightly-more-than-unit test and try to add a unit test that doesn't use AbstractCoordinatorTestCase$Cluster, but the last time I went down this path I found myself mocking so much that the test was fairly meaningless. But the current unit test does a pretty good job of testing functionality and I kind of hate to lose it.

Member Author

Just documenting the reasons unit testing this is a bear as I go down the path again (mainly for myself, so I don't do this a third time in the future) --
Setting up a unit test that calls beginPollingClusterFormationInfo, cancelPollingClusterFormationInfo, and/or clusterChanged is pretty easy, and we can verify that what is in clusterFormationInfoTasks is correct (that is, whether it's null or not, and whether it has the expected number of tasks).
Actually getting those tasks to run and populate clusterFormationResponses requires a custom TaskQueue (one that will run tasks immediately instead of waiting 10 seconds). We can borrow DeterministicTaskQueue for this, and that works well. But then we need a TransportService. You can't just use Mockito for TransportService, though, because TransportService::sendRequest is final, and Mockito can't mock final methods (at least not the version we use -- maybe newer versions can). We can use DisruptableMockTransport for this, but it's abstract and requires implementing several methods. There's a DisruptableMockTransport already implemented in AbstractCoordinatorTestCase. But in order to use that one, we have to use so much of AbstractCoordinatorTestCase and AbstractCoordinatorTestCase$Cluster that we're basically back where we started. I'll try implementing a custom DisruptableMockTransport and see where that goes.
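
For reference, the DeterministicTaskQueue half really is the easy part. A minimal sketch (pollOnce is a stand-in for the real fetch task, and the no-arg constructor may differ between versions):

DeterministicTaskQueue deterministicTaskQueue = new DeterministicTaskQueue();
ThreadPool threadPool = deterministicTaskQueue.getThreadPool();
// schedule the fetch the way the service does, with a 10-second delay:
threadPool.schedule(this::pollOnce, TimeValue.timeValueSeconds(10), ThreadPool.Names.GENERIC);
// then run it immediately instead of waiting on the wall clock:
deterministicTaskQueue.advanceTime();
deterministicTaskQueue.runAllRunnableTasks();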

Member Author

@masseyke Jul 28, 2022

I got pretty far into creating a DisruptableMockTransport implementation, and realized that I was basically reimplementing AbstractCoordinatorTestCase$Cluster (just with far fewer features). I think AbstractCoordinatorTestCase$Cluster is really our best option.

Contributor

Thanks for the details, Keith.

How about we create a disruption test (similar to RareClusterStateIT) that blocks cluster state processing (among other things, simulating what this flag is trying to achieve)?

https://gist.github.com/andreidan/6b1eb6369cb78060cbf6e8111e6e48a5
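
Roughly like this (BlockClusterStateProcessing is the same helper RareClusterStateIT uses; the assertions themselves are placeholders):

String masterName = internalCluster().getMasterName();
BlockClusterStateProcessing disruption = new BlockClusterStateProcessing(masterName, random());
internalCluster().setDisruptionScheme(disruption);
disruption.startDisrupting();
// while cluster state processing is blocked, no clusterChanged events arrive,
// so we can assert on the polling state without it being cancelled under us
disruption.stopDisrupting();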

/*
* There is a slight risk here that a new Cancellable is added to clusterFormationInfoTasks after we begin iterating in the next
* line. We are calling this an acceptable risk because it will result in an un-cancelled un-cancellable task, but it will not
* reschedule itself so it will not be around long. It is possible that a cancellable will be called concurrently by multiple
Contributor

I'm not sure I understand what "It is possible that a cancellable will be called concurrently by multiple threads." means. Do you mean a new cancellable will be scheduled?

Member Author

Oh sorry. That should read "It is possible that cancel() will be called on a Cancellable concurrently by multiple threads." I'll fix that.

Comment on lines 937 to 938
* The cluster has now run normally for some period of time, so check that the outputs of
* beginPollingClusterFormationInfo() are present with no exceptions:
Contributor

I think this is not true anymore and we shouldn't rely on this condition: the polling was cancelled at some point after the cluster was stabilised.

)
);
if (clusterFormationInfoTasks != null) {
if (cancellables.equals(clusterFormationInfoTasks)) {
Contributor

Shall we document the why behind this if statement? Something along the lines of:

// It can happen that polling for cluster formation info was started (1), then cancelled, and a new track started (2)
// before we got a response from the first (1) polling track.
// This checks that we are still the currently running polling track.

responseConsumer.andThen(rescheduleFetchConsumer(masterEligibleNode, responseConsumer, cancellables))
)
);
} else {
Contributor

Shall we document the why behind this if statement? Something along the lines of:

// If we are no longer the "current" polling track, due to a quick succession of a polling track being started (1),
// cancelled, and a new one started (2), cancel the old (1) polling track's cancellables.

} else {
cancellables.values().forEach(Scheduler.Cancellable::cancel);
}
} else {
Contributor

Shall we document the why behind this if statement? Something along the lines of:

// If the polling track was cancelled but some cancellables were scheduled concurrently, cancel the
// cancellables this track scheduled that weren't necessarily cancelled.
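
Taken together, the suggested shape is roughly this (a condensed sketch of the hunks quoted above; fetchClusterFormationInfo stands in for whatever method actually schedules the remote call, so treat the names as approximate):

if (clusterFormationInfoTasks != null) {
    if (cancellables.equals(clusterFormationInfoTasks)) {
        // still the current polling track: fetch again and reschedule ourselves
        cancellables.put(
            masterEligibleNode,
            fetchClusterFormationInfo(
                masterEligibleNode,
                responseConsumer.andThen(rescheduleFetchConsumer(masterEligibleNode, responseConsumer, cancellables))
            )
        );
    } else {
        // a newer polling track has replaced this one: cancel this track's cancellables
        cancellables.values().forEach(Scheduler.Cancellable::cancel);
    }
} else {
    // polling was cancelled entirely, but this task had already been scheduled:
    // cancel anything this track scheduled that wasn't already cancelled
    cancellables.values().forEach(Scheduler.Cancellable::cancel);
}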

@masseyke marked this pull request as ready for review July 29, 2022 19:34
@elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Jul 29, 2022
@masseyke added the :Data Management/Health and >non-issue labels and removed the needs:triage label Jul 29, 2022
@masseyke
Member Author

@elasticsearchmachine update branch

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label Jul 29, 2022
@masseyke
Member Author

@elasticsearchmachine run elasticsearch-ci/packaging-tests-unix-sample

@masseyke requested a review from andreidan July 29, 2022 19:40
Contributor

@andreidan left a comment

Thanks for fixing this, Keith.

LGTM

@jbaiera could you please have a look as well?

Comment on lines +695 to +700
/*
* If cancellables is not the same as clusterFormationInfoTasks, that means that the current polling track has been
* cancelled and a new polling track has been started. So we don't want to run anything new, and we want to cancel
* anything that might still be running in our cancellables just to be safe. Note that it is possible for
* clusterFormationInfoTasks to be null at this point (since it is assigned in a different thread), so it is important
* that we don't call equals on it.
Contributor

❤️

Co-authored-by: Andrei Dan <[email protected]>
@masseyke merged commit 579692d into elastic:main Aug 1, 2022
@masseyke deleted the fix/polling-race-conditions branch August 1, 2022 14:29
Member

@jbaiera left a comment

Looks much cleaner, thanks for taking this on.

LGTM post merge

masseyke added a commit to masseyke/elasticsearch that referenced this pull request Aug 1, 2022
elasticsearchmachine pushed a commit that referenced this pull request Aug 1, 2022
* backporting #88874

* Eliminating initial delay of CoordinationDiagnosticsService#beginPollingClusterFormationInfo for integration tests (#89001)
@mark-vieira mark-vieira added v8.4.0 and removed v8.4.1 labels Aug 15, 2022