Conversation

@masseyke
Member

This fixes some possible race conditions in the cluster formation polling of the stable master code. It also prevents the list of tasks from growing indefinitely.

Contributor

@andreidan left a comment

Thanks for fixing this, Keith.

Left a few suggestions.

beginPollingClusterFormationInfo();
} else {
cancelPollingClusterFormationInfo();
if (disableAutoPollingForTestMode == false) {
Contributor

I think the need for this flag arises from some invariants in the integration test that don't hold anymore, e.g. we stabilise the cluster and still expect the polling to run.

I think we have 3 options here:

  1. Since CoordinationDiagnosticsServiceTests is an integration test, we should modify it to work based on "real life" invariants. One of these invariants is that if the cluster has a stable master, the polling track will be cancelled. Maybe we shouldn't stabilise the cluster, and should run some of the assertions in an assertBusy block while the polling track is expected to run.
  2. Add the ability to register a listener that gets called whenever we start and cancel the polling track. This would allow the test to register a custom listener and make assertions when the polling track is cancelled (i.e. the cluster is stable).
  3. Add unit tests and control what events get sent to clusterChanged (a rough sketch follows this list).
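
For option 3, here's a rough sketch of driving clusterChanged directly. The types are real Elasticsearch classes, but the service wiring and state-building are illustrative only, not the actual test:

DiscoveryNode localNode = new DiscoveryNode("node1", buildNewFakeTransportAddress(), Version.CURRENT);
ClusterState withMaster = ClusterState.builder(new ClusterName("test"))
    .nodes(DiscoveryNodes.builder().add(localNode).localNodeId(localNode.getId()).masterNodeId(localNode.getId()))
    .build();
ClusterState noMaster = ClusterState.builder(withMaster)
    .nodes(DiscoveryNodes.builder(withMaster.nodes()).masterNodeId(null))
    .build();
// losing the master should start the polling track:
coordinationDiagnosticsService.clusterChanged(new ClusterChangedEvent("test", noMaster, withMaster));
// regaining it should cancel the track; assert on clusterFormationInfoTasks at each step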

IMO we probably want both 1 and 3.

What do you think?

Member Author

I think it's trickier than you would imagine. :) This is a little more than a unit test, but it's a long way from being an integration test. Using AbstractCoordinatorTestCase$Cluster, you unfortunately don't get realistic PeerFinder behavior. That is why the test currently passes in the list of master-eligible nodes to query (and why it's important to disable auto-polling: the next time polling begins, it will fetch the list of master-eligible nodes from PeerFinder). Previously this worked fine because we were effectively passing in our own clusterFormationInfoTasks, and it didn't matter that the object-level one was being overwritten, because the code we were exercising didn't use it. Now I'm explicitly disabling the automatic call to begin polling because the code being tested actually uses the object-level clusterFormationInfoTasks.
It's also not possible with this code to write an integration test, because there is no externally-visible functionality -- it's just populating a Map in memory. I've got an integration test that exercises this map in the branch that adds the master stability logic.
I can delete this slightly-more-than-unit test and try to add a unit test that doesn't use AbstractCoordinatorTestCase$Cluster, but the last time I went down this path I found myself mocking so much that the test was fairly meaningless. But the current unit test does a pretty good job of testing functionality and I kind of hate to lose it.

Member Author

Just documenting the reasons unit testing this is a bear as I go down the path again (mainly for myself, so I don't do this a third time in the future) --
Setting up a unit test that calls beginPollingClusterFormationInfo, cancelPollingClusterFormationInfo, and/or clusterChanged is pretty easy, and we can verify that what is in clusterFormationInfoTasks is correct (that is, whether it's null or not, and whether it has the expected number of tasks).
Actually getting those tasks to run and populate clusterFormationResponses requires a custom TaskQueue (one that will run tasks immediately instead of waiting 10 seconds). We can borrow DeterministicTaskQueue for this, and that works well. But then we need a TransportService. You can't just use Mockito for TransportService, though, because TransportService::sendRequest is final, and Mockito can't mock final methods (at least not the version we use -- maybe newer versions can). We can use DisruptableMockTransport for this, but it's abstract and requires implementing several methods. There's a DisruptableMockTransport already implemented in AbstractCoordinatorTestCase. But in order to use that one, we have to use so much of AbstractCoordinatorTestCase and AbstractCoordinatorTestCase$Cluster that we're basically back where we started. I'll try implementing a custom DisruptableMockTransport and see where that goes.
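
For reference, the DeterministicTaskQueue half really is the easy part. A minimal sketch (pollOnce is a stand-in for the real fetch task, and the no-arg constructor may differ between versions):

DeterministicTaskQueue deterministicTaskQueue = new DeterministicTaskQueue();
ThreadPool threadPool = deterministicTaskQueue.getThreadPool();
// schedule the fetch the way the service does, with a 10-second delay:
threadPool.schedule(this::pollOnce, TimeValue.timeValueSeconds(10), ThreadPool.Names.GENERIC);
// then run it immediately instead of waiting on the wall clock:
deterministicTaskQueue.advanceTime();
deterministicTaskQueue.runAllRunnableTasks();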

Member Author

@masseyke Jul 28, 2022

I got pretty far into creating a DisruptableMockTransport implementation, and realized that I was basically reimplementing AbstractCoordinatorTestCase$Cluster (just with far fewer features). I think AbstractCoordinatorTestCase$Cluster is really our best option.

Contributor

Thanks for the details, Keith.

How about we create a disruption test (similar to RareClusterStateIT) that blocks cluster state processing (among other things, simulating what this flag is trying to achieve)?

https://gist.github.com/andreidan/6b1eb6369cb78060cbf6e8111e6e48a5
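
Roughly like this (BlockClusterStateProcessing is the same helper RareClusterStateIT uses; the assertions themselves are placeholders):

String masterName = internalCluster().getMasterName();
BlockClusterStateProcessing disruption = new BlockClusterStateProcessing(masterName, random());
internalCluster().setDisruptionScheme(disruption);
disruption.startDisrupting();
// while cluster state processing is blocked, no clusterChanged events arrive,
// so we can assert on the polling state without it being cancelled under us
disruption.stopDisrupting();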

/*
* There is a slight risk here that a new Cancellable is added to clusterFormationInfoTasks after we begin iterating in the next
* line. We are calling this an acceptable risk because it will result in an un-cancelled un-cancellable task, but it will not
* reschedule itself so it will not be around long. It is possible that a cancellable will be called concurrently by multiple
Contributor

I'm not sure I understand what "It is possible that a cancellable will be called concurrently by multiple threads." means. Do you mean a new cancellable will be scheduled?

Member Author

Oh sorry. That should read "It is possible that cancel() will be called on a Cancellable concurrently by multiple threads." I'll fix that.

Comment on lines 937 to 938
* The cluster has now run normally for some period of time, so check that the outputs of
* beginPollingClusterFormationInfo() are present with no exceptions:
Contributor

I think this is not true anymore and we shouldn't rely on this condition: the polling was cancelled at some point after the cluster was stabilised.

)
);
if (clusterFormationInfoTasks != null) {
if (cancellables.equals(clusterFormationInfoTasks)) {
Contributor

Shall we document the why behind this if statement? Something along the lines of:

// It can happen that polling for cluster formation info was started (1), then cancelled, and a new track started (2)
// before we got a response from the first (1) polling track.
// This checks that we are still the currently running polling track.

responseConsumer.andThen(rescheduleFetchConsumer(masterEligibleNode, responseConsumer, cancellables))
)
);
} else {
Contributor

Shall we document the why behind this if statement? Something along the lines of:

// If we are no longer the "current" polling track, due to a quick succession of a polling track being started (1),
// cancelled, and a new one started (2), cancel the old (1) polling track's cancellables.

} else {
cancellables.values().forEach(Scheduler.Cancellable::cancel);
}
} else {
Contributor

Shall we document the why behind this if statement? Something along the lines of:

// If the polling track was cancelled but some cancellables were scheduled concurrently, cancel the
// cancellables this track scheduled that weren't necessarily cancelled.
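
Taken together, the suggested shape is roughly this (a condensed sketch of the hunks quoted above; fetchClusterFormationInfo stands in for whatever method actually schedules the remote call, so treat the names as approximate):

if (clusterFormationInfoTasks != null) {
    if (cancellables.equals(clusterFormationInfoTasks)) {
        // still the current polling track: fetch again and reschedule ourselves
        cancellables.put(
            masterEligibleNode,
            fetchClusterFormationInfo(
                masterEligibleNode,
                responseConsumer.andThen(rescheduleFetchConsumer(masterEligibleNode, responseConsumer, cancellables))
            )
        );
    } else {
        // a newer polling track has replaced this one: cancel this track's cancellables
        cancellables.values().forEach(Scheduler.Cancellable::cancel);
    }
} else {
    // polling was cancelled entirely, but this task had already been scheduled:
    // cancel anything this track scheduled that wasn't already cancelled
    cancellables.values().forEach(Scheduler.Cancellable::cancel);
}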

@masseyke marked this pull request as ready for review July 29, 2022 19:34
@elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Jul 29, 2022
@masseyke added the :Data Management/Health and >non-issue labels and removed the needs:triage label Jul 29, 2022
@masseyke
Member Author

@elasticsearchmachine update branch

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label Jul 29, 2022
@masseyke
Member Author

@elasticsearchmachine run elasticsearch-ci/packaging-tests-unix-sample

@masseyke requested a review from andreidan July 29, 2022 19:40
Contributor

@andreidan left a comment

Thanks for fixing this, Keith.

LGTM

@jbaiera could you please have a look as well?

Comment on lines +695 to +700
/*
* If cancellables is not the same as clusterFormationInfoTasks, that means that the current polling track has been
* cancelled and a new polling track has been started. So we don't want to run anything new, and we want to cancel
* anything that might still be running in our cancellables just to be safe. Note that it is possible for
* clusterFormationInfoTasks to be null at this point (since it is assigned in a different thread), so it is important
* that we don't call equals on it.
Contributor

❤️

Co-authored-by: Andrei Dan <[email protected]>
@masseyke merged commit 579692d into elastic:main Aug 1, 2022
@masseyke deleted the fix/polling-race-conditions branch August 1, 2022 14:29
Member

@jbaiera left a comment

Looks much cleaner, thanks for taking this on.

LGTM post merge

masseyke added a commit to masseyke/elasticsearch that referenced this pull request Aug 1, 2022
elasticsearchmachine pushed a commit that referenced this pull request Aug 1, 2022
* backporting #88874

* Eliminating initial delay of CoordinationDiagnosticsService#beginPollingClusterFormationInfo for integration tests (#89001)
@mark-vieira mark-vieira added v8.4.0 and removed v8.4.1 labels Aug 15, 2022