Remove Needless Sleeps on Node Configuration Changes in Internal Cluster Tests #76884
Conversation
…ter Tests

I noticed this recently when trying to reproduce a test failure. We're doing a lot of sleeping when validating that the cluster formed if that process is randomly slow (which it tends to be, due to disk interaction on node starts and such). By reusing the approach for waiting on a cluster state we rarely, if ever, need to get into the busy-assert loop and can remove all these sleeps, shaving off a few seconds here and there from running internal cluster tests.
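A minimal sketch of the approach, assuming the usual `ClusterStateObserver` and `PlainActionFuture` APIs; this is not the merged diff, the method and variable names (`awaitExpectedClusterSize`, `clusterService`, `logger`, `threadContext`, `expectedNodeCount`) are illustrative, and package locations vary between versions. The idea: register an observer whose predicate matches a cluster state with an elected master and the expected number of nodes, then wait on a future instead of sleeping before the busy-assert loop.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.support.PlainActionFuture;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateObserver;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.util.concurrent.ThreadContext;
import org.elasticsearch.core.TimeValue;

final class ClusterFormationWait {

    static void awaitExpectedClusterSize(ClusterService clusterService, Logger logger,
                                         ThreadContext threadContext, int expectedNodeCount) throws Exception {
        final PlainActionFuture<Void> future = PlainActionFuture.newFuture();
        // No explicit timeout here, so the observer falls back to its default (60s).
        final ClusterStateObserver observer = new ClusterStateObserver(clusterService, logger, threadContext);
        observer.waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                future.onResponse(null);
            }

            @Override
            public void onClusterServiceClose() {
                future.onFailure(new IllegalStateException("cluster service closed"));
            }

            @Override
            public void onTimeout(TimeValue timeout) {
                future.onFailure(new TimeoutException());
            }
        }, state -> state.nodes().getMasterNodeId() != null && state.nodes().getSize() == expectedNodeCount);
        // Bound the wait on the future rather than by sleeping; once this returns, the
        // subsequent assertBusy() validation usually succeeds on its first iteration.
        future.get(30, TimeUnit.SECONDS);
    }
}
```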
Pinging @elastic/es-delivery (Team:Delivery)
Jenkins run elasticsearch-ci/packaging-tests-windows-sample (unrelated windows worker randomness)
DaveCTurner left a comment
LGTM, left a couple of suggestions
test/framework/src/main/java/org/elasticsearch/test/InternalTestCluster.java
```java
@Override
public void onTimeout(TimeValue timeout) {
    future.onFailure(new TimeoutException());
```
```diff
-    future.onFailure(new TimeoutException());
+    assert false : "onTimeout called with no timeout set";
```
This isn't quite correct, I think: the default timeout on the observer is 60s, isn't it?
Oh right, so it is. I saw lots of @Nullable things and missed that one place. Still, we don't want the observer to time out, do we? We should pass null to avoid enqueueing the timeout task.
I guess it doesn't matter much, since we only wait on the future for 30s, so we'd never see those 60s elapse. But you're right, it's much cleaner not to enqueue any timeout :)
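For illustration, a sketch of the detail settled above (constructor forms as I understand them; the class and method names here are hypothetical, not the merged change): constructing the observer without an explicit timeout falls back to the 60s default, while passing null means no timeout task is ever enqueued, so `onTimeout` becomes unreachable and can simply assert.

```java
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.ClusterStateObserver;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.util.concurrent.ThreadContext;

final class ObserverWithoutTimeout {
    static ClusterStateObserver create(ClusterService clusterService, Logger logger, ThreadContext threadContext) {
        // new ClusterStateObserver(clusterService, logger, threadContext) would fall back to the
        // 60s default timeout; an explicit null disables the timeout entirely, so the bounded
        // wait can live solely on the future that the listener completes.
        return new ClusterStateObserver(clusterService, null, logger, threadContext);
    }
}
```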
Thanks David!
…ter Tests (elastic#76884)
…ter Tests (#76884) (#76908)
Since #76884 in `InternalTestCluster#validateClusterFormed` we wait for a correctly-sized cluster state to be applied before entering the `assertBusy()` loop to wait for the cluster state to be exactly right everywhere. Today we do this by injecting a cluster state observer into one of the nodes which waits for a cluster state containing a master and the right number of nodes. With this commit we move to using the cluster health API which can do the same thing. By this point any extra nodes have stopped, but there might still be a stale join request for one of those nodes in the master's queue. This commit addresses this by also waiting for the master queue to be empty. Closes #81830
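A minimal sketch of this follow-up approach, assuming the standard cluster health request builders; `ClusterHealthWait`, `client` and `expectedNodeCount` are illustrative names, and exact constructors and package paths vary by version. Waiting for LANGUID-priority events means the master's pending task queue has drained, which also flushes any stale join request.

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.Priority;

final class ClusterHealthWait {
    static void awaitExpectedClusterSize(Client client, int expectedNodeCount) {
        ClusterHealthResponse response = client.admin().cluster().health(
            new ClusterHealthRequest()
                .waitForNodes(Integer.toString(expectedNodeCount)) // a state with the expected node count
                .waitForEvents(Priority.LANGUID)                   // master's pending task queue is empty
        ).actionGet();
        // If the health request timed out, the cluster never reached the expected shape.
        assert response.isTimedOut() == false : "timed out waiting for cluster to form";
    }
}
```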
I randomly noticed this recently when trying to reproduce a test failure. We're doing a lot of sleeping when validating that the cluster formed if that process is randomly slow (which it tends to be, due to disk interaction on node starts and such). By reusing the approach for waiting on a cluster state we rarely, if ever, need to get into the busy-assert loop and can remove all these sleeps, shaving off a few seconds here and there from running internal cluster tests (at the cost of minimal added complexity).