Conversation

@original-brownbear (Contributor)

I randomly noticed this recently when trying to reproduce a test failure. We do a lot of sleeping
when validating that the cluster has formed if that process happens to be slow (which it often is,
due to disk interaction on node starts and the like). By reusing the approach for waiting on a
cluster state, we rarely if ever need to enter the busy-assert loop, and we can remove all of those
sleeps, shaving a few seconds here and there off internal cluster test runs (at the cost of minimal
added complexity).
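
For illustration, a minimal sketch of such an observer-based wait, assuming access to a node's
ClusterService: the class and method names, the predicate, and the 30s bound are assumptions for
this sketch rather than the PR's exact code, and import paths (e.g. TimeValue) differ across
7.x/8.x.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.logging.log4j.Logger;
    import org.elasticsearch.action.support.PlainActionFuture;
    import org.elasticsearch.cluster.ClusterState;
    import org.elasticsearch.cluster.ClusterStateObserver;
    import org.elasticsearch.cluster.service.ClusterService;
    import org.elasticsearch.common.util.concurrent.ThreadContext;
    import org.elasticsearch.core.TimeValue;

    final class ClusterFormationWait {

        // Wait for a cluster state with an elected master and the expected
        // number of nodes before falling back to the busy-assert loop.
        static void awaitClusterFormed(ClusterService clusterService, Logger logger,
                                       ThreadContext threadContext, int expectedNodeCount) {
            PlainActionFuture<Void> future = PlainActionFuture.newFuture();
            ClusterStateObserver observer = new ClusterStateObserver(
                    clusterService, TimeValue.timeValueSeconds(60), logger, threadContext);
            observer.waitForNextChange(new ClusterStateObserver.Listener() {
                @Override
                public void onNewClusterState(ClusterState state) {
                    future.onResponse(null); // predicate matched: cluster formed
                }

                @Override
                public void onClusterServiceClose() {
                    future.onFailure(new IllegalStateException("cluster service closed"));
                }

                @Override
                public void onTimeout(TimeValue timeout) {
                    future.onFailure(new TimeoutException());
                }
            }, state -> state.nodes().getMasterNodeId() != null
                    && state.nodes().getSize() == expectedNodeCount);
            // Block until a matching state arrives instead of sleeping in a loop.
            future.actionGet(30, TimeUnit.SECONDS);
        }
    }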

@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Delivery/Build Build or test infrastructure v8.0.0 v7.16.0 labels Aug 24, 2021
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Aug 24, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-delivery (Team:Delivery)

@original-brownbear (Contributor, Author)

Jenkins run elasticsearch-ci/packaging-tests-windows-sample (unrelated Windows worker randomness)

@DaveCTurner (Contributor) left a comment

LGTM, left a couple of suggestions


    @Override
    public void onTimeout(TimeValue timeout) {
        future.onFailure(new TimeoutException());

@DaveCTurner (Contributor)

Suggested change:
-        future.onFailure(new TimeoutException());
+        assert false : "onTimeout called with no timeout set";

@original-brownbear (Contributor, Author)

This isn't quite correct, I think: the default timeout on the observer is 60s, isn't it?

@DaveCTurner (Contributor)

Oh right, so it is. I saw lots of @Nullable things but missed that one place. Still, we don't want the observer to time out, do we? We should pass null to avoid enqueueing the timeout task.

@original-brownbear (Contributor, Author)

I guess it doesn't matter much, since we only wait on the future for 30s, so we won't ever see those 60s elapse. But you're right, it's much cleaner not to enqueue any timeout :)
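
Sketching the outcome of this thread, reusing the illustrative names from the earlier sketch after
the PR description: pass null as the observer timeout so that no timeout task is enqueued at all,
and downgrade onTimeout to an assertion since it can then never fire.

    ClusterStateObserver observer = new ClusterStateObserver(
            clusterService, null, logger, threadContext); // null: no timeout task enqueued
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
        @Override
        public void onNewClusterState(ClusterState state) {
            future.onResponse(null);
        }

        @Override
        public void onClusterServiceClose() {
            future.onFailure(new IllegalStateException("cluster service closed"));
        }

        @Override
        public void onTimeout(TimeValue timeout) {
            // Unreachable with a null observer timeout; the caller's 30s
            // bound on the future still limits the overall wait.
            assert false : "onTimeout called although no timeout was set";
        }
    }, state -> state.nodes().getMasterNodeId() != null
            && state.nodes().getSize() == expectedNodeCount);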

@original-brownbear (Contributor, Author)

Thanks David!

@original-brownbear original-brownbear merged commit 706ccbd into elastic:master Aug 25, 2021
@original-brownbear original-brownbear deleted the faster-test-cluster-formation branch August 25, 2021 03:39
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 25, 2021
…ter Tests (elastic#76884)

original-brownbear added a commit that referenced this pull request Aug 25, 2021
…ter Tests (#76884) (#76908)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 29, 2022
Since elastic#76884 in `InternalTestCluster#validateClusterFormed` we wait for
a correctly-sized cluster state to be applied before entering the
`assertBusy()` loop to wait for the cluster state to be exactly right
everywhere. Today we do this by injecting a cluster state observer into
one of the nodes which waits for a cluster state containing a master and
the right number of nodes. With this commit we move to using the cluster
health API which can do the same thing.

By this point any extra nodes have stopped, but there might still be a
stale join request for one of those nodes in the master's queue. This
commit addresses this by also waiting for the master queue to be empty.

Closes elastic#81830
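
A hedged sketch of that health-API wait; the class and method names here are illustrative, and
import paths vary across versions. setWaitForNodes covers the elected master and the node count,
while setWaitForEvents(Priority.LANGUID) only returns once all pending master-service tasks, down
to the lowest priority, have been processed, which handles the stale join request.

    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.Priority;

    final class HealthBasedFormationWait {

        // Wait via the cluster health API instead of a cluster state observer.
        static void awaitClusterFormedViaHealth(Client client, int expectedNodeCount) {
            ClusterHealthResponse health = client.admin().cluster().prepareHealth()
                    .setWaitForNodes(Integer.toString(expectedNodeCount)) // master elected, N nodes joined
                    .setWaitForEvents(Priority.LANGUID)                   // master task queue drained
                    .get();
            assert health.isTimedOut() == false : "cluster failed to form";
        }
    }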
DaveCTurner added a commit that referenced this pull request May 6, 2022
DaveCTurner added a commit that referenced this pull request May 6, 2022
@original-brownbear original-brownbear restored the faster-test-cluster-formation branch April 18, 2023 20:42