Conversation

@original-brownbear (Contributor)

I randomly noticed this recently when trying to reproduce a test failure. We do a lot of sleeping
when validating that the cluster has formed if that process happens to be slow (which it often is,
due to disk interaction on node starts and the like). By reusing the approach for waiting on a
cluster state, we rarely if ever need to enter the busy-assert loop, and we can remove all of those
sleeps, shaving a few seconds here and there off internal cluster test runs (at the cost of minimal
added complexity).
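
For illustration, a minimal sketch of such an observer-based wait, assuming access to a node's
ClusterService: the class and method names, the predicate, and the 30s bound are assumptions for
this sketch rather than the PR's exact code, and import paths (e.g. TimeValue) differ across
7.x/8.x.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.logging.log4j.Logger;
    import org.elasticsearch.action.support.PlainActionFuture;
    import org.elasticsearch.cluster.ClusterState;
    import org.elasticsearch.cluster.ClusterStateObserver;
    import org.elasticsearch.cluster.service.ClusterService;
    import org.elasticsearch.common.util.concurrent.ThreadContext;
    import org.elasticsearch.core.TimeValue;

    final class ClusterFormationWait {

        // Wait for a cluster state with an elected master and the expected
        // number of nodes before falling back to the busy-assert loop.
        static void awaitClusterFormed(ClusterService clusterService, Logger logger,
                                       ThreadContext threadContext, int expectedNodeCount) {
            PlainActionFuture<Void> future = PlainActionFuture.newFuture();
            ClusterStateObserver observer = new ClusterStateObserver(
                    clusterService, TimeValue.timeValueSeconds(60), logger, threadContext);
            observer.waitForNextChange(new ClusterStateObserver.Listener() {
                @Override
                public void onNewClusterState(ClusterState state) {
                    future.onResponse(null); // predicate matched: cluster formed
                }

                @Override
                public void onClusterServiceClose() {
                    future.onFailure(new IllegalStateException("cluster service closed"));
                }

                @Override
                public void onTimeout(TimeValue timeout) {
                    future.onFailure(new TimeoutException());
                }
            }, state -> state.nodes().getMasterNodeId() != null
                    && state.nodes().getSize() == expectedNodeCount);
            // Block until a matching state arrives instead of sleeping in a loop.
            future.actionGet(30, TimeUnit.SECONDS);
        }
    }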

@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Delivery/Build Build or test infrastructure v8.0.0 v7.16.0 labels Aug 24, 2021
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Aug 24, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-delivery (Team:Delivery)

@original-brownbear (Contributor, Author)

Jenkins run elasticsearch-ci/packaging-tests-windows-sample (unrelated Windows worker randomness)

@DaveCTurner (Contributor) left a comment

LGTM, left a couple of suggestions


    @Override
    public void onTimeout(TimeValue timeout) {
        future.onFailure(new TimeoutException());

@DaveCTurner (Contributor)

Suggested change:
-        future.onFailure(new TimeoutException());
+        assert false : "onTimeout called with no timeout set";

@original-brownbear (Contributor, Author)

This isn't quite correct, I think: the default timeout on the observer is 60s, isn't it?

@DaveCTurner (Contributor)

Oh right, so it is. I saw lots of @Nullable things but missed that one place. Still, we don't want the observer to time out, do we? We should pass null to avoid enqueueing the timeout task.

@original-brownbear (Contributor, Author)

I guess it doesn't matter much, since we only wait on the future for 30s, so we won't ever see those 60s elapse. But you're right, it's much cleaner not to enqueue any timeout :)
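
Sketching the outcome of this thread, reusing the illustrative names from the earlier sketch after
the PR description: pass null as the observer timeout so that no timeout task is enqueued at all,
and downgrade onTimeout to an assertion since it can then never fire.

    ClusterStateObserver observer = new ClusterStateObserver(
            clusterService, null, logger, threadContext); // null: no timeout task enqueued
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
        @Override
        public void onNewClusterState(ClusterState state) {
            future.onResponse(null);
        }

        @Override
        public void onClusterServiceClose() {
            future.onFailure(new IllegalStateException("cluster service closed"));
        }

        @Override
        public void onTimeout(TimeValue timeout) {
            // Unreachable with a null observer timeout; the caller's 30s
            // bound on the future still limits the overall wait.
            assert false : "onTimeout called although no timeout was set";
        }
    }, state -> state.nodes().getMasterNodeId() != null
            && state.nodes().getSize() == expectedNodeCount);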

@original-brownbear (Contributor, Author)

Thanks David!

@original-brownbear original-brownbear merged commit 706ccbd into elastic:master Aug 25, 2021
@original-brownbear original-brownbear deleted the faster-test-cluster-formation branch August 25, 2021 03:39
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 25, 2021
…ter Tests (elastic#76884)

original-brownbear added a commit that referenced this pull request Aug 25, 2021
…ter Tests (#76884) (#76908)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 29, 2022
Since elastic#76884 in `InternalTestCluster#validateClusterFormed` we wait for
a correctly-sized cluster state to be applied before entering the
`assertBusy()` loop to wait for the cluster state to be exactly right
everywhere. Today we do this by injecting a cluster state observer into
one of the nodes which waits for a cluster state containing a master and
the right number of nodes. With this commit we move to using the cluster
health API which can do the same thing.

By this point any extra nodes have stopped, but there might still be a
stale join request for one of those nodes in the master's queue. This
commit addresses this by also waiting for the master queue to be empty.

Closes elastic#81830
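
A hedged sketch of that health-API wait; the class and method names here are illustrative, and
import paths vary across versions. setWaitForNodes covers the elected master and the node count,
while setWaitForEvents(Priority.LANGUID) only returns once all pending master-service tasks, down
to the lowest priority, have been processed, which handles the stale join request.

    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.Priority;

    final class HealthBasedFormationWait {

        // Wait via the cluster health API instead of a cluster state observer.
        static void awaitClusterFormedViaHealth(Client client, int expectedNodeCount) {
            ClusterHealthResponse health = client.admin().cluster().prepareHealth()
                    .setWaitForNodes(Integer.toString(expectedNodeCount)) // master elected, N nodes joined
                    .setWaitForEvents(Priority.LANGUID)                   // master task queue drained
                    .get();
            assert health.isTimedOut() == false : "cluster failed to form";
        }
    }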
DaveCTurner added a commit that referenced this pull request May 6, 2022
DaveCTurner added a commit that referenced this pull request May 6, 2022
@original-brownbear original-brownbear restored the faster-test-cluster-formation branch April 18, 2023 20:42