Skip to content

Conversation

@fcofdez
Copy link
Contributor

@fcofdez fcofdez commented Aug 10, 2022

This commit adds a new system property tests.configure_test_clusters_with_one_processor
that configures Rest integration tests to run with 1 processor, this should help finding possible
deadlocks (if any) when the Elasticsearch nodes have node.processors set to 1.

With this change it should be possible to catch possible deadlocks
that would surface when a node runs in a 1 CPU hosts or node.processors
is configured to use just 1 processor when the ThreadPools are configured/
@fcofdez fcofdez added >test Issues or PRs that are addressing/adding tests :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.5.0 labels Aug 10, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@fcofdez
Copy link
Contributor Author

fcofdez commented Aug 10, 2022

I haven't set it up for Rest tests as the nodes are configured in gradle and we don't have access to the test random seed there.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Aug 10, 2022

configured in gradle and we don't have access to the test random seed

Could you do something like this?

boolean proxyMode = (new Random(Long.parseUnsignedLong(BuildParams.testSeed.tokenize(':').get(0), 16))).nextBoolean()

@fcofdez fcofdez requested a review from DaveCTurner August 10, 2022 15:51
@fcofdez fcofdez changed the title Set node.processors lower bound to 1 in InternalTestCluster#getRandomNodeSettings Set node.processors lower bound to 1 in integration tests Aug 10, 2022
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fcofdez
Copy link
Contributor Author

fcofdez commented Aug 11, 2022

I discussed this with @mark-vieira offline and we decided to define a new CI job that would run all the tests using 1 processor a few times a day, the arguments in favour of that approach were:

  • Introducing this change based on randomness could make difficult to identify the root cause (is this test failure cased by a deadlock? or is it something else?) while if we get the same test failure in the specific CI job it would be easier to triage.
  • There were some concerns about the slowness of tests running with 1 processor configured, but for most tests that shouldn't be problematic.

As cons:

  • We would run the tests with 1 processor less often, meaning that possible problems would take longer to surface.

The idea is to change this PR and let the test code accept a new system property that would set node.processors to 1 and @mark-vieira will create the CI job to run the tests with that flag.

Are there any objections to this @henningandersen @DaveCTurner?

@DaveCTurner
Copy link
Contributor

Are there any objections to this?

No, that also sounds good to me.

@henningandersen
Copy link
Contributor

Are there any objections to this

I'd like to keep the randomness in InternalTestCluster (since we always had it), but do not mind for the REST tests (though it seems a bit overkill for something we expect not to result in more failures, nor a slowdown).

Copy link
Contributor

@mark-vieira mark-vieira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment that I'd like @breskeby to provide feedback on. Otherwise LGTM.

}

protected void indexRandomDocs(String index, int numdocs) throws InterruptedException {
indexRandomDocs(index, numdocs, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these simply test fixes related to this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct

builder.put(
EsExecutors.NODE_PROCESSORS_SETTING.getKey(),
1 + random.nextInt(Math.min(4, Runtime.getRuntime().availableProcessors()))
RandomNumbers.randomIntBetween(random, 1, Runtime.getRuntime().availableProcessors())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so looks like we were already doing this sort of thing for internal cluster tests.

}

private boolean shouldConfigureTestClustersWithOneProcessor() {
return Boolean.parseBoolean(System.getProperty("tests.configure_test_clusters_with_one_processor", "false"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@breskeby Can you confirm it's no longer necessary to use provider factory to get system properties to ensure things are config cache compliant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mark-vieira yes its not required anymore with newer gradle versions

@breskeby breskeby self-requested a review August 11, 2022 21:32
@fcofdez
Copy link
Contributor Author

fcofdez commented Aug 17, 2022

@elasticmachine update branch

@fcofdez fcofdez changed the title Set node.processors lower bound to 1 in integration tests Add the ability to run Rest integration tests with 1 allocated processor Aug 17, 2022
@fcofdez fcofdez merged commit cbea639 into elastic:main Aug 17, 2022
@fcofdez
Copy link
Contributor Author

fcofdez commented Aug 17, 2022

Thanks all!

elasticsearchmachine pushed a commit that referenced this pull request Aug 23, 2022
…89562)

Ass a follow up to #89234, we want to also include this system property
in reproduction lines so that developers are actually enabling this
setting when trying to reproduce these failures.
jakelandis pushed a commit to jakelandis/elasticsearch that referenced this pull request Aug 24, 2022
…lastic#89562)

Ass a follow up to elastic#89234, we want to also include this system property
in reproduction lines so that developers are actually enabling this
setting when trying to reproduce these failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants