-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Description
This is a continuation of #69641
On 7.x it appears that for reasons yet to be determined, the regex used by "cat.nodes/10_basic/Test cat nodes output" is extremely expensive, to the point of causing tests to timeout.
On a modern desktop CPU the following command on 7.x can take over 1 minute to run
./gradlew ':qa:mixed-cluster:v7.11.2#mixedClusterTest' -Dtests.timestamp=$(date +%S) -Dtests.method="test {p0=cat.nodes/10_basic/Test cat nodes output}" --info
For example:
org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT > test {p0=cat.nodes/10_basic/Test cat nodes output} STANDARD_OUT
[2021-03-02T03:46:19,108][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=cat.nodes/10_basic/Test cat nodes output] before test
[2021-03-02T03:46:19,189][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] initializing REST clients against [http://[::1]:45603, http://127.0.0.1:46583, http://[::1]:43459, http://127.0.0.1:44891, http://[::1]:33689, http://127.0.0.1:33833, http://[::1]:36341, http://127.0.0.1:43035]
[2021-03-02T03:46:19,617][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] initializing client, minimum es version [7.11.2], master version, [7.11.2], hosts [http://[::1]:45603, http://127.0.0.1:46583, http://[::1]:43459, http://127.0.0.1:44891, http://[::1]:33689, http://127.0.0.1:33833, http://[::1]:36341, http://127.0.0.1:43035], os [Ubuntu 20.04.1 LTS]
[2021-03-02T03:48:19,626][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] Clearing template [.logstash-management]
[2021-03-02T03:48:20,020][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=cat.nodes/10_basic/Test cat nodes output] after test
It has been observed that when running the full test suite that (on the same hardware ^^) to take 4 minutes.
Most all evidence and some testing done be @benwtrent lay blame at the regular expression(s) used in this test.
Further, I have found some successful tests that took 15m to perform this test:

It appears that this started with the innocent change introduced by #68605, which merely adds a character to set of character constants. During that 4m local test, I was able to grab the output of a _cat/nodes while the CPU was pegged with lots of work happening in regex (as evident by the stack)
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1 53 99 2 1.99 4.27 4.11 cdfhilmrstw - v7.11.2-0
127.0.0.1 62 99 2 1.99 4.27 4.11 cdhilmrstw - v7.11.2-3
127.0.0.1 55 99 2 1.99 4.27 4.11 cdfhilmrstw - v7.11.2-1
127.0.0.1 76 99 2 1.99 4.27 4.11 cdhilmrstw * v7.11.2-2
Oddly this is not an issue for master or other branches and seems specific to 7.x. The equivalent command from master is :
./gradlew ':qa:mixed-cluster:v7.13.0#mixedClusterTest' -Dtests.timestamp=$(date +%S) -Dtests.method="test {p0=cat.nodes/10_basic/Test cat nodes output}" --info
and executes in under 1s on that same hardware ^^.
Build scan:
https://gradle-enterprise.elastic.co/s/az2nnpcpm5dd6
Repro line:
On 7.x
./gradlew ':qa:mixed-cluster:v7.11.2#mixedClusterTest' -Dtests.timestamp=$(date +%S) -Dtests.method="test {p0=cat.nodes/10_basic/Test cat nodes output}" --info
Reproduces locally?:
The slowness does, but only up to 4m so far. I have not been able to
Applicable branches:
7.x
Failure history:
build-stats
and
Failure excerpt:
You will see "p0=cat.nodes/10_basic/Test cat nodes output" as the last test to execute. In the server logs, you will see a long pause with nothing in the logs until the container is killed by the test timeout. No relevant error.