ArrayCompareConditionSearchTests thread leak suite failure #38875

@talevy

The ArrayCompareConditionSearchTests test suite is flaky: the integ cluster's
SchedulerEngine thread, trigger_engine_scheduler, does not shut down in time, so the suite fails with a thread leak.

failure instance in CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=openjdk12,nodes=immutable&&linux&&docker/240/console

occurrences: 6 in the last 6 months.

stacktrace:

ERROR   0.00s J2 | ArrayCompareConditionSearchTests (suite) <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.elasticsearch.xpack.watcher.condition.ArrayCompareConditionSearchTests: 
   >    1) Thread[id=337, name=elasticsearch[node_sm1][trigger_engine_scheduler][T#1], state=TIMED_WAITING, group=TGRP-ArrayCompareConditionSearchTests]
   >         at java.base@12/jdk.internal.misc.Unsafe.park(Native Method)
   >         at java.base@12/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:235)
   >         at java.base@12/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
   >         at java.base@12/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
   >         at java.base@12/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
   >         at java.base@12/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
   >         at java.base@12/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
   >         at java.base@12/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   >         at java.base@12/java.lang.Thread.run(Thread.java:835)
   > 	at __randomizedtesting.SeedInfo.seed([3D43839813A5AEA5]:0)Throwable #2: com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
   >    1) Thread[id=337, name=elasticsearch[node_sm1][trigger_engine_scheduler][T#1], state=TIMED_WAITING, group=TGRP-ArrayCompareConditionSearchTests]
   >         at java.base@12/jdk.internal.misc.Unsafe.park(Native Method)
   >         at java.base@12/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:235)
   >         at java.base@12/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
   >         at java.base@12/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
   >         at java.base@12/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
   >         at java.base@12/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
   >         at java.base@12/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
   >         at java.base@12/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   >         at java.base@12/java.lang.Thread.run(Thread.java:835)
   > 	at __randomizedtesting.SeedInfo.seed([3D43839813A5AEA5]:0)
Completed [111/140] on J2 in 18.31s, 1 test, 2 errors <<< FAILURES!

RollupIT had the same problem because it also relies on the SchedulerEngine. The fix there was to move away from integ tests and rewrite the test as REST tests. It looks like this test suite can do the same.
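
For context, here is a minimal JDK-only sketch of the mechanism visible in the stack trace: a ScheduledThreadPoolExecutor worker with a pending periodic task sits in TIMED_WAITING inside DelayedWorkQueue.take, and it only goes away once the executor is shut down and the caller waits for termination. This is not the SchedulerEngine code; the class name, method name, and thread name below are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Elasticsearch code.
final class SchedulerShutdownSketch {

    // Stops the scheduler and blocks until its worker threads have exited,
    // or until the timeout elapses. Returns true only if the pool terminated.
    public static boolean stopAndAwait(ScheduledThreadPoolExecutor scheduler,
                                       long timeout, TimeUnit unit) {
        // shutdownNow() drains the queue and interrupts the workers,
        // including one parked in DelayedWorkQueue.take().
        scheduler.shutdownNow();
        try {
            return scheduler.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and report failure.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Single worker with a recognizable (assumed) thread name.
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(
            1, r -> new Thread(r, "trigger_engine_scheduler_sketch"));

        // A far-future periodic task keeps the worker parked in a timed wait,
        // similar to the thread reported in the failure.
        scheduler.scheduleAtFixedRate(() -> {}, 1, 1, TimeUnit.HOURS);

        boolean terminated = stopAndAwait(scheduler, 10, TimeUnit.SECONDS);
        System.out.println("terminated=" + terminated);
    }
}
```

If shutdown is requested without waiting for termination, a leak checker that runs immediately afterwards can still observe the worker thread, which is consistent with "not shutting down in time" above. Whether the SchedulerEngine's stop path behaves this way is an assumption here, not something this issue confirms.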
