Thread leak in TribeNode when a cluster is offline #19370

@escheie

Description

Elasticsearch version: 2.3.4
JVM version: 1.8.0_91
OS version: RedHat 6.5

We are using the TribeNode feature to enable search across a number of geographically distributed Elasticsearch clusters. Occasionally, when we take one of these clusters completely offline, our TribeNode hits the following exception:

java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
        at org.elasticsearch.threadpool.ThreadPool$ThreadedRunnable.run(ThreadPool.java:676)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:640)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

This exception is thrown because the TribeNode exhausts the available native threads: it creates a new thread every couple of seconds, and each of those threads then blocks forever. Below is the stack trace of the leaked threads:

java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
       java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
       java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
       org.elasticsearch.common.util.concurrent.KeyedLock.acquire(KeyedLock.java:75)
       org.elasticsearch.transport.netty.NettyTransport.disconnectFromNode(NettyTransport.java:1063)
       org.elasticsearch.transport.TransportService.disconnectFromNode(TransportService.java:274)
       org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$2$1.doRun(UnicastZenPing.java:258)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:745)
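The mechanism behind the leaked stack trace above can be sketched in plain Java (hypothetical names, not Elasticsearch code; this only illustrates the pattern of an unbounded cached pool combined with tasks that park forever on a per-node lock that is held by a stuck connection attempt):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.locks.ReentrantLock;

public class ThreadLeakSketch {
    public static void main(String[] args) throws Exception {
        // Simulates the per-node KeyedLock, held by a connect attempt
        // to the unreachable node that never completes.
        ReentrantLock nodeLock = new ReentrantLock();
        nodeLock.lock();

        // A cached pool is unbounded: with every worker blocked, each new
        // task spawns a fresh thread instead of reusing one.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();

        // Each "ping round" submits a disconnect task that parks on the held lock.
        for (int round = 0; round < 20; round++) {
            pool.submit(() -> {
                nodeLock.lock(); // never acquired; the worker parks forever
                try {
                    // disconnect work would go here
                } finally {
                    nodeLock.unlock();
                }
            });
        }
        Thread.sleep(500); // let the pool spin up its workers

        // Every blocked task pinned its own thread: the pool grows one
        // thread per round, which is the unbounded growth seen in ps output.
        System.out.println("leaked worker threads: " + pool.getPoolSize());

        nodeLock.unlock(); // release so the JVM can exit cleanly
        pool.shutdown();
    }
}
```

In the real node the "rounds" are the periodic unicast pings, so the thread count climbs indefinitely until native thread creation fails.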

Steps to reproduce:
Create a TribeNode configuration in which one cluster is offline. It's not enough for the Elasticsearch processes to be shut down while the machine stays online: the nodes listed in discovery.zen.ping.unicast.hosts for the offline cluster must be unreachable and not respond to ping/connection attempts. Here is a simple configuration I was able to use to reproduce the problem.


---
cluster.name: "thread-leak-test"
node.name: "thread-leak-node"
http.port: "9201"
http.host: "127.0.0.1"
tribe:
  online-cluster:
    cluster.name: "online-cluster"
    discovery.zen.ping.unicast.hosts:
    - "localhost"
  offline-cluster:
    cluster.name: "offline-cluster"
    discovery.zen.ping.unicast.hosts:
    - "10.10.10.10"

Start the Tribe node. Observe that the number of threads continues to grow unbounded (ps -m <pid> | wc -l) until OutOfMemoryError: unable to create new native thread exceptions are thrown.

This issue appears similar to the problem described in #8057.
