Thread leak in TribeNode when a cluster is offline #19370

@escheie

Description

Elasticsearch version: 2.3.4
JVM version: 1.8.0_91
OS version: RedHat 6.5

We are using the TribeNode feature to enable search across a number of geographically distributed Elasticsearch clusters. Occasionally, when we take one of these clusters completely offline, our TribeNode hits the following exception:

java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
        at org.elasticsearch.threadpool.ThreadPool$ThreadedRunnable.run(ThreadPool.java:676)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:640)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

This exception is thrown because the TribeNode exhausts the available native threads: it creates a new thread every couple of seconds, and each of those threads then blocks forever. Below is the stack trace of the leaked threads:

java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
       java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
       java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
       org.elasticsearch.common.util.concurrent.KeyedLock.acquire(KeyedLock.java:75)
       org.elasticsearch.transport.netty.NettyTransport.disconnectFromNode(NettyTransport.java:1063)
       org.elasticsearch.transport.TransportService.disconnectFromNode(TransportService.java:274)
       org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$2$1.doRun(UnicastZenPing.java:258)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:745)
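The mechanism behind the leaked stack trace above can be sketched in plain Java (hypothetical names, not Elasticsearch code; this only illustrates the pattern of an unbounded cached pool combined with tasks that park forever on a per-node lock that is held by a stuck connection attempt):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.locks.ReentrantLock;

public class ThreadLeakSketch {
    public static void main(String[] args) throws Exception {
        // Simulates the per-node KeyedLock, held by a connect attempt
        // to the unreachable node that never completes.
        ReentrantLock nodeLock = new ReentrantLock();
        nodeLock.lock();

        // A cached pool is unbounded: with every worker blocked, each new
        // task spawns a fresh thread instead of reusing one.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();

        // Each "ping round" submits a disconnect task that parks on the held lock.
        for (int round = 0; round < 20; round++) {
            pool.submit(() -> {
                nodeLock.lock(); // never acquired; the worker parks forever
                try {
                    // disconnect work would go here
                } finally {
                    nodeLock.unlock();
                }
            });
        }
        Thread.sleep(500); // let the pool spin up its workers

        // Every blocked task pinned its own thread: the pool grows one
        // thread per round, which is the unbounded growth seen in ps output.
        System.out.println("leaked worker threads: " + pool.getPoolSize());

        nodeLock.unlock(); // release so the JVM can exit cleanly
        pool.shutdown();
    }
}
```

In the real node the "rounds" are the periodic unicast pings, so the thread count climbs indefinitely until native thread creation fails.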

Steps to reproduce:
Create a TribeNode configuration in which one cluster is offline. It's not enough for the Elasticsearch processes to be shut down while the machine stays online: the nodes listed in discovery.zen.ping.unicast.hosts for the offline cluster must be unreachable and not respond to ping/connection attempts. Here is a simple configuration I was able to use to reproduce the problem.


---
cluster.name: "thread-leak-test"
node.name: "thread-leak-node"
http.port: "9201"
http.host: "127.0.0.1"
tribe:
  online-cluster:
    cluster.name: "online-cluster"
    discovery.zen.ping.unicast.hosts:
    - "localhost"
  offline-cluster:
    cluster.name: "offline-cluster"
    discovery.zen.ping.unicast.hosts:
    - "10.10.10.10"

Start the Tribe node. Observe that the number of threads continues to grow unbounded (ps -m <pid> | wc -l) until OutOfMemoryError: unable to create new native thread exceptions are thrown.

This issue appears similar to the problem described in #8057.
