Hi,
We're experiencing a critical production issue in Elasticsearch 6.2.2 related to open_file_descriptors. The cluster is as close to an exact replica of a 5.2.2 cluster as possible, and documents are indexed into both clusters in parallel.
While the indexing performance of the new cluster is at least as good as that of the 5.2.2 cluster, the open_file_descriptors count on the new nodes is reaching record-breaking levels, especially when compared to v5.2.2.
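For reference, this is roughly how we sample the per-node counts via the nodes stats API; a minimal sketch, assuming an unauthenticated http://localhost:9200 endpoint (a placeholder, not our real setup):

```python
# Minimal sketch: poll open_file_descriptors per node via the nodes stats API.
# The endpoint below is a placeholder.
import json
import time
import urllib.request

ES_HOST = "http://localhost:9200"  # hypothetical endpoint


def sample_open_fds(host=ES_HOST):
    """Return {node_name: open_file_descriptors} from GET _nodes/stats/process."""
    url = (host + "/_nodes/stats/process"
           "?filter_path=nodes.*.name,nodes.*.process.open_file_descriptors")
    with urllib.request.urlopen(url) as resp:
        stats = json.load(resp)
    return {
        node["name"]: node["process"]["open_file_descriptors"]
        for node in stats["nodes"].values()
    }


if __name__ == "__main__":
    # Sample once a minute; nodes whose count only ever grows stand out quickly.
    while True:
        print(time.strftime("%H:%M:%S"), sample_open_fds())
        time.sleep(60)
```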
Elasticsearch version:
6.2.2
Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3
JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
All machines have a ulimit of 65536 open files, as recommended by the official documentation.
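As a sanity check, here is a small sketch (Linux only; the pgrep pattern is an assumption) for confirming the limit the running Elasticsearch process actually inherited, since it can differ from the shell's ulimit if a service manager overrides it:

```python
# Sketch: read the "Max open files" limit of the running Elasticsearch process
# from /proc/<pid>/limits. The pgrep pattern is an assumption.
import subprocess

pid = subprocess.check_output(
    ["pgrep", "-f", "org.elasticsearch.bootstrap.Elasticsearch"]
).split()[0].decode()

with open(f"/proc/{pid}/limits") as limits:
    for line in limits:
        if line.startswith("Max open files"):
            print(line.rstrip())
```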
All nodes in the v5.2.2 cluster stay at up to ~4,500 open_file_descriptors. The new v6.2.2 nodes are split: some also stay at up to ~4,500 open_file_descriptors, while others keep opening more and more file descriptors until they hit the limit and crash with java.nio.file.FileSystemException: Too many open files -
[WARN ][o.e.c.a.s.ShardStateAction] [prod-elasticsearch-master-002] [newlogs_20180315-01][0] received shard failed for shard id [[newlogs_20180315-01][0]], allocation id [G8NGOPNHRNuqNKYKzfiPcg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [FileSystemException[/mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files]]
java.nio.file.FileSystemException: /mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_151]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_151]
at org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:179) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.TranslogWriter.writeCheckpoint(TranslogWriter.java:419) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:362) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:699) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:724) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$3.write(IndexShard.java:2336) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:107) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcess(AsyncIOProcessor.java:99) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:82) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.sync(IndexShard.java:2358) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.run(TransportWriteAction.java:362) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.<init>(TransportWriteAction.java:160) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:133) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:110) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1034) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1012) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:103) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:359) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:299) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:975) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:972) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2220) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:984) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:98) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:320) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:295) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:282) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
After this exception, some nodes log a burst of further exceptions and then drop back down to a lower open file descriptor count; other times they simply crash. The issue keeps recurring, alternating between nodes.
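If it helps with triage, here is a rough sketch of how a leaking node's descriptor table could be broken down by target type (assumes Linux, permission to read /proc/<pid>/fd, and purely illustrative path-based buckets):

```python
# Sketch: group a process's open file descriptors by what they point at,
# e.g. translog files vs. other index files vs. sockets.
import collections
import os
import sys


def classify(target):
    """Bucket a /proc/<pid>/fd symlink target; the buckets are illustrative."""
    if "translog" in target:
        return "translog"
    if "/indices/" in target:
        return "index files"
    if target.startswith("socket:"):
        return "sockets"
    return "other"


def fd_histogram(pid):
    counts = collections.Counter()
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed while scanning
        counts[classify(target)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python fd_histogram.py <elasticsearch-pid>
    print(fd_histogram(sys.argv[1]))
```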
Steps to reproduce:
None yet. The issue only appears at production scale, so it is not easy to reproduce.
I'd be happy to provide any additional details that would help.
Thanks!