Description
This happened during a rolling restart needed for a security upgrade. The cluster is running Elasticsearch 2.3.3.
All nodes are running the same JVM version (OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)).
A RemoteTransportException seemed to "loop" between two nodes, causing Elasticsearch to log bigger and bigger exception traces: each new RemoteTransportException appeared to be created with the previous one, together with all of its causes, as its own cause.
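To make the pattern concrete, here is a minimal standalone sketch of what I think is happening. This is not Elasticsearch code; the class names and the retry loop are made up for illustration. If every hop between the two nodes wraps the previous (already wrapped) failure instead of the original root cause, the chain grows by two wrappers per round trip:

```java
// Minimal standalone sketch (NOT Elasticsearch code) of the wrapping pattern
// described above. Class names and the retry loop are hypothetical.
public class CauseChainGrowthSketch {

    // Stand-in for a transport-level wrapper like RemoteTransportException.
    static class RemoteWrapper extends RuntimeException {
        RemoteWrapper(String node, Throwable cause) {
            super("[" + node + "] remote failure", cause);
        }
    }

    // Count how many throwables are chained via getCause().
    static int chainLength(Throwable t) {
        int n = 0;
        for (Throwable c = t; c != null; c = c.getCause()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Root cause seen in the traces: the shard is not yet started.
        Throwable failure = new IllegalStateException(
                "CurrentState[POST_RECOVERY] operation only allowed when started/recovering");

        // Each retry bounces the request between elastic1036 and elastic1045;
        // if each hop wraps the previous *wrapped* failure rather than the
        // original cause, the chain grows by two wrappers per round trip.
        for (int retry = 1; retry <= 5; retry++) {
            failure = new RemoteWrapper("elastic1045", failure);
            failure = new RemoteWrapper("elastic1036", failure);
            System.out.println("after round trip " + retry + ": "
                    + chainLength(failure) + " chained throwables");
        }
    }
}
```

If that is roughly what the transport layer does on retry, it would explain why each pair of traces below has two more causes than the previous pair.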
The first trace (on elastic1045) was:
[2016-06-30 08:34:20,553][WARN ][org.elasticsearch ] Exception cause unwrapping ran for 10 levels...
RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s][p]]]; nested: IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]];
[[11 lines of Caused by: RemoteTransportException]]
Caused by: [itwiki_general_1415230945][[itwiki_general_1415230945][2]] IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]]
at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:1062)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:593)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
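For reference, the "Exception cause unwrapping ran for 10 levels..." warning at the top of the trace suggests a depth-limited cause unwrap somewhere in the logging path. Below is a rough sketch of that kind of check; it is not the actual Elasticsearch helper, and the 10-level limit is only an assumption taken from the log message:

```java
// Rough sketch of a depth-limited cause unwrap; NOT the actual Elasticsearch
// helper. The 10-level limit is an assumption taken from the log message.
public final class UnwrapCauseSketch {
    private static final int MAX_LEVELS = 10;

    static Throwable unwrapCause(Throwable t) {
        Throwable result = t;
        for (int level = 0; level < MAX_LEVELS; level++) {
            Throwable cause = result.getCause();
            if (cause == null || cause == result) {
                return result; // reached the root cause within the limit
            }
            result = cause;
        }
        // Chain is deeper than the limit: warn, as seen in the log above.
        System.err.println("Exception cause unwrapping ran for " + MAX_LEVELS + " levels...");
        return result;
    }

    public static void main(String[] args) {
        Throwable t = new RuntimeException("root cause");
        for (int i = 0; i < 15; i++) {
            t = new RuntimeException("wrapper " + i, t);
        }
        System.out.println(unwrapCause(t).getMessage()); // stops on a wrapper and warns
    }
}
```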
The second one (same root cause) appeared a few milliseconds later, also with 12 causes.
The third and fourth ones had 14 causes, the fifth and sixth 16, and so on: each round trip between the two nodes appears to add two more RemoteTransportException wrappers.
The last one I saw had 1982 chained causes.
The logs were nearly the same on elastic1036 (the master); the two nodes generated 27 GB of logs in a few minutes.
Surprisingly, the cluster was still performing relatively well, albeit with higher GC activity on these nodes.
Then, maybe an hour after the first trace, elastic1045 was dropped from the cluster:
[2016-06-30 09:48:25,953][INFO ][discovery.zen ] [elastic1045] master_left [{elastic1036}{DUOG0aGqQ3Gajr_wcFTOyw}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
It was immediately re-added and the log flood stopped.
I'll comment on this ticket if it happens again.