
Nested RemoteTransportExceptions flood the logs and fill the disk #19187

@nomoa

Description

This happened during a rolling restart needed for a security upgrade. The cluster is running Elasticsearch 2.3.3.
All nodes are running the same JVM version (OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)).

A RemoteTransportException seemed to "loop" between the 2 nodes, causing Elasticsearch to log bigger and bigger exception traces: on each hop a new RemoteTransportException appeared to be created, with the previous one (carrying all of its accumulated causes) attached as its cause.
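
For illustration, here is a minimal, self-contained sketch (plain Java, not the actual Elasticsearch transport code) of the pattern the traces suggest: on every bounce the receiving node wraps the incoming failure in a fresh RemoteTransportException, and because that failure already carries the whole previous chain, the nesting depth keeps growing with each hop.

```java
// Illustration only: a stand-in for the suspected wrapping loop between the
// two nodes. The real classes live in org.elasticsearch.transport / action.bulk.
public class NestedWrappingSketch {

    // Simplified stand-in for org.elasticsearch.transport.RemoteTransportException.
    static class RemoteTransportException extends RuntimeException {
        RemoteTransportException(String node, Throwable cause) {
            super("[" + node + "][indices:data/write/bulk[s]]", cause);
        }
    }

    public static void main(String[] args) {
        // Root cause, analogous to the IllegalIndexShardStateException below.
        Throwable failure = new IllegalStateException(
                "CurrentState[POST_RECOVERY] operation only allowed when started/recovering");
        String[] nodes = {"elastic1036", "elastic1045"};

        // Each bounce wraps the previous failure, which already carries all
        // earlier wrappers, so the chain never resets.
        for (int hop = 0; hop < 6; hop++) {
            failure = new RemoteTransportException(nodes[hop % 2], failure);
        }

        // Walking the cause chain shows it grows linearly with the number of
        // hops, while the text of each logged trace repeats every nested cause.
        int depth = 0;
        for (Throwable t = failure; t != null; t = t.getCause()) {
            depth++;
        }
        System.out.println("cause chain depth after 6 hops: " + depth); // prints 7
    }
}
```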

The first trace (on elastic1045) was:

[2016-06-30 08:34:20,553][WARN ][org.elasticsearch        ] Exception cause unwrapping ran for 10 levels...
RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s][p]]]; nested: IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]];
[[11 lines of Caused by: RemoteTransportException]]
Caused by: [itwiki_general_1415230945][[itwiki_general_1415230945][2]] IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]]
        at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:1062)
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:593)
        at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

The second one (same root cause) appeared a few milliseconds later, also with 12 causes.
The third and fourth ones had 14 causes, the fifth and sixth 16, and so on...
The last one I've seen had 1982 chained causes.

The logs were nearly the same on elastic1036 (the master), generating 27 GB of logs in a few minutes on both nodes.
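
For scale, a back-of-the-envelope sketch (the per-entry size is an assumption, not measured from the logs): if each round trip adds two causes and each depth shows up in two consecutive traces, a bounce sequence reaching 1982 causes implies roughly 985 round trips, and since every trace repeats all earlier causes, the logged volume grows quadratically with the number of bounces.

```java
// Rough estimate of the text produced by a single bounce sequence, assuming
// each "nested: RemoteTransportException[...]" entry is ~100 bytes (hypothetical).
public class LogGrowthEstimate {
    public static void main(String[] args) {
        final int firstCauses = 12;     // causes in the first traces (from the report)
        final int lastCauses = 1982;    // causes in the last observed trace
        final long bytesPerCause = 100; // assumed average size of one nested entry

        long totalBytes = 0;
        for (int causes = firstCauses; causes <= lastCauses; causes += 2) {
            totalBytes += 2L * causes * bytesPerCause; // each depth logged twice
        }
        int roundTrips = (lastCauses - firstCauses) / 2;
        System.out.printf("~%d round trips, ~%.0f MB from one bounce sequence%n",
                roundTrips, totalBytes / 1e6);
    }
}
```

Many such sequences running concurrently (one per failing bulk shard request) could plausibly add up to the volume seen here.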

Surprisingly, the cluster was still performing relatively well, though with higher GC activity on these nodes.

Then (maybe 1 hour after the first trace) elastic1045 was dropped from the cluster:

[2016-06-30 09:48:25,953][INFO ][discovery.zen            ] [elastic1045] master_left [{elastic1036}{DUOG0aGqQ3Gajr_wcFTOyw}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]

It was immediately re-added and the log flood stopped.

I'll comment on this ticket if it happens again.
