Description
This happened during a rolling restart needed for a security upgrade. The cluster is running Elasticsearch 2.3.3.
All nodes are running the same JVM version (OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)).
A RemoteTransportException seemed to "loop" between two nodes, causing Elasticsearch to log bigger and bigger exception traces: each new RemoteTransportException appeared to be created with the previous one, together with all of its causes, as its own cause.
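To make the pattern concrete, here is a minimal standalone sketch of what I think is happening. This is not Elasticsearch code; the class names and the retry loop are made up for illustration. If every hop between the two nodes wraps the previous (already wrapped) failure instead of the original root cause, the chain grows by two wrappers per round trip:

```java
// Minimal standalone sketch (NOT Elasticsearch code) of the wrapping pattern
// described above. Class names and the retry loop are hypothetical.
public class CauseChainGrowthSketch {

    // Stand-in for a transport-level wrapper like RemoteTransportException.
    static class RemoteWrapper extends RuntimeException {
        RemoteWrapper(String node, Throwable cause) {
            super("[" + node + "] remote failure", cause);
        }
    }

    // Count how many throwables are chained via getCause().
    static int chainLength(Throwable t) {
        int n = 0;
        for (Throwable c = t; c != null; c = c.getCause()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Root cause seen in the traces: the shard is not yet started.
        Throwable failure = new IllegalStateException(
                "CurrentState[POST_RECOVERY] operation only allowed when started/recovering");

        // Each retry bounces the request between elastic1036 and elastic1045;
        // if each hop wraps the previous *wrapped* failure rather than the
        // original cause, the chain grows by two wrappers per round trip.
        for (int retry = 1; retry <= 5; retry++) {
            failure = new RemoteWrapper("elastic1045", failure);
            failure = new RemoteWrapper("elastic1036", failure);
            System.out.println("after round trip " + retry + ": "
                    + chainLength(failure) + " chained throwables");
        }
    }
}
```

If that is roughly what the transport layer does on retry, it would explain why each pair of traces below has two more causes than the previous pair.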
The first trace (on elastic1045) was:
[2016-06-30 08:34:20,553][WARN ][org.elasticsearch ] Exception cause unwrapping ran for 10 levels...
RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s][p]]]; nested: IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]];
[[11 lines of Caused by: RemoteTransportException]]
Caused by: [itwiki_general_1415230945][[itwiki_general_1415230945][2]] IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]]
at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:1062)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:593)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
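For reference, the "Exception cause unwrapping ran for 10 levels..." warning at the top of the trace suggests a depth-limited cause unwrap somewhere in the logging path. Below is a rough sketch of that kind of check; it is not the actual Elasticsearch helper, and the 10-level limit is only an assumption taken from the log message:

```java
// Rough sketch of a depth-limited cause unwrap; NOT the actual Elasticsearch
// helper. The 10-level limit is an assumption taken from the log message.
public final class UnwrapCauseSketch {
    private static final int MAX_LEVELS = 10;

    static Throwable unwrapCause(Throwable t) {
        Throwable result = t;
        for (int level = 0; level < MAX_LEVELS; level++) {
            Throwable cause = result.getCause();
            if (cause == null || cause == result) {
                return result; // reached the root cause within the limit
            }
            result = cause;
        }
        // Chain is deeper than the limit: warn, as seen in the log above.
        System.err.println("Exception cause unwrapping ran for " + MAX_LEVELS + " levels...");
        return result;
    }

    public static void main(String[] args) {
        Throwable t = new RuntimeException("root cause");
        for (int i = 0; i < 15; i++) {
            t = new RuntimeException("wrapper " + i, t);
        }
        System.out.println(unwrapCause(t).getMessage()); // stops on a wrapper and warns
    }
}
```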
The second one (same root cause) appeared a few milliseconds later, also with 12 causes.
The third and fourth ones had 14 causes, the fifth and sixth 16, and so on: each round trip between the two nodes appears to add two more RemoteTransportException wrappers.
The last one I saw had 1982 chained causes.
The logs were nearly the same on elastic1036 (the master); the two nodes generated 27 GB of logs in a few minutes.
Surprisingly, the cluster was still performing relatively well, albeit with higher GC activity on these nodes.
Then, maybe an hour after the first trace, elastic1045 was dropped from the cluster:
[2016-06-30 09:48:25,953][INFO ][discovery.zen ] [elastic1045] master_left [{elastic1036}{DUOG0aGqQ3Gajr_wcFTOyw}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
It was immediately re-added and the log flood stopped.
I'll comment on this ticket if it happens again.