Elasticsearch version: Version: 5.6.1, Build: 667b497/2017-09-14T19:22:05.189Z
Plugins installed: []
JVM version: 1.8.0_144
OS version (uname -a if on a Unix-like system):
Linux NodeB 4.4.0-1013-aws #22-Ubuntu SMP Fri Mar 31 15:41:31 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
We have a cluster with 6 nodes.
Probably due to network flakiness, some nodes lost their connection to each other for several minutes.
At least 2 nodes (NodeB and NodeA) lost their connection to the master.
The cluster went red and stayed red even after all the nodes had rejoined the cluster.
[2017-10-06T01:42:07,068][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index_1][1]] ...]).
[2017-10-06T01:35:35,929][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index_2][0]] ...]).
[2017-10-06T01:21:20,590][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [YELLOW] to [RED] (reason: [{NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2017-10-06T01:20:50,553][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]).
When executing a /_cat/indices request:
[2017-10-06T01:33:27,125][WARN ][r.suppressed ] path: /_cat/indices, params: {}
java.lang.NullPointerException: null
at org.elasticsearch.rest.action.cat.RestIndicesAction.buildTable(RestIndicesAction.java:368) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:116) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:113) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:47) [elasticsearch-5.6.1.jar:5.6.1]
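For context, the request that hits this code path is just a plain cat-indices call; a minimal sketch of a reproduction while the cluster is red is below (host, port, and plain-HTTP access are assumptions, not taken from the report):

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch: send the same GET /_cat/indices request to a node while the
// cluster is red. With the bug present, the node logs the NullPointerException
// shown above and presumably returns an error response instead of the usual table.
// localhost:9200 is an assumption.
public class CatIndicesRepro {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/_cat/indices?v");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();
        System.out.println("HTTP " + status);
        InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (body != null) {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(body))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}
```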
The offending line is RestIndicesAction.java#L368.
A restart of node NodeA at 01:35:16 fixed the issue.
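The stack trace is consistent with per-index stats being missing while primaries are unassigned (i.e. while the cluster is red), so an unguarded per-index lookup dereferences null. This is only an assumed failure mode, sketched below with plain Java collections rather than the actual Elasticsearch classes:

```java
import java.util.HashMap;
import java.util.Map;

// Generic sketch of the assumed failure mode, NOT the Elasticsearch source:
// the cat table is built from every index known to the cluster state, but the
// stats response may have no entry for an index whose primaries are unassigned,
// so the per-index lookup can return null.
public class NullGuardSketch {

    static long docCount(Map<String, Long> statsByIndex, String index) {
        // Unguarded lookup (what the NPE suggests):
        // return statsByIndex.get(index).longValue();

        // Guarded lookup: treat missing stats as "unknown" instead of throwing.
        Long docs = statsByIndex.get(index);
        return docs == null ? -1L : docs.longValue();
    }

    public static void main(String[] args) {
        Map<String, Long> statsByIndex = new HashMap<>();
        statsByIndex.put("index_1", 42L);   // index with stats
        // "index_2" has no stats entry, e.g. because its primary is unassigned
        System.out.println(docCount(statsByIndex, "index_1"));  // 42
        System.out.println(docCount(statsByIndex, "index_2"));  // -1 instead of an NPE
    }
}
```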
Some relevant logs (not exhaustive):
[2017-10-06T01:35:56,904][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_1][2] received shard failed for shard id [[index_1][2]], allocation id [hLAZTvTCRWGJ_vBnpc5xbg], primary term [4], message [mark copy as stale]
[2017-10-06T01:35:35,981][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][0] received shard failed for shard id [[index_2][0]], allocation id [IvPPBlcRRnSQUA43s9v0qw], primary term [4], message [mark copy as stale]
[2017-10-06T01:35:10,053][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [xRch14rfR_OvfQHYvPul-g], primary term [2], message [mark copy as stale]
[2017-10-06T01:35:09,840][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [7JpkTXDnSR-Z54p3t9dlTQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]], failure [RemoteTransportException[[NodeC][IpNodeC:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [P], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ], state is [STARTED]]; ]
[2017-10-06T01:35:09,840][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [7JpkTXDnSR-Z54p3t9dlTQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]], failure [RemoteTransportException[[NodeC][IpNodeC:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [P], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ], state is [STARTED]]; ]
[2017-10-06T01:35:09,894][WARN ][o.e.a.b.TransportShardBulkAction] [NodeF] [[index_2][1]] failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]
[2017-10-06T01:35:04,107][WARN ][o.e.a.b.TransportShardBulkAction] [NodeD] [[index_2][0]] failed to perform indices:data/write/bulk[s] on replica [index_2][0], node[l-TN-YQMThO8V_srAwknTg], [R], s[STARTED], a[id=IvPPBlcRRnSQUA43s9v0qw]
[2017-10-06T01:21:20,553][WARN ][o.e.d.z.PublishClusterStateAction] [NodeC] timed out waiting for all nodes to process published state [423] (timeout [30s], pending nodes: [{NodeD}{of6-ePXOT6uGk5TDKS1h-A}{IGu1YUCSRNiPOUgcq8HClw}{IpNodeD}{IpNodeD:9300}{availability_zone=us-east-1c, tag=fresh}, {NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo}, {NodeE}{_2uc635bS66TcqHVXjWpLA}{SzgLC8b0SpegMwaKLkPhgA}{IpNodeE}{IpNodeE:9300}{availability_zone=us-east-1a, tag=histo}])
[2017-10-06T01:21:20,594][INFO ][o.e.c.s.ClusterService ] [NodeF] removed {{NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo},}, reason: zen-disco-receive(from master [master {NodeC}{0dPW5AaBR--KS7JRNB32yA}{bvYMHcw-QZ6xTN8SMaaMHw}{IpNodeC}{IpNodeC:9300}{availability_zone=us-east-1b, tag=fresh} committed version [424]])
[2017-10-06T01:21:20,579][WARN ][o.e.c.s.ClusterService ] [NodeC] cluster state update task [zen-disco-node-failed({NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]] took [30s] above the warn threshold of 30s