
Investigating TestMutationMdiToDedicated failure #614


Description

@barkbay

The e2e test TestMutationMdiToDedicated seems to fail consistently. On my laptop I get dozens of errors like the one below during the step Cluster_health_should_not_have_been_red_during_mutation_process (the name of the step is misleading, as any error, e.g. a timeout, triggers a failure):

Cluster health check failure at 2019-04-09 09:11:08.481942 +0200 CEST m=+232.212360466: Get http://test-mutation-mdi-to-dedicated-es.e2e.svc.cluster.local:9200/_cluster/health: no pod addresses found in service endpoints
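For context, the failing step is essentially a loop that repeatedly hits the _cluster/health endpoint through the Kubernetes service and records any error or red status. The sketch below is a rough Go approximation of such a loop, not the actual test code; the real client resolves the service endpoints itself, which is presumably where the "no pod addresses found in service endpoints" error comes from.

```go
// Rough sketch of a continuous cluster health check (illustrative only).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type clusterHealth struct {
	Status string `json:"status"`
}

// checkHealth returns an error if the request fails or the cluster is red.
func checkHealth(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return err
	}
	if health.Status == "red" {
		return fmt.Errorf("cluster is red")
	}
	return nil
}

func main() {
	url := "http://test-mutation-mdi-to-dedicated-es.e2e.svc.cluster.local:9200/_cluster/health"
	for range time.Tick(1 * time.Second) {
		if err := checkHealth(url); err != nil {
			fmt.Printf("Cluster health check failure at %s: %v\n", time.Now(), err)
		}
	}
}
```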

I have tried to understand why the test is failing; here is what I found:

The first interesting thing is that the operator may remove the last pod in a Ready state (from a Kubernetes point of view), leaving the Kubernetes service without any endpoint to route requests to:


I guess this is because the Kubernetes state is not taken into account when the operator checks whether it can delete a pod:

https://github.com/elastic/k8s-operators/blob/bac6e037efac300d51c0f68c71942121ba7627ef/operators/pkg/controller/elasticsearch/mutation/podrestrictions.go#L50-L55

A first question is: should we remove a node that is the last one in the Ready state? A check along the lines of the sketch below could prevent that.
Note that this behavior is mitigated when some data is present: data migration gives a node some time to move into the Ready state.
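As a minimal sketch of what "taking the Kubernetes state into account" could look like (names are illustrative, not the operator's actual API): refuse to delete a pod if it is the last one whose Ready condition is True.

```go
// Hypothetical helper: do not delete the last Ready pod behind the service.
package podsafety

import corev1 "k8s.io/api/core/v1"

// isReady returns true if the pod's Ready condition is True.
func isReady(pod corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// canDelete returns false if candidate is the only remaining Ready pod,
// which would leave the service without any endpoint.
func canDelete(candidate corev1.Pod, allPods []corev1.Pod) bool {
	if !isReady(candidate) {
		return true
	}
	readyOthers := 0
	for _, p := range allPods {
		if p.Name != candidate.Name && isReady(p) {
			readyOthers++
		}
	}
	return readyOthers > 0
}
```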

But the real problem is that when the former master leaves the cluster, it can take more than a minute for the cluster to recover:

[2019-04-09T06:48:17,110][INFO ][o.e.d.z.ZenDiscovery     ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] master_left [{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}], reason [shut_down]
[2019-04-09T06:48:17,111][WARN ][o.e.d.z.ZenDiscovery     ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] master left (reason = shut_down), current nodes: nodes: 
   {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}, local
   {test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}, master
   {test-mutation-mdi-to-dedicated-es-9666rsgvfl}{Ki4h1ZkrTqqI3BXPtRqJBw}{1fofStBBTR6RmySl29HyBw}{172.17.0.12}{172.17.0.12:9300}{xpack.installed=true}

[2019-04-09T06:48:20,219][INFO ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}
[2019-04-09T06:48:47,943][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:48:50,236][WARN ][o.e.d.z.PublishClusterStateAction] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] timed out waiting for all nodes to process published state [19] (timeout [30s], pending nodes: [{test-mutation-mdi-to-dedicated-es-9666rsgvfl}{Ki4h1ZkrTqqI3BXPtRqJBw}{1fofStBBTR6RmySl29HyBw}{172.17.0.12}{172.17.0.12:9300}{xpack.installed=true}])
[2019-04-09T06:48:50,237][WARN ][o.e.d.z.PublishClusterStateAction] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] publishing cluster state with version [19] failed for the following nodes: [[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}]]
[2019-04-09T06:48:50,239][INFO ][o.e.c.s.ClusterApplierService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] new_master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}, reason: apply cluster state (from master [master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true} committed version [19] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2019-04-09T06:48:51,444][WARN ][o.e.c.NodeConnectionsService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] failed to connect to node {test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [test-mutation-mdi-to-dedicated-es-j7cxq7f4nc][172.17.0.10:9300] connect_exception
...
[2019-04-09T06:48:57,780][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:49:07,842][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:49:10,037][WARN ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] cluster state update task [zen-disco-elected-as-master ([0] nodes joined)] took [49.9s] above the warn threshold of 30s
[2019-04-09T06:49:10,039][INFO ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] zen-disco-node-failed({test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}), reason(transport disconnected)[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} transport disconnected], reason: removed {{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true},}
[2019-04-09T06:49:20,405][INFO ][o.e.c.s.ClusterApplierService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] removed {{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true} committed version [20] source [zen-disco-node-failed({test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}), reason(transport disconnected)[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} transport disconnected]]])

I'm not sure I understand how this test has been successful in the past.
I think we should stop increasing the timeout and instead try to evaluate the downtime. If the recovery of the cluster can't be optimized, we should decide that, for instance, it is OK for this test to tolerate a downtime of 90 seconds.
If after 90 seconds there are still timeouts, or the cluster is red, then the test should be considered failed.
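A minimal sketch of that idea, assuming a 90-second downtime budget (the threshold and names are assumptions, not existing test code): instead of failing on the first health check error, track the longest continuous window of failures and only fail the test if it exceeds the budget.

```go
// Hypothetical downtime tracker for the e2e health check step.
package downtime

import "time"

const allowedDowntime = 90 * time.Second // assumption: acceptable recovery time for this mutation

// Tracker records the longest continuous period during which health checks failed.
type Tracker struct {
	failingSince *time.Time
	maxOutage    time.Duration
}

// Observe is called after every health check with its outcome.
func (t *Tracker) Observe(now time.Time, healthy bool) {
	if healthy {
		t.failingSince = nil
		return
	}
	if t.failingSince == nil {
		start := now
		t.failingSince = &start
	}
	if outage := now.Sub(*t.failingSince); outage > t.maxOutage {
		t.maxOutage = outage
	}
}

// Violated reports whether the downtime budget was exceeded during the mutation.
func (t *Tracker) Violated() bool {
	return t.maxOutage > allowedDowntime
}
```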
