
Investigating TestMutationMdiToDedicated failure #614


Description

@barkbay

The e2e test TestMutationMdiToDedicated seems to fail consistently. On my laptop I get dozens of errors like the one below during the step Cluster_health_should_not_have_been_red_during_mutation_process (the name of the step is misleading, as any error, e.g. a timeout, triggers a failure):

Cluster health check failure at 2019-04-09 09:11:08.481942 +0200 CEST m=+232.212360466: Get http://test-mutation-mdi-to-dedicated-es.e2e.svc.cluster.local:9200/_cluster/health: no pod addresses found in service endpoints
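For context, the failing step is essentially a loop that repeatedly hits the _cluster/health endpoint through the Kubernetes service and records any error or red status. The sketch below is a rough Go approximation of such a loop, not the actual test code; the real client resolves the service endpoints itself, which is presumably where the "no pod addresses found in service endpoints" error comes from.

```go
// Rough sketch of a continuous cluster health check (illustrative only).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type clusterHealth struct {
	Status string `json:"status"`
}

// checkHealth returns an error if the request fails or the cluster is red.
func checkHealth(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return err
	}
	if health.Status == "red" {
		return fmt.Errorf("cluster is red")
	}
	return nil
}

func main() {
	url := "http://test-mutation-mdi-to-dedicated-es.e2e.svc.cluster.local:9200/_cluster/health"
	for range time.Tick(1 * time.Second) {
		if err := checkHealth(url); err != nil {
			fmt.Printf("Cluster health check failure at %s: %v\n", time.Now(), err)
		}
	}
}
```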

I have tried to understand why the test is failing; here is what I found:

The first interesting thing is that the operator may remove the last pod in a Ready state (from a Kubernetes point of view), leaving the Kubernetes service without any endpoint to route requests to:


I guess this is because the Kubernetes state is not taken into account when the operator checks whether it can delete a pod:

https://github.com/elastic/k8s-operators/blob/bac6e037efac300d51c0f68c71942121ba7627ef/operators/pkg/controller/elasticsearch/mutation/podrestrictions.go#L50-L55

A first question is: should we remove a node that is the last one in the Ready state? A check along the lines of the sketch below could prevent that.
Note that this behavior is mitigated when some data is present: data migration gives a node some time to move into the Ready state.
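As a minimal sketch of what "taking the Kubernetes state into account" could look like (names are illustrative, not the operator's actual API): refuse to delete a pod if it is the last one whose Ready condition is True.

```go
// Hypothetical helper: do not delete the last Ready pod behind the service.
package podsafety

import corev1 "k8s.io/api/core/v1"

// isReady returns true if the pod's Ready condition is True.
func isReady(pod corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// canDelete returns false if candidate is the only remaining Ready pod,
// which would leave the service without any endpoint.
func canDelete(candidate corev1.Pod, allPods []corev1.Pod) bool {
	if !isReady(candidate) {
		return true
	}
	readyOthers := 0
	for _, p := range allPods {
		if p.Name != candidate.Name && isReady(p) {
			readyOthers++
		}
	}
	return readyOthers > 0
}
```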

But the real problem is that when the former master leaves the cluster, it can take more than a minute for the cluster to recover:

[2019-04-09T06:48:17,110][INFO ][o.e.d.z.ZenDiscovery     ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] master_left [{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}], reason [shut_down]
[2019-04-09T06:48:17,111][WARN ][o.e.d.z.ZenDiscovery     ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] master left (reason = shut_down), current nodes: nodes: 
   {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}, local
   {test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}, master
   {test-mutation-mdi-to-dedicated-es-9666rsgvfl}{Ki4h1ZkrTqqI3BXPtRqJBw}{1fofStBBTR6RmySl29HyBw}{172.17.0.12}{172.17.0.12:9300}{xpack.installed=true}

[2019-04-09T06:48:20,219][INFO ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}
[2019-04-09T06:48:47,943][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:48:50,236][WARN ][o.e.d.z.PublishClusterStateAction] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] timed out waiting for all nodes to process published state [19] (timeout [30s], pending nodes: [{test-mutation-mdi-to-dedicated-es-9666rsgvfl}{Ki4h1ZkrTqqI3BXPtRqJBw}{1fofStBBTR6RmySl29HyBw}{172.17.0.12}{172.17.0.12:9300}{xpack.installed=true}])
[2019-04-09T06:48:50,237][WARN ][o.e.d.z.PublishClusterStateAction] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] publishing cluster state with version [19] failed for the following nodes: [[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}]]
[2019-04-09T06:48:50,239][INFO ][o.e.c.s.ClusterApplierService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] new_master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}, reason: apply cluster state (from master [master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true} committed version [19] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2019-04-09T06:48:51,444][WARN ][o.e.c.NodeConnectionsService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] failed to connect to node {test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [test-mutation-mdi-to-dedicated-es-j7cxq7f4nc][172.17.0.10:9300] connect_exception
...
[2019-04-09T06:48:57,780][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:49:07,842][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:49:10,037][WARN ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] cluster state update task [zen-disco-elected-as-master ([0] nodes joined)] took [49.9s] above the warn threshold of 30s
[2019-04-09T06:49:10,039][INFO ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] zen-disco-node-failed({test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}), reason(transport disconnected)[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} transport disconnected], reason: removed {{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true},}
[2019-04-09T06:49:20,405][INFO ][o.e.c.s.ClusterApplierService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] removed {{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true} committed version [20] source [zen-disco-node-failed({test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}), reason(transport disconnected)[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} transport disconnected]]])

I'm not sure I understand how this test has been successful in the past.
I think we should stop increasing the timeout and instead try to evaluate the downtime. If the recovery of the cluster can't be optimized, we should decide that, for instance, it is OK for this test to tolerate a downtime of 90 seconds.
If after 90 seconds there are still timeouts, or the cluster is red, then the test should be considered failed.
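A minimal sketch of that idea, assuming a 90-second downtime budget (the threshold and names are assumptions, not existing test code): instead of failing on the first health check error, track the longest continuous window of failures and only fail the test if it exceeds the budget.

```go
// Hypothetical downtime tracker for the e2e health check step.
package downtime

import "time"

const allowedDowntime = 90 * time.Second // assumption: acceptable recovery time for this mutation

// Tracker records the longest continuous period during which health checks failed.
type Tracker struct {
	failingSince *time.Time
	maxOutage    time.Duration
}

// Observe is called after every health check with its outcome.
func (t *Tracker) Observe(now time.Time, healthy bool) {
	if healthy {
		t.failingSince = nil
		return
	}
	if t.failingSince == nil {
		start := now
		t.failingSince = &start
	}
	if outage := now.Sub(*t.failingSince); outage > t.maxOutage {
		t.maxOutage = outage
	}
}

// Violated reports whether the downtime budget was exceeded during the mutation.
func (t *Tracker) Violated() bool {
	return t.maxOutage > allowedDowntime
}
```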
