I'm opening this issue while working on PVC reuse/rolling upgrades (#312), which is not merged yet, but it seemed important to have this separate discussion.
I observe a short period of downtime in the cluster during the rolling upgrade process, while the master node is being restarted (we restart it last). There is no downtime when the other nodes are restarted.
During the master node's restart, requests to Elasticsearch (a 3 x v7.1 mdi-node cluster, with 2/3 nodes still alive and the master down) return:
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
Master election never happens, even though 2/3 master-eligible nodes are alive, until the restarted master node gets back into the cluster a few seconds later (restart over).
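For reference, here is a minimal sketch of how that downtime window can be observed, assuming a local port-forward to the cluster and hypothetical credentials (this is just a probe, not operator code):

```go
// Minimal probe: hit the cluster health endpoint every second and print the
// HTTP status, so the 503/master_not_discovered_exception window during the
// rolling upgrade becomes visible in the output.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical endpoint and credentials, e.g. via a local port-forward
	// to the elasticsearch-sample service; adjust for your own cluster.
	const url = "https://localhost:9200/_cluster/health"

	client := &http.Client{
		Timeout: 2 * time.Second,
		// The sample cluster uses a self-signed certificate.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	for {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			panic(err)
		}
		req.SetBasicAuth("elastic", "changeme") // hypothetical password

		if resp, err := client.Do(req); err != nil {
			fmt.Println(time.Now().Format(time.RFC3339), "request failed:", err)
		} else {
			fmt.Println(time.Now().Format(time.RFC3339), "status:", resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```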
These are the errors we can see in the logs of one of the 2 remaining Elasticsearch instances:
{"type": "server", "timestamp": "2019-06-24T12:28:32,179+0000", "level": "DEBUG", "component": "o.e.a.a.c.n.i.TransportNodesInfoAction", "cluster.name": "elasticsearch-sample", "node.name": "elasticsearch-sample-es-x4dl6l8vkn", "cluster.uuid": "IdHEH4LNQR6XiIF05_hZMQ", "node.id": "_HmKAToxSjOexs-qdN1LrQ", "message": "failed to execute on node [XF9dsvmJQ9iew2
dUwQd6Bw]" ,
"stacktrace": ["org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-sample-es-mkt2kvbpgs][10.16.0.56:9300] Node not connected",
"at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151) ~[elasticsearch-7.1.0.jar:7.1.0]" ...
{"type": "server", "timestamp": "2019-06-24T12:28:32,876+0000", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elasticsearch-sample", "node.name": "elasticsearch-sample-es-x4dl6l8vkn", "cluster.uuid": "IdHEH4LNQR6XiIF05_hZMQ", "node.id": "_HmKAToxSjOexs-qdN1LrQ", "message": "master not discovered or elected yet, an election requires a node with id [XF9dsvmJQ9iew2dUwQd6Bw], have discovered [{elasticsearch-sample-es-lb7qc4g6r6}{1-faHCEOSo2-eZDsV-YixA}{CKgONMtLR723Pc8n_3p0Zg}{10.16.1.58}{10.16.1.58:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true, foo=bar}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, 10.16.1.58:9300] from hosts providers and [{elasticsearch-sample-es-mkt2kvbpgs}{XF9dsvmJQ9iew2dUwQd6Bw}{dw9R1PbTSMW7ah2r_mVk6A}{10.16.0.56}{10.16.0.56:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true, foo=bar}, {elasticsearch-sample-es-x4dl6l8vkn}{_HmKAToxSjOexs-qdN1LrQ}{4eL_Fhr5TcmUC9xy076qWQ}{10.16.1.57}{10.16.1.57:9300}{ml.machine_memory=2147483648, xpack.installed=true, foo=bar, ml.max_open_jobs=20}, {elasticsearch-sample-es-lb7qc4g6r6}{1-faHCEOSo2-eZDsV-YixA}{CKgONMtLR723Pc8n_3p0Zg}{10.16.1.58}{10.16.1.58:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true, foo=bar}] from last-known cluster state; node term 14, last-accepted version 759 in term 14" }
The two remaining nodes seem to complain that the third node (the master whose restart is in progress) is not available.
Debugging this a bit more, I realised this is related to the way we manage the discovery.seed_providers file. In this file, we inject each master node's IP (Kubernetes pod IP) on every reconciliation loop: we simply inspect the current pods in the cluster and, if they are master-eligible, append their IPs to that file, which then gets propagated to all nodes in the cluster.
Very soon after stopping the master node (deleting the pod but keeping its data volume around), its IP address is also deleted from that file. From our perspective, there is no reason to keep it around: that "old" IP does not make sense anymore, and when recreated (with the same data) the pod will probably get assigned a new IP.
So we first create the pod, then, as soon as it has an IP available, we inject that IP into the file. At that point the situation gets unblocked and master election can proceed.
However, during the whole time the pod is being restarted and its IP is missing from the seed hosts discovery file, the cluster is unavailable.
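To make the mechanism above more concrete, here is a rough Go sketch of how the seed hosts content could be derived from the current pods on each reconciliation. It is an illustration only: the label name and port are assumptions, and the real operator code differs.

```go
// Illustration of the reconciliation behaviour described above: only
// master-eligible pods that currently have an IP end up in the seed hosts
// file, so a restarting master silently drops out of discovery.
package discovery

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// seedHostsFileContent builds the content that gets propagated to every
// node's file-based seed hosts provider. The label name and transport port
// below are illustrative assumptions, not the operator's actual values.
func seedHostsFileContent(pods []corev1.Pod) string {
	var hosts []string
	for _, pod := range pods {
		if pod.Labels["elasticsearch.k8s.elastic.co/master"] != "true" {
			continue // not master-eligible
		}
		if pod.Status.PodIP == "" {
			continue // pod just (re)created, no IP assigned yet: absent from the file
		}
		hosts = append(hosts, pod.Status.PodIP+":9300")
	}
	return strings.Join(hosts, "\n")
}
```

With this kind of logic, the restarting master drops out of the file as soon as its pod is deleted, which matches the behaviour described above.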
If I "manually" delete the pod but keep its IP (which does not make sense anymore) in the discovery.seed_providers file, a new master gets elected instantly among the 2 remaining nodes.
I'm wondering if:
- this is expected, and we should do whatever's necessary in the operator to avoid that situation
- this is not expected, and something that should be fixed in Elasticsearch (in other words: if 2/3 master-eligible nodes are in the cluster, a leader election should happen even if the 3rd one has disappeared from hosts discovery)