
MinimumMasterNodesIT fails due to connection timeout to shut-down node #32552

@DaveCTurner


The build at https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/6775/console includes a failure of o.e.c.MinimumMasterNodesIT.testMultipleNodesShutdownNonMasterNodes that was ultimately caused by a connection timeout to a node that had been shut down.

In the beginning there are three nodes: node_t0 (the master), node_t1, and node_t3, and minimum_master_nodes is 3. node_t3 is then shut down:

  1> [2018-08-01T06:44:09,205][INFO ][o.e.t.InternalTestCluster] [testMultipleNodesShutdownNonMasterNodes] Closing random non master node [node_t3] current master [node_t0] 
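
For context, the scenario at this point boils down to something like the following (a simplified sketch using the usual ESIntegTestCase/InternalTestCluster helpers, not the exact MinimumMasterNodesIT code):

    // Simplified sketch (inside an ESIntegTestCase subclass); node names and counts are only
    // illustrative of the point in the log where node_t0, node_t1 and node_t3 remain.
    Settings settings = Settings.builder()
        .put("discovery.zen.minimum_master_nodes", 3) // require three master-eligible nodes
        .build();
    internalCluster().startNodes(3, settings);
    internalCluster().stopRandomNonMasterNode();      // closes node_t3, producing the log line above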

The master notices and updates its cluster state appropriately:

  1> [2018-08-01T06:44:09,209][TRACE][o.e.c.s.MasterService    ] [node_t0] will process [zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]
  1> [2018-08-01T06:44:09,209][TRACE][o.e.c.s.MasterService    ] [node_t0] will process [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected)]
  1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.MasterService    ] [node_t0] processing [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected), zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]: execute
  1> [2018-08-01T06:44:09,209][WARN ][o.e.t.d.TestZenDiscovery ] [node_t0] not enough master nodes (has [2], but needed [3]), current nodes: nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, local, master
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
  1> [2018-08-01T06:44:09,209][TRACE][o.e.d.z.NodeJoinController] [node_t0] starting an election context, will accumulate joins
  1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.MasterService    ] [node_t0] processing [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected), zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]: took [0s] no change in cluster state
  1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.ClusterApplierService] [node_t0] processing [not enough master nodes (has [2], but needed [3])]: execute
  1> [2018-08-01T06:44:09,210][TRACE][o.e.c.s.ClusterApplierService] [node_t0] cluster state updated, source [not enough master nodes (has [2], but needed [3])]
  1> cluster uuid: FZaCIUzwQQSJO1qlrDq-Rg
  1> version: 9
  1> state uuid: 9P8fB76MTe60o3DdFRiAVg
  1> from_diff: false
  1> meta data version: 6
  1>    [test/xM1OBH02QE2z4aIJH9p5Ug]: v[6]
  1>       0: p_term [2], isa_ids [J_FCA5HgQ1C-7Zz6utjEyQ]
  1>       1: p_term [1], isa_ids [GUI_NyhHSrCqcVxnkCEVMg]
  1>       2: p_term [1], isa_ids [oIGUcqa_TaS5tPLryOUPBA]
  1>       3: p_term [1], isa_ids [Yvg8EnkHSy-B9AOionROSw]
  1>       4: p_term [2], isa_ids [zQZW2002TsGtb-AWpIeiPg]
  1> metadata customs:
  1>    index-graveyard: IndexGraveyard[[]]
  1> blocks: 
  1>    _global_:
  1>       2,no master, blocks WRITE,METADATA_WRITE
  1> nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, local
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
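
The "not enough master nodes" decision is just a comparison of the remaining master-eligible nodes against the configured minimum. Schematically, with generic names rather than the actual ZenDiscovery/ElectMasterService internals:

    import java.util.Arrays;
    import java.util.List;

    // Schematic only: not the real discovery code, just the arithmetic behind the warning above.
    public class MinimumMasterNodesCheck {
        static boolean hasEnoughMasterNodes(List<String> masterEligibleNodes, int minimumMasterNodes) {
            return masterEligibleNodes.size() >= minimumMasterNodes;
        }

        public static void main(String[] args) {
            List<String> remaining = Arrays.asList("node_t0", "node_t1"); // node_t3 has just left
            int minimumMasterNodes = 3;
            if (hasEnoughMasterNodes(remaining, minimumMasterNodes) == false) {
                // The master steps down: it installs the no-master block (the "2,no master" block in
                // the state above) and starts accumulating joins for a new election.
                System.out.printf("not enough master nodes (has [%d], but needed [%d])%n",
                    remaining.size(), minimumMasterNodes);
            }
        }
    }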

Shortly (700ms) later, node_t1 notices that node_t0 is no longer the master, and also updates its cluster state:

  1> [2018-08-01T06:44:09,947][DEBUG][o.e.d.z.MasterFaultDetection] [node_t1] [master] pinging a master {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729} that is no longer a master
  1> [2018-08-01T06:44:09,947][DEBUG][o.e.d.z.MasterFaultDetection] [node_t1] [master] stopping fault detection against master [{node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}], reason [master failure, no longer master]
  1> [2018-08-01T06:44:09,947][INFO ][o.e.t.d.TestZenDiscovery ] [node_t1] master_left [{node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}], reason [no longer master]
  1> org.elasticsearch.transport.RemoteTransportException: [node_t0][127.0.0.1:53240][internal:discovery/zen/fd/master_ping]
  1> Caused by: org.elasticsearch.cluster.NotMasterException: local node is not master
  1> [2018-08-01T06:44:09,948][WARN ][o.e.t.d.TestZenDiscovery ] [node_t1] master left (reason = no longer master), current nodes: nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, master
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}, local
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
  1> [2018-08-01T06:44:09,948][TRACE][o.e.d.z.NodeJoinController] [node_t1] starting an election context, will accumulate joins
  1> [2018-08-01T06:44:09,948][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] processing [master left (reason = no longer master)]: execute
  1> [2018-08-01T06:44:09,948][TRACE][o.e.c.s.ClusterApplierService] [node_t1] cluster state updated, source [master left (reason = no longer master)]
  1> cluster uuid: FZaCIUzwQQSJO1qlrDq-Rg
  1> version: 9
  1> state uuid: 9P8fB76MTe60o3DdFRiAVg
  1> from_diff: false
  1> meta data version: 6
  1>    [test/xM1OBH02QE2z4aIJH9p5Ug]: v[6]
  1>       0: p_term [2], isa_ids [J_FCA5HgQ1C-7Zz6utjEyQ]
  1>       1: p_term [1], isa_ids [GUI_NyhHSrCqcVxnkCEVMg]
  1>       2: p_term [1], isa_ids [oIGUcqa_TaS5tPLryOUPBA]
  1>       3: p_term [1], isa_ids [Yvg8EnkHSy-B9AOionROSw]
  1>       4: p_term [2], isa_ids [zQZW2002TsGtb-AWpIeiPg]
  1> metadata customs:
  1>    index-graveyard: IndexGraveyard[[]]
  1> blocks: 
  1>    _global_:
  1>       2,no master, blocks WRITE,METADATA_WRITE
  1> nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}, local
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}

However, the test busy-waits for up to 20 seconds for both remaining nodes to report a NO_MASTER_BLOCK in their cluster state, and node_t1 never does:

...
  1> [2018-08-01T06:44:27,271][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@109a20f8 NO_MASTER_BLOCK: [true]
  1> [2018-08-01T06:44:27,272][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@33e1bfab NO_MASTER_BLOCK: [false]
  1> [2018-08-01T06:44:29,250][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@109a20f8 NO_MASTER_BLOCK: [true]
  1> [2018-08-01T06:44:29,250][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@33e1bfab NO_MASTER_BLOCK: [false]
  1> [2018-08-01T06:44:29,251][INFO ][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] [MinimumMasterNodesIT#testMultipleNodesShutdownNonMasterNodes]: cleaning up after test
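
The check the test is busy-waiting on is essentially the following (a sketch as a helper inside an ESIntegTestCase subclass, imports elided; the real helper in MinimumMasterNodesIT is shaped slightly differently, but it polls each node's locally-applied state for the no-master block in the same way):

    // Sketch of the busy-wait: ask each node for its own *applied* cluster state (setLocal(true)),
    // which is exactly the state that node_t1 is slow to update here.
    private void assertNoMasterBlockOnAllNodes() throws Exception {
        assertBusy(() -> {
            for (Client client : internalCluster().getClients()) {
                ClusterState state = client.admin().cluster().prepareState()
                    .setLocal(true)
                    .get().getState();
                boolean hasBlock = state.blocks().global().stream()
                    .anyMatch(block -> block.id() == DiscoverySettings.NO_MASTER_BLOCK_ID);
                logger.debug("Checking for NO_MASTER_BLOCK on client: {} NO_MASTER_BLOCK: [{}]", client, hasBlock);
                assertTrue(hasBlock);
            }
        }, 20, TimeUnit.SECONDS);
    }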

It's a further 10 seconds (i.e. 30 seconds since the cluster state application started) before node_t1 completes the cluster state application:

  1> [2018-08-01T06:44:39,949][WARN ][o.e.c.NodeConnectionsService] [node_t1] failed to connect to node {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257} (tried [1] times)
  1> org.elasticsearch.transport.ConnectTransportException: [node_t3][127.0.0.1:33257] connect_timeout[30s]
  1>  at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:114) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:646) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:545) ~[main/:?]
  1>  at org.elasticsearch.test.transport.MockTransportService$LookupTestTransport.connectToNode(MockTransportService.java:518) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>  at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:330) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:317) ~[main/:?]
  1>  at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [main/:?]
  1>  at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:106) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
  1> [2018-08-01T06:44:39,949][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] applying cluster state version 9
  1> [2018-08-01T06:44:39,949][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] apply cluster state with version 9
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.indices.cluster.IndicesClusterStateService@7eb553a5] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.script.ScriptService@32b8143d] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.ingest.PipelineStore@4f81ef5d] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.ingest.PipelineExecutionService@4a66f678] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.gateway.GatewayAllocator$$Lambda$1110/1454779260@1e9aacea] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.repositories.RepositoriesService@6cf15af8] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.snapshots.RestoreService@2ccabb3c] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.action.ingest.IngestActionForwarder@51625deb] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.tasks.TaskManager@1ad456d1] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.snapshots.SnapshotsService@7c94e9e9] with change to version [9]
  1> [2018-08-01T06:44:39,950][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.gateway.GatewayMetaState@5feae1c] with change to version [9]
  1> [2018-08-01T06:44:39,950][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] set locally applied cluster state to version 9
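
The reason the connect attempt delays the application is the ordering: before the appliers run (and hence before the no-master block becomes visible to a local cluster state request), the node waits for a connection attempt to every node in the incoming state, including the departed node_t3, and the log above shows that wait lasting the full 30-second connect attempt. A self-contained illustration of that pattern, with generic names rather than the real ClusterApplierService/NodeConnectionsService code:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Generic illustration of "connect to every node in the new state, then apply it": a single
    // slow connect attempt (the 30s timeout to the dead node_t3) holds up the whole application.
    public class ApplyAfterConnectSketch {
        public static void main(String[] args) throws InterruptedException {
            List<String> nodesInNewState = Arrays.asList("node_t0", "node_t1", "node_t3");
            ExecutorService executor = Executors.newFixedThreadPool(nodesInNewState.size());
            CountDownLatch connectAttemptsDone = new CountDownLatch(nodesInNewState.size());

            for (String node : nodesInNewState) {
                executor.execute(() -> {
                    try {
                        if (node.equals("node_t3")) {
                            TimeUnit.SECONDS.sleep(30); // stands in for the 30s connect timeout
                        }                               // live nodes connect (or are refused) quickly
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        connectAttemptsDone.countDown();
                    }
                });
            }

            connectAttemptsDone.await(); // cluster state application is held up at this point
            System.out.println("applying cluster state version 9 (no-master block visible only now)");
            executor.shutdown();
        }
    }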

The holdup is the 30-second connection timeout (see #29022). But what's strange is that this connection shouldn't be timing out at all: it's a connection on localhost to a port that is supposed to be closed, so it should be refused immediately. The port is definitely closed, because we see earlier connection refusals to it:

  1> [2018-08-01T06:44:09,210][WARN ][o.e.c.NodeConnectionsService] [node_t0] failed to connect to node {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257} (tried [1] times)
  1> org.elasticsearch.transport.ConnectTransportException: [node_t3][127.0.0.1:33257] connect_exception
...
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
  1> [2018-08-01T06:44:09,212][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:33257}{HNCsZ8-xQyCjOELP9QoZIQ}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_exception
...
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
  1> [2018-08-01T06:44:09,544][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:33257}{OrUqi6YARTeE8zFQdPUeXQ}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_exception
...
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
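
An immediate refusal is also what a plain socket connect to a closed localhost port normally produces, which can be checked in isolation with something like this (the port number is taken from the log; the check itself is purely illustrative):

    import java.net.ConnectException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    // Standalone check: connecting to a localhost port with no listener should fail fast with
    // "Connection refused" rather than hanging until a timeout like the 30s one seen above.
    public class LocalhostConnectCheck {
        public static void main(String[] args) throws Exception {
            InetSocketAddress closedPort = new InetSocketAddress("127.0.0.1", 33257); // node_t3's old port
            long startNanos = System.nanoTime();
            try (Socket socket = new Socket()) {
                socket.connect(closedPort, 30_000); // 30s limit, mirroring the transport connect timeout
                System.out.println("unexpectedly connected");
            } catch (ConnectException e) {
                System.out.println("refused after " + elapsedMillis(startNanos) + "ms: " + e.getMessage());
            } catch (SocketTimeoutException e) {
                System.out.println("timed out after " + elapsedMillis(startNanos) + "ms");
            }
        }

        private static long elapsedMillis(long startNanos) {
            return (System.nanoTime() - startNanos) / 1_000_000;
        }
    }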

After these initial refusals, further connection attempts triggered by pinging time out as well:

  1> [2018-08-01T06:44:10,877][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:43597}{Aaa7SBg9SuejUTdLBu4hqQ}{127.0.0.1}{127.0.0.1:43597}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:43597] connect_timeout[1s]
  1>  at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:114) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:646) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:125) ~[main/:?]
  1>  at org.elasticsearch.test.transport.MockTransportService$LookupTestTransport.openConnection(MockTransportService.java:533) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>  at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:349) ~[main/:?]
  1>  at org.elasticsearch.test.transport.MockTransportService.openConnection(MockTransportService.java:813) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>  at org.elasticsearch.discovery.zen.UnicastZenPing$PingingRound.getOrConnect(UnicastZenPing.java:365) ~[main/:?]
  1>  at org.elasticsearch.discovery.zen.UnicastZenPing$3.doRun(UnicastZenPing.java:472) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
  1> [2018-08-01T06:44:10,877][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:33257}{hq5HYpQ7T86A4xKCxqscsQ}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_timeout[1s]
...
  1> [2018-08-01T06:44:10,949][TRACE][o.e.d.z.UnicastZenPing   ] [node_t1] [2] failed to ping {127.0.0.1:33257}{GgGLdU_aQei6je26MFUJ7A}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_timeout[1s]
...
  1> [2018-08-01T06:44:10,949][TRACE][o.e.d.z.UnicastZenPing   ] [node_t1] [2] failed to ping {127.0.0.1:43597}{E3isLEliQjK6hgh12unCQg}{127.0.0.1}{127.0.0.1:43597}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:43597] connect_timeout[1s]
...

Of course this doesn't reproduce locally, and I don't really have much idea what could lead to a connection on localhost timing out like this. Input and ideas for next steps would be appreciated.
