
MinimumMasterNodesIT fails due to connection timeout to shut-down node #32552

@DaveCTurner


The build at https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/6775/console includes a failure of o.e.c.MinimumMasterNodesIT.testMultipleNodesShutdownNonMasterNodes that was ultimately caused by a connection timeout to a node that had been shut down.

In the beginning there are three nodes: node_t0 (the master), node_t1, and node_t3, and minimum_master_nodes is 3. node_t3 is then shut down:

  1> [2018-08-01T06:44:09,205][INFO ][o.e.t.InternalTestCluster] [testMultipleNodesShutdownNonMasterNodes] Closing random non master node [node_t3] current master [node_t0] 
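
For context, the scenario at this point boils down to something like the following (a simplified sketch using the usual ESIntegTestCase/InternalTestCluster helpers, not the exact MinimumMasterNodesIT code):

    // Simplified sketch (inside an ESIntegTestCase subclass); node names and counts are only
    // illustrative of the point in the log where node_t0, node_t1 and node_t3 remain.
    Settings settings = Settings.builder()
        .put("discovery.zen.minimum_master_nodes", 3) // require three master-eligible nodes
        .build();
    internalCluster().startNodes(3, settings);
    internalCluster().stopRandomNonMasterNode();      // closes node_t3, producing the log line above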

The master notices and updates its cluster state appropriately:

  1> [2018-08-01T06:44:09,209][TRACE][o.e.c.s.MasterService    ] [node_t0] will process [zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]
  1> [2018-08-01T06:44:09,209][TRACE][o.e.c.s.MasterService    ] [node_t0] will process [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected)]
  1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.MasterService    ] [node_t0] processing [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected), zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]: execute
  1> [2018-08-01T06:44:09,209][WARN ][o.e.t.d.TestZenDiscovery ] [node_t0] not enough master nodes (has [2], but needed [3]), current nodes: nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, local, master
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
  1> [2018-08-01T06:44:09,209][TRACE][o.e.d.z.NodeJoinController] [node_t0] starting an election context, will accumulate joins
  1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.MasterService    ] [node_t0] processing [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected), zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]: took [0s] no change in cluster state
  1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.ClusterApplierService] [node_t0] processing [not enough master nodes (has [2], but needed [3])]: execute
  1> [2018-08-01T06:44:09,210][TRACE][o.e.c.s.ClusterApplierService] [node_t0] cluster state updated, source [not enough master nodes (has [2], but needed [3])]
  1> cluster uuid: FZaCIUzwQQSJO1qlrDq-Rg
  1> version: 9
  1> state uuid: 9P8fB76MTe60o3DdFRiAVg
  1> from_diff: false
  1> meta data version: 6
  1>    [test/xM1OBH02QE2z4aIJH9p5Ug]: v[6]
  1>       0: p_term [2], isa_ids [J_FCA5HgQ1C-7Zz6utjEyQ]
  1>       1: p_term [1], isa_ids [GUI_NyhHSrCqcVxnkCEVMg]
  1>       2: p_term [1], isa_ids [oIGUcqa_TaS5tPLryOUPBA]
  1>       3: p_term [1], isa_ids [Yvg8EnkHSy-B9AOionROSw]
  1>       4: p_term [2], isa_ids [zQZW2002TsGtb-AWpIeiPg]
  1> metadata customs:
  1>    index-graveyard: IndexGraveyard[[]]
  1> blocks: 
  1>    _global_:
  1>       2,no master, blocks WRITE,METADATA_WRITE
  1> nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, local
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
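
The "not enough master nodes" decision is just a comparison of the remaining master-eligible nodes against the configured minimum. Schematically, with generic names rather than the actual ZenDiscovery/ElectMasterService internals:

    import java.util.Arrays;
    import java.util.List;

    // Schematic only: not the real discovery code, just the arithmetic behind the warning above.
    public class MinimumMasterNodesCheck {
        static boolean hasEnoughMasterNodes(List<String> masterEligibleNodes, int minimumMasterNodes) {
            return masterEligibleNodes.size() >= minimumMasterNodes;
        }

        public static void main(String[] args) {
            List<String> remaining = Arrays.asList("node_t0", "node_t1"); // node_t3 has just left
            int minimumMasterNodes = 3;
            if (hasEnoughMasterNodes(remaining, minimumMasterNodes) == false) {
                // The master steps down: it installs the no-master block (the "2,no master" block in
                // the state above) and starts accumulating joins for a new election.
                System.out.printf("not enough master nodes (has [%d], but needed [%d])%n",
                    remaining.size(), minimumMasterNodes);
            }
        }
    }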

Shortly (700ms) later, node_t1 notices that node_t0 is no longer the master, and also updates its cluster state:

  1> [2018-08-01T06:44:09,947][DEBUG][o.e.d.z.MasterFaultDetection] [node_t1] [master] pinging a master {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729} that is no longer a master
  1> [2018-08-01T06:44:09,947][DEBUG][o.e.d.z.MasterFaultDetection] [node_t1] [master] stopping fault detection against master [{node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}], reason [master failure, no longer master]
  1> [2018-08-01T06:44:09,947][INFO ][o.e.t.d.TestZenDiscovery ] [node_t1] master_left [{node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}], reason [no longer master]
  1> org.elasticsearch.transport.RemoteTransportException: [node_t0][127.0.0.1:53240][internal:discovery/zen/fd/master_ping]
  1> Caused by: org.elasticsearch.cluster.NotMasterException: local node is not master
  1> [2018-08-01T06:44:09,948][WARN ][o.e.t.d.TestZenDiscovery ] [node_t1] master left (reason = no longer master), current nodes: nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, master
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}, local
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
  1> [2018-08-01T06:44:09,948][TRACE][o.e.d.z.NodeJoinController] [node_t1] starting an election context, will accumulate joins
  1> [2018-08-01T06:44:09,948][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] processing [master left (reason = no longer master)]: execute
  1> [2018-08-01T06:44:09,948][TRACE][o.e.c.s.ClusterApplierService] [node_t1] cluster state updated, source [master left (reason = no longer master)]
  1> cluster uuid: FZaCIUzwQQSJO1qlrDq-Rg
  1> version: 9
  1> state uuid: 9P8fB76MTe60o3DdFRiAVg
  1> from_diff: false
  1> meta data version: 6
  1>    [test/xM1OBH02QE2z4aIJH9p5Ug]: v[6]
  1>       0: p_term [2], isa_ids [J_FCA5HgQ1C-7Zz6utjEyQ]
  1>       1: p_term [1], isa_ids [GUI_NyhHSrCqcVxnkCEVMg]
  1>       2: p_term [1], isa_ids [oIGUcqa_TaS5tPLryOUPBA]
  1>       3: p_term [1], isa_ids [Yvg8EnkHSy-B9AOionROSw]
  1>       4: p_term [2], isa_ids [zQZW2002TsGtb-AWpIeiPg]
  1> metadata customs:
  1>    index-graveyard: IndexGraveyard[[]]
  1> blocks: 
  1>    _global_:
  1>       2,no master, blocks WRITE,METADATA_WRITE
  1> nodes: 
  1>    {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}
  1>    {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}, local
  1>    {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}

However, the test busy-waits for up to 20 seconds for both remaining nodes to report a NO_MASTER_BLOCK in their cluster state, and node_t1 never does:

...
  1> [2018-08-01T06:44:27,271][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@109a20f8 NO_MASTER_BLOCK: [true]
  1> [2018-08-01T06:44:27,272][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@33e1bfab NO_MASTER_BLOCK: [false]
  1> [2018-08-01T06:44:29,250][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@109a20f8 NO_MASTER_BLOCK: [true]
  1> [2018-08-01T06:44:29,250][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@33e1bfab NO_MASTER_BLOCK: [false]
  1> [2018-08-01T06:44:29,251][INFO ][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] [MinimumMasterNodesIT#testMultipleNodesShutdownNonMasterNodes]: cleaning up after test
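
The check the test is busy-waiting on is essentially the following (a sketch as a helper inside an ESIntegTestCase subclass, imports elided; the real helper in MinimumMasterNodesIT is shaped slightly differently, but it polls each node's locally-applied state for the no-master block in the same way):

    // Sketch of the busy-wait: ask each node for its own *applied* cluster state (setLocal(true)),
    // which is exactly the state that node_t1 is slow to update here.
    private void assertNoMasterBlockOnAllNodes() throws Exception {
        assertBusy(() -> {
            for (Client client : internalCluster().getClients()) {
                ClusterState state = client.admin().cluster().prepareState()
                    .setLocal(true)
                    .get().getState();
                boolean hasBlock = state.blocks().global().stream()
                    .anyMatch(block -> block.id() == DiscoverySettings.NO_MASTER_BLOCK_ID);
                logger.debug("Checking for NO_MASTER_BLOCK on client: {} NO_MASTER_BLOCK: [{}]", client, hasBlock);
                assertTrue(hasBlock);
            }
        }, 20, TimeUnit.SECONDS);
    }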

It's a further 10 seconds (i.e. 30 seconds since the cluster state application started) before node_t1 completes the cluster state application:

  1> [2018-08-01T06:44:39,949][WARN ][o.e.c.NodeConnectionsService] [node_t1] failed to connect to node {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257} (tried [1] times)
  1> org.elasticsearch.transport.ConnectTransportException: [node_t3][127.0.0.1:33257] connect_timeout[30s]
  1>  at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:114) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:646) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:545) ~[main/:?]
  1>  at org.elasticsearch.test.transport.MockTransportService$LookupTestTransport.connectToNode(MockTransportService.java:518) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>  at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:330) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:317) ~[main/:?]
  1>  at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [main/:?]
  1>  at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:106) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
  1> [2018-08-01T06:44:39,949][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] applying cluster state version 9
  1> [2018-08-01T06:44:39,949][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] apply cluster state with version 9
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.indices.cluster.IndicesClusterStateService@7eb553a5] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.script.ScriptService@32b8143d] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.ingest.PipelineStore@4f81ef5d] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.ingest.PipelineExecutionService@4a66f678] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.gateway.GatewayAllocator$$Lambda$1110/1454779260@1e9aacea] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.repositories.RepositoriesService@6cf15af8] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.snapshots.RestoreService@2ccabb3c] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.action.ingest.IngestActionForwarder@51625deb] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.tasks.TaskManager@1ad456d1] with change to version [9]
  1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.snapshots.SnapshotsService@7c94e9e9] with change to version [9]
  1> [2018-08-01T06:44:39,950][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.gateway.GatewayMetaState@5feae1c] with change to version [9]
  1> [2018-08-01T06:44:39,950][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] set locally applied cluster state to version 9
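
The reason the connect attempt delays the application is the ordering: before the appliers run (and hence before the no-master block becomes visible to a local cluster state request), the node waits for a connection attempt to every node in the incoming state, including the departed node_t3, and the log above shows that wait lasting the full 30-second connect attempt. A self-contained illustration of that pattern, with generic names rather than the real ClusterApplierService/NodeConnectionsService code:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Generic illustration of "connect to every node in the new state, then apply it": a single
    // slow connect attempt (the 30s timeout to the dead node_t3) holds up the whole application.
    public class ApplyAfterConnectSketch {
        public static void main(String[] args) throws InterruptedException {
            List<String> nodesInNewState = Arrays.asList("node_t0", "node_t1", "node_t3");
            ExecutorService executor = Executors.newFixedThreadPool(nodesInNewState.size());
            CountDownLatch connectAttemptsDone = new CountDownLatch(nodesInNewState.size());

            for (String node : nodesInNewState) {
                executor.execute(() -> {
                    try {
                        if (node.equals("node_t3")) {
                            TimeUnit.SECONDS.sleep(30); // stands in for the 30s connect timeout
                        }                               // live nodes connect (or are refused) quickly
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        connectAttemptsDone.countDown();
                    }
                });
            }

            connectAttemptsDone.await(); // cluster state application is held up at this point
            System.out.println("applying cluster state version 9 (no-master block visible only now)");
            executor.shutdown();
        }
    }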

The holdup is the 30-second connection timeout (see #29022). But what's strange is that this connection shouldn't be timing out at all: it's a connection on localhost to a port that is supposed to be closed, so it should be refused immediately. The port is definitely closed, because we see earlier connection refusals to it:

  1> [2018-08-01T06:44:09,210][WARN ][o.e.c.NodeConnectionsService] [node_t0] failed to connect to node {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257} (tried [1] times)
  1> org.elasticsearch.transport.ConnectTransportException: [node_t3][127.0.0.1:33257] connect_exception
...
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
  1> [2018-08-01T06:44:09,212][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:33257}{HNCsZ8-xQyCjOELP9QoZIQ}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_exception
...
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
  1> [2018-08-01T06:44:09,544][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:33257}{OrUqi6YARTeE8zFQdPUeXQ}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_exception
...
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
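
An immediate refusal is also what a plain socket connect to a closed localhost port normally produces, which can be checked in isolation with something like this (the port number is taken from the log; the check itself is purely illustrative):

    import java.net.ConnectException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    // Standalone check: connecting to a localhost port with no listener should fail fast with
    // "Connection refused" rather than hanging until a timeout like the 30s one seen above.
    public class LocalhostConnectCheck {
        public static void main(String[] args) throws Exception {
            InetSocketAddress closedPort = new InetSocketAddress("127.0.0.1", 33257); // node_t3's old port
            long startNanos = System.nanoTime();
            try (Socket socket = new Socket()) {
                socket.connect(closedPort, 30_000); // 30s limit, mirroring the transport connect timeout
                System.out.println("unexpectedly connected");
            } catch (ConnectException e) {
                System.out.println("refused after " + elapsedMillis(startNanos) + "ms: " + e.getMessage());
            } catch (SocketTimeoutException e) {
                System.out.println("timed out after " + elapsedMillis(startNanos) + "ms");
            }
        }

        private static long elapsedMillis(long startNanos) {
            return (System.nanoTime() - startNanos) / 1_000_000;
        }
    }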

After these initial refusals, further connection attempts triggered by pinging time out as well:

  1> [2018-08-01T06:44:10,877][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:43597}{Aaa7SBg9SuejUTdLBu4hqQ}{127.0.0.1}{127.0.0.1:43597}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:43597] connect_timeout[1s]
  1>  at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:114) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:646) ~[main/:?]
  1>  at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:125) ~[main/:?]
  1>  at org.elasticsearch.test.transport.MockTransportService$LookupTestTransport.openConnection(MockTransportService.java:533) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>  at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:349) ~[main/:?]
  1>  at org.elasticsearch.test.transport.MockTransportService.openConnection(MockTransportService.java:813) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>  at org.elasticsearch.discovery.zen.UnicastZenPing$PingingRound.getOrConnect(UnicastZenPing.java:365) ~[main/:?]
  1>  at org.elasticsearch.discovery.zen.UnicastZenPing$3.doRun(UnicastZenPing.java:472) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
  1> [2018-08-01T06:44:10,877][TRACE][o.e.d.z.UnicastZenPing   ] [node_t0] [2] failed to ping {127.0.0.1:33257}{hq5HYpQ7T86A4xKCxqscsQ}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_timeout[1s]
...
  1> [2018-08-01T06:44:10,949][TRACE][o.e.d.z.UnicastZenPing   ] [node_t1] [2] failed to ping {127.0.0.1:33257}{GgGLdU_aQei6je26MFUJ7A}{127.0.0.1}{127.0.0.1:33257}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_timeout[1s]
...
  1> [2018-08-01T06:44:10,949][TRACE][o.e.d.z.UnicastZenPing   ] [node_t1] [2] failed to ping {127.0.0.1:43597}{E3isLEliQjK6hgh12unCQg}{127.0.0.1}{127.0.0.1:43597}
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:43597] connect_timeout[1s]
...

Of course this doesn't reproduce locally, and I don't really have much idea what could lead to a connection on localhost timing out like this. Input and ideas for next steps would be appreciated.
