https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/6775/console shows a failure of o.e.c.MinimumMasterNodesIT.testMultipleNodesShutdownNonMasterNodes that was ultimately due to a connection timeout to a node that had been shut down.
At the start of the test there are three nodes; node_t0 is the master, and minimum_master_nodes is 3. Then node_t3 is shut down:
1> [2018-08-01T06:44:09,205][INFO ][o.e.t.InternalTestCluster] [testMultipleNodesShutdownNonMasterNodes] Closing random non master node [node_t3] current master [node_t0]
The master notices and updates its cluster state appropriately:
1> [2018-08-01T06:44:09,209][TRACE][o.e.c.s.MasterService ] [node_t0] will process [zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]
1> [2018-08-01T06:44:09,209][TRACE][o.e.c.s.MasterService ] [node_t0] will process [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected)]
1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.MasterService ] [node_t0] processing [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected), zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]: execute
1> [2018-08-01T06:44:09,209][WARN ][o.e.t.d.TestZenDiscovery ] [node_t0] not enough master nodes (has [2], but needed [3]), current nodes: nodes:
1> {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, local, master
1> {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}
1> {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
1> [2018-08-01T06:44:09,209][TRACE][o.e.d.z.NodeJoinController] [node_t0] starting an election context, will accumulate joins
1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.MasterService ] [node_t0] processing [zen-disco-node-failed({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(transport disconnected), zen-disco-node-left({node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}), reason(left)]: took [0s] no change in cluster state
1> [2018-08-01T06:44:09,209][DEBUG][o.e.c.s.ClusterApplierService] [node_t0] processing [not enough master nodes (has [2], but needed [3])]: execute
1> [2018-08-01T06:44:09,210][TRACE][o.e.c.s.ClusterApplierService] [node_t0] cluster state updated, source [not enough master nodes (has [2], but needed [3])]
1> cluster uuid: FZaCIUzwQQSJO1qlrDq-Rg
1> version: 9
1> state uuid: 9P8fB76MTe60o3DdFRiAVg
1> from_diff: false
1> meta data version: 6
1> [test/xM1OBH02QE2z4aIJH9p5Ug]: v[6]
1> 0: p_term [2], isa_ids [J_FCA5HgQ1C-7Zz6utjEyQ]
1> 1: p_term [1], isa_ids [GUI_NyhHSrCqcVxnkCEVMg]
1> 2: p_term [1], isa_ids [oIGUcqa_TaS5tPLryOUPBA]
1> 3: p_term [1], isa_ids [Yvg8EnkHSy-B9AOionROSw]
1> 4: p_term [2], isa_ids [zQZW2002TsGtb-AWpIeiPg]
1> metadata customs:
1> index-graveyard: IndexGraveyard[[]]
1> blocks:
1> _global_:
1> 2,no master, blocks WRITE,METADATA_WRITE
1> nodes:
1> {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, local
1> {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}
1> {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
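For reference, the setting in play here is discovery.zen.minimum_master_nodes. A hedged sketch of the corresponding configuration (the test cluster derives its settings internally, so this is illustrative only):

```java
import org.elasticsearch.common.settings.Settings;

// Sketch only: with three master-eligible nodes and minimum_master_nodes = 3,
// losing any one node drops the cluster below quorum, which is why the master
// installs the no-master block seen in the log above.
Settings settings = Settings.builder()
        .put("discovery.zen.minimum_master_nodes", 3)
        .build();
```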
About 700ms later, node_t1 notices that node_t0 is no longer the master, and also updates its cluster state:
1> [2018-08-01T06:44:09,947][DEBUG][o.e.d.z.MasterFaultDetection] [node_t1] [master] pinging a master {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729} that is no longer a master
1> [2018-08-01T06:44:09,947][DEBUG][o.e.d.z.MasterFaultDetection] [node_t1] [master] stopping fault detection against master [{node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}], reason [master failure, no longer master]
1> [2018-08-01T06:44:09,947][INFO ][o.e.t.d.TestZenDiscovery ] [node_t1] master_left [{node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}], reason [no longer master]
1> org.elasticsearch.transport.RemoteTransportException: [node_t0][127.0.0.1:53240][internal:discovery/zen/fd/master_ping]
1> Caused by: org.elasticsearch.cluster.NotMasterException: local node is not master
1> [2018-08-01T06:44:09,948][WARN ][o.e.t.d.TestZenDiscovery ] [node_t1] master left (reason = no longer master), current nodes: nodes:
1> {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}, master
1> {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}, local
1> {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
1> [2018-08-01T06:44:09,948][TRACE][o.e.d.z.NodeJoinController] [node_t1] starting an election context, will accumulate joins
1> [2018-08-01T06:44:09,948][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] processing [master left (reason = no longer master)]: execute
1> [2018-08-01T06:44:09,948][TRACE][o.e.c.s.ClusterApplierService] [node_t1] cluster state updated, source [master left (reason = no longer master)]
1> cluster uuid: FZaCIUzwQQSJO1qlrDq-Rg
1> version: 9
1> state uuid: 9P8fB76MTe60o3DdFRiAVg
1> from_diff: false
1> meta data version: 6
1> [test/xM1OBH02QE2z4aIJH9p5Ug]: v[6]
1> 0: p_term [2], isa_ids [J_FCA5HgQ1C-7Zz6utjEyQ]
1> 1: p_term [1], isa_ids [GUI_NyhHSrCqcVxnkCEVMg]
1> 2: p_term [1], isa_ids [oIGUcqa_TaS5tPLryOUPBA]
1> 3: p_term [1], isa_ids [Yvg8EnkHSy-B9AOionROSw]
1> 4: p_term [2], isa_ids [zQZW2002TsGtb-AWpIeiPg]
1> metadata customs:
1> index-graveyard: IndexGraveyard[[]]
1> blocks:
1> _global_:
1> 2,no master, blocks WRITE,METADATA_WRITE
1> nodes:
1> {node_t0}{3daZWUaxQfGt9a7wv6hKNA}{pzu-p-qUSByPXCP7SOSdFA}{127.0.0.1}{127.0.0.1:35729}
1> {node_t1}{N7pZoArIQkagoOjazUUnpg}{CHz8p-1ZSWK6x_N0-Frh7Q}{127.0.0.1}{127.0.0.1:36663}, local
1> {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257}
However, we busy-wait for up to 20 seconds for both nodes to report a NO_MASTER_BLOCK in their cluster states, and node_t1 never does:
...
1> [2018-08-01T06:44:27,271][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@109a20f8 NO_MASTER_BLOCK: [true]
1> [2018-08-01T06:44:27,272][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@33e1bfab NO_MASTER_BLOCK: [false]
1> [2018-08-01T06:44:29,250][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@109a20f8 NO_MASTER_BLOCK: [true]
1> [2018-08-01T06:44:29,250][DEBUG][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] Checking for NO_MASTER_BLOCK on client: org.elasticsearch.transport.MockTransportClient@33e1bfab NO_MASTER_BLOCK: [false]
1> [2018-08-01T06:44:29,251][INFO ][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes] [MinimumMasterNodesIT#testMultipleNodesShutdownNonMasterNodes]: cleaning up after test
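For context, the check in question is a busy-wait over both transport clients. A minimal sketch of the shape of that assertion, assuming ESTestCase.assertBusy and the standard cluster state API (the actual test code may differ in detail):

```java
// Fragment of a test method: both clients must observe the no-master block
// within 20 seconds. Here the client attached to node_t1 never does, so the
// assertion runs out of time.
assertBusy(() -> {
    for (Client client : clients) {
        ClusterState state = client.admin().cluster().prepareState()
                .setLocal(true).get().getState();
        assertTrue(state.blocks().hasGlobalBlock(DiscoverySettings.NO_MASTER_BLOCK_ID));
    }
}, 20, TimeUnit.SECONDS);
```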
It's a further 10 seconds (i.e. 30 seconds since the cluster state application started) before node_t1 completes the cluster state application:
1> [2018-08-01T06:44:39,949][WARN ][o.e.c.NodeConnectionsService] [node_t1] failed to connect to node {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257} (tried [1] times)
1> org.elasticsearch.transport.ConnectTransportException: [node_t3][127.0.0.1:33257] connect_timeout[30s]
1> at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:114) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:646) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:545) ~[main/:?]
1> at org.elasticsearch.test.transport.MockTransportService$LookupTestTransport.connectToNode(MockTransportService.java:518) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:330) ~[main/:?]
1> at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:317) ~[main/:?]
1> at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [main/:?]
1> at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:106) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
1> [2018-08-01T06:44:39,949][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] applying cluster state version 9
1> [2018-08-01T06:44:39,949][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] apply cluster state with version 9
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.indices.cluster.IndicesClusterStateService@7eb553a5] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.script.ScriptService@32b8143d] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.ingest.PipelineStore@4f81ef5d] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.ingest.PipelineExecutionService@4a66f678] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.gateway.GatewayAllocator$$Lambda$1110/1454779260@1e9aacea] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.repositories.RepositoriesService@6cf15af8] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.snapshots.RestoreService@2ccabb3c] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.action.ingest.IngestActionForwarder@51625deb] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.tasks.TaskManager@1ad456d1] with change to version [9]
1> [2018-08-01T06:44:39,949][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.snapshots.SnapshotsService@7c94e9e9] with change to version [9]
1> [2018-08-01T06:44:39,950][TRACE][o.e.c.s.ClusterApplierService] [node_t1] calling [org.elasticsearch.gateway.GatewayMetaState@5feae1c] with change to version [9]
1> [2018-08-01T06:44:39,950][DEBUG][o.e.c.s.ClusterApplierService] [node_t1] set locally applied cluster state to version 9
The holdup is the 30-second connection timeout (see #29022). What's strange is that this connection shouldn't be timing out at all: it's a connection on localhost to a port that is supposed to be closed, so it should be refused immediately (a standalone check of this expectation is sketched after the log excerpt below). The port is definitely closed, because we see earlier connection refusals:
1> [2018-08-01T06:44:09,210][WARN ][o.e.c.NodeConnectionsService] [node_t0] failed to connect to node {node_t3}{CUgOeW0-THaez0RC0m07qw}{KYX9YQH1Sl2oz3JPFKazWA}{127.0.0.1}{127.0.0.1:33257} (tried [1] times)
1> org.elasticsearch.transport.ConnectTransportException: [node_t3][127.0.0.1:33257] connect_exception
...
1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
1> [2018-08-01T06:44:09,212][TRACE][o.e.d.z.UnicastZenPing ] [node_t0] [2] failed to ping {127.0.0.1:33257}{HNCsZ8-xQyCjOELP9QoZIQ}{127.0.0.1}{127.0.0.1:33257}
1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_exception
...
1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
1> [2018-08-01T06:44:09,544][TRACE][o.e.d.z.UnicastZenPing ] [node_t0] [2] failed to ping {127.0.0.1:33257}{OrUqi6YARTeE8zFQdPUeXQ}{127.0.0.1}{127.0.0.1:33257}
1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_exception
...
1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
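As a standalone sanity check (plain JDK, port number taken from the log but otherwise hypothetical), connecting to a closed loopback port should be refused near-instantly rather than consuming its timeout:

```java
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class LocalhostRefusal {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            // 30s timeout, mirroring the transport's connect timeout; on a healthy
            // loopback interface this fails with "Connection refused" almost at once.
            socket.connect(new InetSocketAddress("127.0.0.1", 33257), 30_000);
            System.out.println("unexpectedly connected");
        } catch (ConnectException e) {
            long tookMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("refused after " + tookMillis + "ms: " + e.getMessage());
        }
    }
}
```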
After this, there are further connection attempts, triggered by pinging, which time out as well:
1> [2018-08-01T06:44:10,877][TRACE][o.e.d.z.UnicastZenPing ] [node_t0] [2] failed to ping {127.0.0.1:43597}{Aaa7SBg9SuejUTdLBu4hqQ}{127.0.0.1}{127.0.0.1:43597}
1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:43597] connect_timeout[1s]
1> at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:114) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:646) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:125) ~[main/:?]
1> at org.elasticsearch.test.transport.MockTransportService$LookupTestTransport.openConnection(MockTransportService.java:533) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:349) ~[main/:?]
1> at org.elasticsearch.test.transport.MockTransportService.openConnection(MockTransportService.java:813) ~[framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.discovery.zen.UnicastZenPing$PingingRound.getOrConnect(UnicastZenPing.java:365) ~[main/:?]
1> at org.elasticsearch.discovery.zen.UnicastZenPing$3.doRun(UnicastZenPing.java:472) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
1> [2018-08-01T06:44:10,877][TRACE][o.e.d.z.UnicastZenPing ] [node_t0] [2] failed to ping {127.0.0.1:33257}{hq5HYpQ7T86A4xKCxqscsQ}{127.0.0.1}{127.0.0.1:33257}
1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_timeout[1s]
...
1> [2018-08-01T06:44:10,949][TRACE][o.e.d.z.UnicastZenPing ] [node_t1] [2] failed to ping {127.0.0.1:33257}{GgGLdU_aQei6je26MFUJ7A}{127.0.0.1}{127.0.0.1:33257}
1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33257] connect_timeout[1s]
...
1> [2018-08-01T06:44:10,949][TRACE][o.e.d.z.UnicastZenPing ] [node_t1] [2] failed to ping {127.0.0.1:43597}{E3isLEliQjK6hgh12unCQg}{127.0.0.1}{127.0.0.1:43597}
1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:43597] connect_timeout[1s]
...
Of course this doesn't reproduce locally, and I don't have much of an idea what could cause a connection on localhost to time out like this. Input and ideas for next steps would be appreciated.