ClusterApplierService stuck for minutes while establishing connections to another node due to mismatched ephemeralId (#56979)


**Elasticsearch version** (`bin/elasticsearch --version`): 7.1

**Plugins installed**: []

**Description of the problem including expected versus actual behavior**:

ClusterApplierService on master was stuck for 48m until it was eventually interrupted.

On further investigation, it was found to be stuck establishing a node connection, which repeatedly failed with the error **handshake failed. unexpected remote node**.

**Issue:** The ES process was restarted on the target node, which changed its ephemeralId (previous: pi9bH-T5RFOTl4JB8niS-w, new: izRTvB7KRVCenyA20GdH6A) and resulted in the unexpected remote node exception.
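To illustrate why every retry fails, here is a minimal Java sketch of the identity check. It assumes a node is identified by a persistent node ID plus an ephemeral ID that is regenerated on each process start. `HandshakeSketch`, `NodeIdentity` and `validateHandshake` are hypothetical names for illustration, not the Elasticsearch implementation; the IDs are the ones from the logs in this report.

```java
import java.util.Objects;

// Illustrative sketch only; these are hypothetical names, not Elasticsearch classes.
final class HandshakeSketch {

    /** Hypothetical model of a node's identity as recorded by the master. */
    static final class NodeIdentity {
        final String nodeId;      // persistent: survives a process restart
        final String ephemeralId; // assumed to be regenerated on every process start

        NodeIdentity(String nodeId, String ephemeralId) {
            this.nodeId = Objects.requireNonNull(nodeId);
            this.ephemeralId = Objects.requireNonNull(ephemeralId);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof NodeIdentity)) return false;
            NodeIdentity other = (NodeIdentity) o;
            return nodeId.equals(other.nodeId) && ephemeralId.equals(other.ephemeralId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(nodeId, ephemeralId);
        }

        @Override
        public String toString() {
            return "{" + nodeId + "}{" + ephemeralId + "}";
        }
    }

    /** Rejects the connection unless the remote node is exactly the expected one. */
    static void validateHandshake(NodeIdentity expected, NodeIdentity remote) {
        if (!expected.equals(remote)) {
            throw new IllegalStateException("handshake failed. unexpected remote node " + remote);
        }
    }

    public static void main(String[] args) {
        NodeIdentity expected = new NodeIdentity("0CZIV4StTSCJ9uy607yYdA", "pi9bH-T5RFOTl4JB8niS-w");
        NodeIdentity restarted = new NodeIdentity("0CZIV4StTSCJ9uy607yYdA", "izRTvB7KRVCenyA20GdH6A");
        try {
            validateHandshake(expected, restarted);
        } catch (IllegalStateException e) {
            // The node ID matches but the ephemeral ID does not, so the check fails.
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model the node ID still matches after the restart, but the ephemeral ID never will, so repeating the same connection attempt cannot succeed.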

The change in #39629, which is present in 7.2.0 onwards, would reduce the occurrence by establishing connections only to new nodes and not to already disconnected nodes. However, this could still happen with a new node (in versions > 7.1.1) if the ES process is restarted during cluster state processing.
 
ClusterApplierService shouldn't be stuck while establishing a connection when the ephemeralId mismatches. The node should be removed from the cluster so that it can join again.
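The requested behavior could be sketched roughly as follows: treat an identity mismatch as a permanent failure and stop retrying, while still retrying failures that may be transient. This is only an illustration of the idea under stated assumptions; `ConnectPolicySketch`, `Outcome`, `IdentityMismatchException` and `connect` are hypothetical names and do not correspond to Elasticsearch APIs or to how a fix would necessarily be implemented.

```java
// Illustrative sketch only; all names here are hypothetical, not Elasticsearch APIs.
final class ConnectPolicySketch {

    enum Outcome { CONNECTED, REMOVE_NODE, GAVE_UP }

    /** Signals that the remote node answered with a different ephemeralId. */
    static final class IdentityMismatchException extends RuntimeException {
        IdentityMismatchException(String message) {
            super(message);
        }
    }

    /** Runs the connection attempt up to maxAttempts times, stopping early on a mismatch. */
    static Outcome connect(Runnable attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.run();
                return Outcome.CONNECTED;
            } catch (IdentityMismatchException e) {
                // Retrying cannot help: the process instance the master expects is gone.
                return Outcome.REMOVE_NODE;
            } catch (RuntimeException e) {
                // Assumed transient (for example, connection refused); try again.
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        Outcome outcome = connect(() -> {
            throw new IdentityMismatchException("unexpected remote node");
        }, 85);
        System.out.println(outcome); // prints REMOVE_NODE
    }
}
```

With such a policy the caller would return after the first mismatched handshake instead of after many attempts, and could then remove the node so that the restarted process rejoins with its new ephemeralId.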

**Steps to reproduce**:

1. A node sends a join request to the master.
2. During cluster state processing, the ES process is restarted on the new node, causing its ephemeralId to change.
3. This causes ClusterApplierService to get stuck while establishing connections.

**Provide logs (if relevant)**:

```
[2020-05-01T03:41:40,986][WARN ][o.e.c.s.MasterService ] [2e383c] failed to publish updated cluster state in [48.9m]: version [119560], uuid [VhfHxS62SPewZa2IpQ7elg], source [elected-as-master ([3] nodes joined) ..
java.lang.IllegalStateException: Future got interrupted
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:256) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:690) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.1.1.jar:7.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_172]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_172]
```

```
[2020-05-01T03:09:28,186][WARN ][o.e.c.NodeConnectionsService] [2e383c] failed to connect to node {158fe9}{0CZIV4StTSCJ9uy607yYdA}{pi9bH-T5RFOTl4JB8niS-w}{x.x.x.x}{x.x.x.x:9300} (tried [85] times)
org.elasticsearch.transport.ConnectTransportException: [158fe9][x.x.x.x:9300] handshake failed. unexpected remote node {158fe9}{0CZIV4StTSCJ9uy607yYdA}{izRTvB7KRVCenyA20GdH6A}
	at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:344) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:331) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [elasticsearch-7.1.1.jar:7.1.1]
...
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
```


