[Optimization]: During reroute, send GatewayAllocator's async shard-fetch requests on the generic threadpool instead of masterService#updateTask #57498

@shwetathareja

Elasticsearch version (bin/elasticsearch --version): 7.1

Description of the problem including expected versus actual behavior:

This cluster had roughly 50K shards, and multiple nodes had disconnected from the master. A hot_threads dump showed that the masterService#updateTask thread was spending a lot of time sending TransportNodesListGatewayStartedShards requests. These calls are asynchronous and their responses are handled on a separate threadpool, but the transport sendRequest itself still executes on masterService#updateTask. The sending logic could also be moved to a separate threadpool.

With #42855 this no longer happens in the context of JoinTaskExecutor, but it still happens whenever a reroute task is scheduled.
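A minimal sketch of the proposed change, using only java.util.concurrent. FakeTransport, senderThread and the thread names are hypothetical stand-ins, not Elasticsearch classes; the assumption modelled here is the one observed in the trace, namely that the "async" send still performs the channel write on whichever thread calls it, while only the response is handled elsewhere. Forking the send to a generic executor lets the master thread return immediately.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: none of these types are Elasticsearch classes.
public class ForkedSendSketch {

    // Stand-in for a transport whose response handling is async,
    // but whose write happens synchronously on the calling thread.
    static final class FakeTransport {
        private final ExecutorService responseExecutor;

        FakeTransport(ExecutorService responseExecutor) {
            this.responseExecutor = responseExecutor;
        }

        // Reports, via onResponse, the name of the thread that did the write.
        // onResponse itself runs on responseExecutor.
        void send(String request, Consumer<String> onResponse) {
            String writer = Thread.currentThread().getName(); // the "write" happens here
            responseExecutor.execute(() -> onResponse.accept(writer));
        }
    }

    static ExecutorService named(String name) {
        return Executors.newSingleThreadExecutor(r -> new Thread(r, name));
    }

    // Runs one fetch from a task on the master thread and returns the name of
    // the thread that performed the send. forkToGeneric hands the send off.
    static String senderThread(boolean forkToGeneric) throws Exception {
        ExecutorService master = named("masterService#updateTask");
        ExecutorService generic = named("generic");
        ExecutorService responses = named("response-handler");
        FakeTransport transport = new FakeTransport(responses);
        CompletableFuture<String> writer = new CompletableFuture<>();
        try {
            master.submit(() -> {
                Runnable send = () -> transport.send("list-started-shards", writer::complete);
                if (forkToGeneric) {
                    generic.execute(send); // master task returns without writing
                } else {
                    send.run(); // master thread pays for the write
                }
            }).get();
            return writer.get(5, TimeUnit.SECONDS);
        } finally {
            master.shutdown();
            generic.shutdown();
            responses.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inline send ran on: " + senderThread(false));
        System.out.println("forked send ran on: " + senderThread(true));
    }
}
```

Running main prints `masterService#updateTask` for the inline case and `generic` for the forked case. In both cases the response is handled on its own executor; forking only moves the write, so the master task can finish before the request has actually been sent.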

Provide logs (if relevant):

80.1% (8s out of 10s) cpu usage by thread 'elasticsearch[7b5db5][masterService#updateTask][T#1]'
     7/10 snapshots sharing following 54 elements
       sun.nio.ch.EPollArrayWrapper.interrupt(Native Method)
       sun.nio.ch.EPollArrayWrapper.interrupt(EPollArrayWrapper.java:317)
       sun.nio.ch.EPollSelectorImpl.wakeup(EPollSelectorImpl.java:207)
       io.netty.channel.nio.NioEventLoop.wakeup(NioEventLoop.java:719)
       io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:799)
       io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1013)
       io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:825)
       io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
       io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1066)
       io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:309)
       org.elasticsearch.transport.netty4.Netty4TcpChannel.sendMessage(Netty4TcpChannel.java:139)
       org.elasticsearch.transport.OutboundHandler.internalSendMessage(OutboundHandler.java:80)
       org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:70)
       org.elasticsearch.transport.TcpTransport.sendRequestToChannel(TcpTransport.java:680)
       org.elasticsearch.transport.TcpTransport.sendRequestToChannel(TcpTransport.java:669)
       org.elasticsearch.transport.TcpTransport.access$300(TcpTransport.java:100)
       org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:272)
       org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:633)
       org.elasticsearch.transport.TransportService$$Lambda$1519/1694636980.sendRequest(Unknown Source)
       org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:543)
       org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:531)
       org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:182)
       org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:82)
       org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:51)
       org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:145)
       com.amazon.opendistro.elasticsearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:77)
       org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:143)
       org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:121)
       org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:64)
       org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.list(TransportNodesListGatewayStartedShards.java:91)
       org.elasticsearch.gateway.AsyncShardFetch.asyncFetch(AsyncShardFetch.java:283)
       org.elasticsearch.gateway.AsyncShardFetch.fetchData(AsyncShardFetch.java:126)
       org.elasticsearch.gateway.GatewayAllocator$InternalPrimaryShardAllocator.fetchData(GatewayAllocator.java:159)
       org.elasticsearch.gateway.PrimaryShardAllocator.makeAllocationDecision(PrimaryShardAllocator.java:86)
       org.elasticsearch.gateway.BaseGatewayShardAllocator.allocateUnassigned(BaseGatewayShardAllocator.java:59)
       org.elasticsearch.gateway.GatewayAllocator.innerAllocatedUnassigned(GatewayAllocator.java:114)
       org.elasticsearch.gateway.GatewayAllocator.allocateUnassigned(GatewayAllocator.java:104)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:410)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:378)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:361)
       org.elasticsearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:155)
       org.elasticsearch.cluster.coordination.JoinHelper$1.execute(JoinHelper.java:118)
       org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687)
       org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310)
       org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210)
       org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142)
       org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
       org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:690)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
