Shards get stuck initializing on a large cluster
Elasticsearch version:
"version": {
"number": "5.5.1",
"build_hash": "19c13d0",
"build_date": "2017-07-18T20:44:24.823Z",
"build_snapshot": false,
"lucene_version": "6.6.0"
}
Plugins installed: [analysis-icu, analysis-kuromoji, mapper-murmur3, repository-s3, custom-discovery]
JVM version (java -version):
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
OS version (uname -a):
Linux ... 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem and steps to reproduce
In this particular case I have a cluster with 3 dedicated masters and 120 data nodes, and I create an index with 360 shards and 2 replicas. It is possible to reproduce this on an index with fewer shards; it just takes longer.
After several successful attempts to create 360-shard indices, you eventually get an index with a shard stuck in the INITIALIZING state. After ~15 minutes the problematic shard gets unstuck, is allocated, and starts.
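A minimal sketch of how I reproduce and watch for the stuck shard (the loop, index names, and endpoint are illustrative; port 9004 is the http.port from elasticsearch.yml below):
ES=http://localhost:9004
for i in $(seq 1 50); do
  index="test360_$i"
  # indexing one document auto-creates the index from the test360_* template
  curl -s -XPOST "$ES/$index/default" -d '{"msg": "hello"}' > /dev/null
  # wait up to 5 minutes for green; on timeout, look at the shards that are not STARTED
  health=$(curl -s "$ES/_cluster/health/$index?wait_for_status=green&timeout=5m")
  if ! echo "$health" | grep -q '"status":"green"'; then
    echo "$index did not go green"
    curl -s "$ES/_cat/shards/$index?v" | grep -v STARTED
    curl -s "$ES/$index/_recovery?active_only=true&pretty"
  fi
done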
Cluster configuration
Here's the index template that is applied to these indices:
{
  "order": 0,
  "template": "test360_*",
  "settings": {
    "index": {
      "refresh_interval": "15s",
      "number_of_shards": "360",
      "number_of_replicas": "2"
    }
  },
  "mappings": {}
}
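For completeness, the template is installed with the index template API; a minimal sketch of the equivalent request (the template name here is just a placeholder, it isn't part of the report):
curl -s -XPUT "http://localhost:9004/_template/test360" -d '{
  "order": 0,
  "template": "test360_*",
  "settings": {
    "index": {
      "refresh_interval": "15s",
      "number_of_shards": "360",
      "number_of_replicas": "2"
    }
  },
  "mappings": {}
}'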
Here's the request I send to create an index (it indexes a single document, which auto-creates the index from the template above):
POST /test360_19/default
{
"msg": "hello"
}
elasticsearch.yml:
bootstrap.memory_lock: true
node.max_local_storage_nodes: 1
action.destructive_requires_name: true
network.bind_host:
- _local_
- _global_
network.publish_host: _global_
transport.tcp.port: 9002
http.port: 9004
discovery.zen.ping_timeout: 30s
discovery.zen.publish_timeout: 60s
discovery.zen.fd.ping_interval: 30s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 10
discovery.zen.minimum_master_nodes: 2
script.engine.painless.inline: true
script.engine.painless.stored.search: true
cluster.routing.rebalance.enable: all
cluster.routing.allocation.cluster_concurrent_rebalance: 4
cluster.routing.allocation.node_concurrent_recoveries: 4
cluster.routing.allocation.disk.watermark.low: 80%
cluster.routing.allocation.disk.watermark.high: 85%
thread_pool.bulk.queue_size: 2000
thread_pool.index.queue_size: 2000
thread_pool.search.queue_size: 1000
logger.org.elasticsearch.indices.recovery: DEBUG
logger.org.elasticsearch.cluster.routing: DEBUG
logger.org.elasticsearch.cluster.action: DEBUG
logger.org.elasticsearch.cluster.service: DEBUG
logger.org.elasticsearch.indices.cluster: DEBUG
reindex.remote.whitelist: '*:9004'
cluster.name: es_cluster_120
node.name: node_az3_006a1fb63b6
path.data: /data/es
path.logs: /logs/es
http.cors.enabled: true
http.cors.allow-origin: '*'
discovery.zen.hosts_provider: custom-discovery
cluster.routing.allocation.awareness.attributes: rack_id
node.attr.rack_id: az3
node.master: false
node.data: true
node.ingest: false
That is the configuration for data nodes; master nodes have node.master: true and node.data: false instead.
Additional settings:
{
  "persistent": {},
  "transient": {
    "logger": {
      "org": {
        "elasticsearch": {
          "TransportService": {
            "tracer": "TRACE"
          }
        }
      }
    },
    "transport": {
      "tracer": {
        "include": "internal:index/shard/*"
      }
    }
  }
}
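These were set at runtime through the cluster settings API; a minimal sketch of the equivalent request (endpoint is illustrative, the keys mirror the dump above):
curl -s -XPUT "http://localhost:9004/_cluster/settings" -d '{
  "transient": {
    "logger.org.elasticsearch.TransportService.tracer": "TRACE",
    "transport.tracer.include": "internal:index/shard/*"
  }
}'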
jvm.options:
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms15338M
-Xmx15338M
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
## optimizations
# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch
## basic
# force the server VM (remove on 32-bit client JVMs)
-server
# explicitly set the stack size (reduce to 320k on 32-bit client JVMs)
-Xss1m
# set to headless, just in case
-Djava.awt.headless=true
# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8
# use our provided JNA always versus the system one
-Djna.nosys=true
# use old-style file permissions on JDK9
-Djdk.io.permissionsUseCanonicalPath=true
# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps
# ensure the directory exists and has sufficient space
-XX:HeapDumpPath=/logs/es/es_heap_dump.hprof
## GC logging
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
# log GC status to a file with time stamps
# ensure the directory exists
-Xloggc:/logs/es/es_gc-%t.log
# By default, the GC log file will not rotate.
# By uncommenting the lines below, the GC log file
# will be rotated every 128MB at most 32 times.
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=16
-XX:GCLogFileSize=128M
# Elasticsearch 5.0.0 will throw an exception on unquoted field names in JSON.
# If documents were already indexed with unquoted fields in a previous version
# of Elasticsearch, some operations may throw errors.
#
# WARNING: This option will be removed in Elasticsearch 6.0.0 and is provided
# only for migration purposes.
#-Delasticsearch.json.allow_unquoted_field_names=true
Command line of the actual process:
elastic+ 21904 1 10 Aug17 ? 01:57:51 /usr/bin/java -Xms15338M -Xmx15338M -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/logs/es/es_heap_dump.hprof -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/logs/es/es_gc-%t.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=16 -XX:GCLogFileSize=128M -Des.path.home=/apps/elasticsearch -cp /apps/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /run/elasticsearch/elasticsearch.pid --quiet
Process limits:
$ cat /proc/21904/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1967818              1967818              processes
Max open files            100000               100000               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1967818              1967818              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Logs
There aren't many mentions of the problematic shard, but when it gets unstuck I consistently observe the following exception on both the relevant data node and the active master.
Data node that had the problematic shard:
...
[2017-08-18T17:02:57,201][DEBUG][o.e.c.s.ClusterService ] [node_az3_006a1fb63b6] applying cluster state version 1171
[2017-08-18T17:02:57,201][DEBUG][o.e.c.s.ClusterService ] [node_az3_006a1fb63b6] set local cluster state to version 1171
[2017-08-18T17:02:57,212][DEBUG][o.e.c.s.ClusterService ] [node_az3_006a1fb63b6] processing [zen-disco-receive(from master [master {node_az2_084354b0d5}{86-R436YR56aHOtxCptzjg}{YSViwD-tQhSJqe3y9w6dWw}{100.0.0.2}{100.0.0.2:9002}{rack_id=az2} committed version [1171]])]: took [11ms] done applying updated cluster_state (version: 1171, uuid: -U_RCKbTSIG6kiRDtuWSmg)
[2017-08-18T17:17:55,337][TRACE][o.e.t.T.tracer ] [node_az3_006a1fb63b6] [3262984][internal:index/shard/recovery/start_recovery] received response from [{node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2}]
[2017-08-18T17:17:55,338][WARN ][o.e.i.c.IndicesClusterStateService] [node_az3_006a1fb63b6] [[test360_19][143]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:314) [elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73) [elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556) [elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.1.jar:5.5.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node_az2_0ff81340d0bc][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [1] files with total size of [162b]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
... 5 more
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:951) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.5.1.jar:5.5.1]
... 3 more
[2017-08-18T17:17:55,339][DEBUG][o.e.c.a.s.ShardStateAction] [node_az3_006a1fb63b6] [test360_19][143] sending [internal:cluster/shard/failure] to [86-R436YR56aHOtxCptzjg] for shard entry [shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]
[2017-08-18T17:17:55,619][DEBUG][o.e.c.s.ClusterService ] [node_az3_006a1fb63b6] processing [zen-disco-receive(from master [master {node_az2_084354b0d5}{86-R436YR56aHOtxCptzjg}{YSViwD-tQhSJqe3y9w6dWw}{100.0.0.2}{100.0.0.2:9002}{rack_id=az2} committed version [1172]])]: execute
[2017-08-18T17:17:55,619][DEBUG][o.e.c.s.ClusterService ] [node_az3_006a1fb63b6] cluster state updated, version [1172], source [zen-disco-receive(from master [master {node_az2_084354b0d5}{86-R436YR56aHOtxCptzjg}{YSViwD-tQhSJqe3y9w6dWw}{100.0.0.2}{100.0.0.2:9002}{rack_id=az2} committed version [1172]])]
[2017-08-18T17:17:55,620][DEBUG][o.e.c.s.ClusterService ] [node_az3_006a1fb63b6] applying cluster state version 1172
[2017-08-18T17:17:55,620][DEBUG][o.e.c.s.ClusterService ] [node_az3_006a1fb63b6] set local cluster state to version 1172
...
Even though this is a connection-related exception, I don't see any other exceptions that would indicate a genuine network issue between the two nodes.
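For reference, a few read-only checks that help distinguish a recovery that is genuinely stalled from one that is merely queued behind other work (commands are illustrative):
ES=http://localhost:9004
curl -s "$ES/test360_19/_recovery?active_only=true&pretty"   # per-shard recovery stage and timings
curl -s "$ES/_cluster/health?pretty"                         # counts of initializing/unassigned shards
curl -s "$ES/_cluster/pending_tasks?pretty"                  # backlog of cluster-state tasks on the master
curl -s "$ES/_nodes/hot_threads?threads=5"                   # what busy threads are doing on each node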
Active master:
...
[2017-08-18T17:02:56,068][DEBUG][o.e.c.a.s.ShardStateAction] [us-east-1d.i-086432e5a1d4b01d5] [test360_19][143] starting shard [test360_19][143], node[ljMLTQZbRce8245_yhGLGA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=FBeIYPFeSWq8HwuzBwdxQw], unassigned_info[[reason=INDEX_CREATED], at[2017-08-18T17:02:53.615Z], delayed=false, allocation_status[no_attempt]] (shard started task: [shard id [[test360_19][143]], allocation id [FBeIYPFeSWq8HwuzBwdxQw], primary term [0], message [after peer recovery]])
...
[2017-08-18T17:02:57,189][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] processing [put-mapping[default]]: execute
[2017-08-18T17:02:57,192][INFO ][o.e.c.m.MetaDataMappingService] [node_az2_084354b0d5] [test360_19/Okz9GdoYTgqnBBDr0KDRdQ] create_mapping [default]
[2017-08-18T17:02:57,192][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] cluster state updated, version [1171], source [put-mapping[default]]
[2017-08-18T17:02:57,192][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] publishing cluster state version [1171]
[2017-08-18T17:02:57,303][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] applying cluster state version 1171
[2017-08-18T17:02:57,303][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] set local cluster state to version 1171
[2017-08-18T17:02:57,315][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] processing [put-mapping[default]]: took [126ms] done applying updated cluster_state (version: 1171, uuid: -U_RCKbTSIG6kiRDtuWSmg)
[2017-08-18T17:17:55,340][WARN ][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] received shard failed for shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:314) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.1.jar:5.5.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [1] files with total size of [162b]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
... 5 more
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:951) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.5.1.jar:5.5.1]
... 3 more
[2017-08-18T17:17:55,342][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] processing [shard-failed[shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]]: execute
[2017-08-18T17:17:55,342][DEBUG][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] failing shard [test360_19][143], node[b-sQwH5dQQixJ9fN27l2eA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uLI9lUk3To2entCZ-KQcAw], unassigned_info[[reason=INDEX_CREATED], at[2017-08-18T17:02:53.615Z], delayed=false, allocation_status[no_attempt]] (shard failed task: [shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]])
[2017-08-18T17:17:55,348][DEBUG][o.e.c.r.a.AllocationService] [node_az2_084354b0d5] [test360_19][143] failing shard [test360_19][143], node[b-sQwH5dQQixJ9fN27l2eA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uLI9lUk3To2entCZ-KQcAw], unassigned_info[[reason=INDEX_CREATED], at[2017-08-18T17:02:53.615Z], delayed=false, allocation_status[no_attempt]] with unassigned info ([reason=ALLOCATION_FAILED], at[2017-08-18T17:17:55.348Z], failed_attempts[1], delayed=false, details[failed recovery, failure RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ], allocation_status[no_attempt])
[2017-08-18T17:17:55,582][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node_az2_084354b0d5] skipping rebalance due to in-flight shard/store fetches
[2017-08-18T17:17:55,600][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] cluster state updated, version [1172], source [shard-failed[shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]]
[2017-08-18T17:17:55,600][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] publishing cluster state version [1172]
[2017-08-18T17:17:55,843][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] applying cluster state version 1172
[2017-08-18T17:17:55,843][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] set local cluster state to version 1172
[2017-08-18T17:17:55,852][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] processing [shard-failed[shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]]: took [510ms] done applying updated cluster_state (version: 1172, uuid: I7EG7cqgTKGbqp2thv2ZKw)
[2017-08-18T17:17:55,852][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] processing [cluster_reroute(async_shard_fetch)]: execute
[2017-08-18T17:17:56,155][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] cluster state updated, version [1173], source [cluster_reroute(async_shard_fetch)]
[2017-08-18T17:17:56,156][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] publishing cluster state version [1173]
[2017-08-18T17:17:56,205][DEBUG][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] received shard started for [shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery]]
[2017-08-18T17:17:56,459][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] applying cluster state version 1173
[2017-08-18T17:17:56,459][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] set local cluster state to version 1173
[2017-08-18T17:17:56,467][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] processing [cluster_reroute(async_shard_fetch)]: took [614ms] done applying updated cluster_state (version: 1173, uuid: F8Vn_O1RSV-A4I1deTv_8Q)
[2017-08-18T17:17:56,467][DEBUG][o.e.c.s.ClusterService ] [node_az2_084354b0d5] processing [shard-started shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery][shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery]]]: execute
[2017-08-18T17:17:56,467][DEBUG][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] starting shard [test360_19][143], node[pw31xd1SR6Gino4TrbqZpw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uAUMBccQRji9i6oeO9yCSA], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-08-18T17:17:55.348Z], failed_attempts[1], delayed=false, details[failed recovery, failure RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ], allocation_status[no_attempt]], expected_shard_size[162] (shard started task: [shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery]])
[2017-08-18T17:17:57,044][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node_az2_084354b0d5] Relocate shard [[test360_19][1], node[pw31xd1SR6Gino4TrbqZpw], [R], s[STARTED], a[id=7MkVj-PkQXePR-Htukpl7g]] from node [pw31xd1SR6Gino4TrbqZpw] to node [b-sQwH5dQQixJ9fN27l2eA]
[2017-08-18T17:18:01,307][INFO ][o.e.c.r.a.AllocationService] [node_az2_084354b0d5] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test360_19][143]] ...]).
...
Any insight would be appreciated.
Thank you!