
Shards get stuck initializing on a large cluster (5.5.1) #26293

@gndcshv

Description


Elasticsearch version:

"version": {
  "number": "5.5.1",
  "build_hash": "19c13d0",
  "build_date": "2017-07-18T20:44:24.823Z",
  "build_snapshot": false,
  "lucene_version": "6.6.0"
}

Plugins installed: [analysis-icu, analysis-kuromoji, mapper-murmur3, repository-s3, custom-discovery]

JVM version (java -version):

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

OS version (uname -a):

Linux ... 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem and steps to reproduce

In this particular case I have a cluster with 3 dedicated master nodes and 120 data nodes, and I create an index with 360 shards and 2 replicas. It is possible to reproduce this with fewer shards; it just takes longer.

After several successful creations of 360-shard indices, you eventually get an index with a shard stuck in the INITIALIZING state. After ~15 minutes the problematic shard gets unstuck, is allocated, and starts.
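While the index is in this state, the stuck shard can be watched with the cat shards API, and the allocation explain API gives the allocation view of it; something along these lines (shard 143 is the one from the run shown in the logs below):

GET _cat/shards/test360_19?v&h=index,shard,prirep,state,node&s=state

GET _cluster/allocation/explain
{
  "index": "test360_19",
  "shard": 143,
  "primary": false
}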

Cluster configuration

Here's the index template that is applied to these indices:

{
  "order": 0,
  "template": "test360_*",
  "settings": {
    "index": {
      "refresh_interval": "15s",
      "number_of_shards": "360",
      "number_of_replicas": "2"
    }
  },
  "mappings": {}
}
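For completeness, the template is registered through the index template API, roughly like this (the template name test360 here is just illustrative, the body is the one above):

PUT _template/test360
{
  "order": 0,
  "template": "test360_*",
  "settings": {
    "index": {
      "refresh_interval": "15s",
      "number_of_shards": "360",
      "number_of_replicas": "2"
    }
  },
  "mappings": {}
}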

Here's the request I send to create an index (the index is auto-created when the first document is indexed):

POST /test360_19/default
{
  "msg": "hello"
}
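A health request scoped to the index is enough to notice when it does not go green; something like (the timeout value is arbitrary):

GET _cluster/health/test360_19?wait_for_status=green&timeout=60s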

elasticsearch.yml:

bootstrap.memory_lock: true
node.max_local_storage_nodes: 1
action.destructive_requires_name: true
network.bind_host:
- _local_
- _global_
network.publish_host: _global_
transport.tcp.port: 9002
http.port: 9004
discovery.zen.ping_timeout: 30s
discovery.zen.publish_timeout: 60s
discovery.zen.fd.ping_interval: 30s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 10
discovery.zen.minimum_master_nodes: 2
script.engine.painless.inline: true
script.engine.painless.stored.search: true
cluster.routing.rebalance.enable: all
cluster.routing.allocation.cluster_concurrent_rebalance: 4
cluster.routing.allocation.node_concurrent_recoveries: 4
cluster.routing.allocation.disk.watermark.low: 80%
cluster.routing.allocation.disk.watermark.high: 85%
thread_pool.bulk.queue_size: 2000
thread_pool.index.queue_size: 2000
thread_pool.search.queue_size: 1000
logger.org.elasticsearch.indices.recovery: DEBUG
logger.org.elasticsearch.cluster.routing: DEBUG
logger.org.elasticsearch.cluster.action: DEBUG
logger.org.elasticsearch.cluster.service: DEBUG
logger.org.elasticsearch.indices.cluster: DEBUG
reindex.remote.whitelist: '*:9004'
cluster.name: es_cluster_120
node.name: node_az3_006a1fb63b6
path.data: /data/es
path.logs: /logs/es
http.cors.enabled: true
http.cors.allow-origin: '*'
discovery.zen.hosts_provider: custom-discovery
cluster.routing.allocation.awareness.attributes: rack_id
node.attr.rack_id: az3
node.master: false
node.data: true
node.ingest: false

That is the configuration for the data nodes; the master nodes have node.master: true and node.data: false.
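If useful, the effective per-node settings can be dumped with the nodes info API, e.g. (the second request filters to the data node from the logs below):

GET _nodes/settings?pretty
GET _nodes/node_az3_006a1fb63b6/settings?pretty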

Additional settings:

{
    "persistent": {},
    "transient": {
        "logger": {
            "org": {
                "elasticsearch": {
                    "TransportService": {
                        "tracer": "TRACE"
                    }
                }
            }
        },
        "transport": {
            "tracer": {
                "include": "internal:index/shard/*"
            }
        }
    }
}
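These were set through the cluster settings API; in flat-key form the request would look roughly like this (the values mirror the output above):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.TransportService.tracer": "TRACE",
    "transport.tracer.include": "internal:index/shard/*"
  }
}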

jvm.options:

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms15338M
-Xmx15338M

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly


## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# force the server VM (remove on 32-bit client JVMs)
-server

# explicitly set the stack size (reduce to 320k on 32-bit client JVMs)
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# use old-style file permissions on JDK9
-Djdk.io.permissionsUseCanonicalPath=true

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps
# ensure the directory exists and has sufficient space
-XX:HeapDumpPath=/logs/es/es_heap_dump.hprof

## GC logging

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime

# log GC status to a file with time stamps
# ensure the directory exists
-Xloggc:/logs/es/es_gc-%t.log

# By default, the GC log file will not rotate.
# By uncommenting the lines below, the GC log file
# will be rotated every 128MB at most 32 times.
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=16
-XX:GCLogFileSize=128M

# Elasticsearch 5.0.0 will throw an exception on unquoted field names in JSON.
# If documents were already indexed with unquoted fields in a previous version
# of Elasticsearch, some operations may throw errors.
#
# WARNING: This option will be removed in Elasticsearch 6.0.0 and is provided
# only for migration purposes.
#-Delasticsearch.json.allow_unquoted_field_names=true

Command line of the actual process:

elastic+ 21904     1 10 Aug17 ?        01:57:51 /usr/bin/java -Xms15338M -Xmx15338M -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/logs/es/es_heap_dump.hprof -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/logs/es/es_gc-%t.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=16 -XX:GCLogFileSize=128M -Des.path.home=/apps/elasticsearch -cp /apps/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /run/elasticsearch/elasticsearch.pid --quiet

Process limits:

$ cat /proc/21904/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1967818              1967818              processes
Max open files            100000               100000               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1967818              1967818              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Logs

There aren't many mentions of the problematic shard in the logs; however, when it gets unstuck I consistently observe the following exception on both the relevant data node and the active master.

Data node that had the problematic shard:

...
[2017-08-18T17:02:57,201][DEBUG][o.e.c.s.ClusterService   ] [node_az3_006a1fb63b6] applying cluster state version 1171
[2017-08-18T17:02:57,201][DEBUG][o.e.c.s.ClusterService   ] [node_az3_006a1fb63b6] set local cluster state to version 1171
[2017-08-18T17:02:57,212][DEBUG][o.e.c.s.ClusterService   ] [node_az3_006a1fb63b6] processing [zen-disco-receive(from master [master {node_az2_084354b0d5}{86-R436YR56aHOtxCptzjg}{YSViwD-tQhSJqe3y9w6dWw}{100.0.0.2}{100.0.0.2:9002}{rack_id=az2} committed version [1171]])]: took [11ms] done applying updated cluster_state (version: 1171, uuid: -U_RCKbTSIG6kiRDtuWSmg)
[2017-08-18T17:17:55,337][TRACE][o.e.t.T.tracer           ] [node_az3_006a1fb63b6] [3262984][internal:index/shard/recovery/start_recovery] received response from [{node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2}]
[2017-08-18T17:17:55,338][WARN ][o.e.i.c.IndicesClusterStateService] [node_az3_006a1fb63b6] [[test360_19][143]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:314) [elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73) [elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556) [elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.1.jar:5.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node_az2_0ff81340d0bc][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
	... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [1] files with total size of [162b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
	... 5 more
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:951) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.5.1.jar:5.5.1]
	... 3 more
[2017-08-18T17:17:55,339][DEBUG][o.e.c.a.s.ShardStateAction] [node_az3_006a1fb63b6] [test360_19][143] sending [internal:cluster/shard/failure] to [86-R436YR56aHOtxCptzjg] for shard entry [shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]
[2017-08-18T17:17:55,619][DEBUG][o.e.c.s.ClusterService   ] [node_az3_006a1fb63b6] processing [zen-disco-receive(from master [master {node_az2_084354b0d5}{86-R436YR56aHOtxCptzjg}{YSViwD-tQhSJqe3y9w6dWw}{100.0.0.2}{100.0.0.2:9002}{rack_id=az2} committed version [1172]])]: execute
[2017-08-18T17:17:55,619][DEBUG][o.e.c.s.ClusterService   ] [node_az3_006a1fb63b6] cluster state updated, version [1172], source [zen-disco-receive(from master [master {node_az2_084354b0d5}{86-R436YR56aHOtxCptzjg}{YSViwD-tQhSJqe3y9w6dWw}{100.0.0.2}{100.0.0.2:9002}{rack_id=az2} committed version [1172]])]
[2017-08-18T17:17:55,620][DEBUG][o.e.c.s.ClusterService   ] [node_az3_006a1fb63b6] applying cluster state version 1172
[2017-08-18T17:17:55,620][DEBUG][o.e.c.s.ClusterService   ] [node_az3_006a1fb63b6] set local cluster state to version 1172
...

Even though this is a connection-related exception, I don't see any other exceptions that would indicate a genuine network issue between the two nodes.
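While the shard is stuck, in-flight recoveries for the index can also be inspected with the recovery API, e.g.:

GET test360_19/_recovery?active_only=true&detailed=true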

Active master:

...
[2017-08-18T17:02:56,068][DEBUG][o.e.c.a.s.ShardStateAction] [us-east-1d.i-086432e5a1d4b01d5] [test360_19][143] starting shard [test360_19][143], node[ljMLTQZbRce8245_yhGLGA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=FBeIYPFeSWq8HwuzBwdxQw], unassigned_info[[reason=INDEX_CREATED], at[2017-08-18T17:02:53.615Z], delayed=false, allocation_status[no_attempt]] (shard started task: [shard id [[test360_19][143]], allocation id [FBeIYPFeSWq8HwuzBwdxQw], primary term [0], message [after peer recovery]])
...

[2017-08-18T17:02:57,189][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] processing [put-mapping[default]]: execute
[2017-08-18T17:02:57,192][INFO ][o.e.c.m.MetaDataMappingService] [node_az2_084354b0d5] [test360_19/Okz9GdoYTgqnBBDr0KDRdQ] create_mapping [default]
[2017-08-18T17:02:57,192][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] cluster state updated, version [1171], source [put-mapping[default]]
[2017-08-18T17:02:57,192][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] publishing cluster state version [1171]
[2017-08-18T17:02:57,303][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] applying cluster state version 1171
[2017-08-18T17:02:57,303][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] set local cluster state to version 1171
[2017-08-18T17:02:57,315][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] processing [put-mapping[default]]: took [126ms] done applying updated cluster_state (version: 1171, uuid: -U_RCKbTSIG6kiRDtuWSmg)
[2017-08-18T17:17:55,340][WARN ][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] received shard failed for shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:314) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.1.jar:5.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
	... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [1] files with total size of [162b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
	... 5 more
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:951) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.5.1.jar:5.5.1]
	... 3 more
[2017-08-18T17:17:55,342][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] processing [shard-failed[shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]]: execute
[2017-08-18T17:17:55,342][DEBUG][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] failing shard [test360_19][143], node[b-sQwH5dQQixJ9fN27l2eA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uLI9lUk3To2entCZ-KQcAw], unassigned_info[[reason=INDEX_CREATED], at[2017-08-18T17:02:53.615Z], delayed=false, allocation_status[no_attempt]] (shard failed task: [shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]])
[2017-08-18T17:17:55,348][DEBUG][o.e.c.r.a.AllocationService] [node_az2_084354b0d5] [test360_19][143] failing shard [test360_19][143], node[b-sQwH5dQQixJ9fN27l2eA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uLI9lUk3To2entCZ-KQcAw], unassigned_info[[reason=INDEX_CREATED], at[2017-08-18T17:02:53.615Z], delayed=false, allocation_status[no_attempt]] with unassigned info ([reason=ALLOCATION_FAILED], at[2017-08-18T17:17:55.348Z], failed_attempts[1], delayed=false, details[failed recovery, failure RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ], allocation_status[no_attempt])
[2017-08-18T17:17:55,582][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node_az2_084354b0d5] skipping rebalance due to in-flight shard/store fetches
[2017-08-18T17:17:55,600][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] cluster state updated, version [1172], source [shard-failed[shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]]
[2017-08-18T17:17:55,600][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] publishing cluster state version [1172]
[2017-08-18T17:17:55,843][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] applying cluster state version 1172
[2017-08-18T17:17:55,843][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] set local cluster state to version 1172
[2017-08-18T17:17:55,852][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] processing [shard-failed[shard id [[test360_19][143]], allocation id [uLI9lUk3To2entCZ-KQcAw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ]]]: took [510ms] done applying updated cluster_state (version: 1172, uuid: I7EG7cqgTKGbqp2thv2ZKw)
[2017-08-18T17:17:55,852][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] processing [cluster_reroute(async_shard_fetch)]: execute
[2017-08-18T17:17:56,155][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] cluster state updated, version [1173], source [cluster_reroute(async_shard_fetch)]
[2017-08-18T17:17:56,156][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] publishing cluster state version [1173]
[2017-08-18T17:17:56,205][DEBUG][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] received shard started for [shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery]]
[2017-08-18T17:17:56,459][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] applying cluster state version 1173
[2017-08-18T17:17:56,459][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] set local cluster state to version 1173
[2017-08-18T17:17:56,467][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] processing [cluster_reroute(async_shard_fetch)]: took [614ms] done applying updated cluster_state (version: 1173, uuid: F8Vn_O1RSV-A4I1deTv_8Q)
[2017-08-18T17:17:56,467][DEBUG][o.e.c.s.ClusterService   ] [node_az2_084354b0d5] processing [shard-started shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery][shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery]]]: execute
[2017-08-18T17:17:56,467][DEBUG][o.e.c.a.s.ShardStateAction] [node_az2_084354b0d5] [test360_19][143] starting shard [test360_19][143], node[pw31xd1SR6Gino4TrbqZpw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uAUMBccQRji9i6oeO9yCSA], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-08-18T17:17:55.348Z], failed_attempts[1], delayed=false, details[failed recovery, failure RecoveryFailedException[[test360_19][143]: Recovery failed from {node_az2_f79f3497d0}{CTBV-A2tT5ysQ7uVX1vjPw}{4IMXO6pHQ_SNQS5mSzF6kg}{100.0.0.3}{100.0.0.3:9002}{rack_id=az2} into {node_az3_006a1fb63b6}{b-sQwH5dQQixJ9fN27l2eA}{NBnUGKW7R2CUXlI7byGm5g}{100.0.0.1}{100.0.0.1:9002}{rack_id=az3}]; nested: RemoteTransportException[[node_az2_f79f3497d0][100.0.0.3:9002][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [162b]]; nested: ReceiveTimeoutTransportException[[node_az3_006a1fb63b6][100.0.0.1:9002][internal:index/shard/recovery/prepare_translog] request_id [1106203] timed out after [900000ms]]; ], allocation_status[no_attempt]], expected_shard_size[162] (shard started task: [shard id [[test360_19][143]], allocation id [uAUMBccQRji9i6oeO9yCSA], primary term [0], message [after peer recovery]])
[2017-08-18T17:17:57,044][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node_az2_084354b0d5] Relocate shard [[test360_19][1], node[pw31xd1SR6Gino4TrbqZpw], [R], s[STARTED], a[id=7MkVj-PkQXePR-Htukpl7g]] from node [pw31xd1SR6Gino4TrbqZpw] to node [b-sQwH5dQQixJ9fN27l2eA]
[2017-08-18T17:18:01,307][INFO ][o.e.c.r.a.AllocationService] [node_az2_084354b0d5] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test360_19][143]] ...]).
...

Any insight would be appreciated.
Thank you!
