Elasticsearch version:
5.0.0-alpha3
JVM version:
The docker image inherits from the official java:8-jre image.
OS version:
Official docker image (https://hub.docker.com/_/elasticsearch/, Debian Jessie) running on a Fedora 23 host.
Description of the problem including expected versus actual behavior:
Shards created by the Monitoring X-Pack plugin are stuck trying to recover. Kibana reports "Elasticsearch is still initializing the Monitoring indices". The logs are given below, notably "marking and sending shard failed due to [failed recovery]" and "IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine];".
This has happened twice (out of two attempts) in a testing environment, following the steps given below. A few things to note:
- This happens only on the shards created by the Monitoring plugin; custom indices and Kibana's indices are properly initialized, and all their data seems to be in place and can be queried.
- When the environment is first set up, all looks ok. The problem occurs when it is shut down and restarted.
- Similar issues point to data corruption, though I'm not sure why it would affect only the Monitoring indices. Related issue: Tight loop logging startup error creating huge log file #2598
- Elasticsearch continuously retries recovery of the index, logging the same entry 1-7 times per second. This produces gigabytes of logs in a short amount of time, and the machine becomes sluggish for quite a while due to high memory usage, possibly caused by the continuous recovery attempts. Pull request Check if the index can be opened and is not corrupted on state listing #11269, released in 1.6, is supposed to fix this, but it still occurs in 5.0.0-alpha3. Related issue: Tried to recover failed shards of index endlessly #11681.
- Even if there is a reason for Elasticsearch to retry several times per second (though not indefinitely), there is no reason to clutter the logs with up to 7 entries per second. Related issue: Entries repeatedly logged with sub-second frequency filling up disk space #19164.
I do not recall having the same problem with Marvel and Elasticsearch 2 when following the same steps.
Any suggestions on how to determine whether this is due to data corruption, and why it affects only the Monitoring plugin indices? Not sure if it is related, but the other indices have a very low write rate since this is a test environment; I am not sure how many writes per second the Monitoring plugin performs.
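For reference, this is the kind of check I run to see which shards are affected. It only uses the standard cluster health and cat APIs; the base URL is just an assumption about how the container's port is published:

```python
# Diagnostic sketch: list indices/shards that are not green and highlight the
# monitoring shards. Assumes Elasticsearch is reachable on localhost:9200 from
# the host (adjust the URL for your Docker setup).
import requests

BASE = "http://localhost:9200"

# Per-index health: red indices have unassigned primary shards.
health = requests.get(f"{BASE}/_cluster/health", params={"level": "indices"}).json()
print("cluster status:", health["status"])
for name, idx in health["indices"].items():
    if idx["status"] != "green":
        print(f"{name}: status={idx['status']}, unassigned_shards={idx['unassigned_shards']}")

# Per-shard view: the stuck .monitoring-* shards show up as INITIALIZING/UNASSIGNED.
for line in requests.get(f"{BASE}/_cat/shards", params={"v": "true"}).text.splitlines():
    if line.startswith(".monitoring") or line.startswith("index"):
        print(line)
```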
Steps to reproduce:
- Initialise a single-node Elasticsearch cluster in a Docker container, connected to another container running Kibana. The X-Pack plugin should be installed in both images.
- Allow some time for the indices to initialize, then load Kibana; everything should look good, including the Monitoring plugin. Use the cluster for a while and send some records to Elasticsearch (a sketch of the test traffic is given after these steps).
- Stop the cluster, for example with "docker-compose stop".
- Restart the cluster. The monitoring indices can't be recovered.
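The test traffic in step 2 is nothing special; it is roughly the following (the index name and document shape are made up for illustration, and any light indexing load behaves the same):

```python
# Rough sketch of the test traffic from step 2. The index name and document
# shape are hypothetical; any light indexing load reproduces the same usage.
import datetime
import requests

BASE = "http://localhost:9200"   # assumption: port published by docker-compose
INDEX = "test-events"            # hypothetical index name

for i in range(100):
    doc = {
        "message": f"test event {i}",
        "@timestamp": datetime.datetime.utcnow().isoformat(),
    }
    requests.post(f"{BASE}/{INDEX}/event", json=doc)  # ES 5.x still uses mapping types

# Snapshot the index list before stopping the cluster, for comparison after restart.
print(requests.get(f"{BASE}/_cat/indices", params={"v": "true"}).text)
```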
Provide logs (if relevant):
{"log":"[2016-07-05 20:56:55,776][INFO ][gateway ] [Flying Tiger] recovered [18] indices into cluster_state\n","stream":"stdout","time":"2016-07-05T20:56:55.776542565Z"}
{"log":"[2016-07-05 20:57:05,422][INFO ][monitor.jvm ] [Flying Tiger] [gc][35] overhead, spent [369ms] collecting in the last [1.2s]\n","stream":"stdout","time":"2016-07-05T20:57:05.468691606Z"}
{"log":"[2016-07-05 20:57:26,236][WARN ][indices.cluster ] [Flying Tiger] [[.monitoring-data-2][0]] marking and sending shard failed due to [failed recovery]\n","stream":"stdout","time":"2016-07-05T20:57:26.238632342Z"}
{"log":"RecoveryFailedException[[.monitoring-data-2][0]: Recovery failed from null into {Flying Tiger}{CgS1ucvZSay5Rjw07LqhMQ}{172.17.0.6}{172.17.0.6:9300}]; nested: IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.238689525Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1369)\n","stream":"stdout","time":"2016-07-05T20:57:26.238706279Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:392)\n","stream":"stdout","time":"2016-07-05T20:57:26.238720854Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n","stream":"stdout","time":"2016-07-05T20:57:26.23873453Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n","stream":"stdout","time":"2016-07-05T20:57:26.238750303Z"}
{"log":"\u0009at java.lang.Thread.run(Thread.java:745)\n","stream":"stdout","time":"2016-07-05T20:57:26.238766104Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.238789176Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.238805723Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:78)\n","stream":"stdout","time":"2016-07-05T20:57:26.238818972Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:123)\n","stream":"stdout","time":"2016-07-05T20:57:26.238834596Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:76)\n","stream":"stdout","time":"2016-07-05T20:57:26.238849856Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1119)\n","stream":"stdout","time":"2016-07-05T20:57:26.238862794Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1365)\n","stream":"stdout","time":"2016-07-05T20:57:26.23887556Z"}
{"log":"\u0009... 4 more\n","stream":"stdout","time":"2016-07-05T20:57:26.238889044Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.23890146Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:139)\n","stream":"stdout","time":"2016-07-05T20:57:26.238917752Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)\n","stream":"stdout","time":"2016-07-05T20:57:26.238934861Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1430)\n","stream":"stdout","time":"2016-07-05T20:57:26.238947889Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1414)\n","stream":"stdout","time":"2016-07-05T20:57:26.238960702Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:928)\n","stream":"stdout","time":"2016-07-05T20:57:26.238972961Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:890)\n","stream":"stdout","time":"2016-07-05T20:57:26.238985065Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:225)\n","stream":"stdout","time":"2016-07-05T20:57:26.239019441Z"}
{"log":"\u0009... 9 more\n","stream":"stdout","time":"2016-07-05T20:57:26.239032857Z"}
{"log":"Caused by: java.io.EOFException\n","stream":"stdout","time":"2016-07-05T20:57:26.239044567Z"}
{"log":"\u0009at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)\n","stream":"stdout","time":"2016-07-05T20:57:26.239056161Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)\n","stream":"stdout","time":"2016-07-05T20:57:26.239068626Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)\n","stream":"stdout","time":"2016-07-05T20:57:26.239084091Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.\u003cinit\u003e(Checkpoint.java:52)\n","stream":"stdout","time":"2016-07-05T20:57:26.23910018Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:81)\n","stream":"stdout","time":"2016-07-05T20:57:26.239113599Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:238)\n","stream":"stdout","time":"2016-07-05T20:57:26.239125431Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.\u003cinit\u003e(Translog.java:180)\n","stream":"stdout","time":"2016-07-05T20:57:26.239137437Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.239149796Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:136)\n","stream":"stdout","time":"2016-07-05T20:57:26.239164805Z"}
{"log":"\u0009... 15 more\n","stream":"stdout","time":"2016-07-05T20:57:26.239179966Z"}
{"log":"[2016-07-05 20:57:26,238][WARN ][cluster.action.shard ] [Flying Tiger] [.monitoring-data-2][0] received shard failed for target shard [[.monitoring-data-2][0], node[CgS1ucvZSay5Rjw07LqhMQ], [P], s[INITIALIZING], a[id=kB0PjIVzSkSdXq_MYghS-A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-07-05T20:56:47.076Z]]], source shard [[.monitoring-data-2][0], node[CgS1ucvZSay5Rjw07LqhMQ], [P], s[INITIALIZING], a[id=kB0PjIVzSkSdXq_MYghS-A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-07-05T20:56:47.076Z]]], message [failed recovery], failure [RecoveryFailedException[[.monitoring-data-2][0]: Recovery failed from null into {Flying Tiger}{CgS1ucvZSay5Rjw07LqhMQ}{172.17.0.6}{172.17.0.6:9300}]; nested: IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]\n","stream":"stdout","time":"2016-07-05T20:57:26.251208298Z"}
{"log":"RecoveryFailedException[[.monitoring-data-2][0]: Recovery failed from null into {Flying Tiger}{CgS1ucvZSay5Rjw07LqhMQ}{172.17.0.6}{172.17.0.6:9300}]; nested: IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.25132659Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1369)\n","stream":"stdout","time":"2016-07-05T20:57:26.251345996Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:392)\n","stream":"stdout","time":"2016-07-05T20:57:26.25136007Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n","stream":"stdout","time":"2016-07-05T20:57:26.251373122Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n","stream":"stdout","time":"2016-07-05T20:57:26.251386727Z"}
{"log":"\u0009at java.lang.Thread.run(Thread.java:745)\n","stream":"stdout","time":"2016-07-05T20:57:26.251399962Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.251413566Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.251446542Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:78)\n","stream":"stdout","time":"2016-07-05T20:57:26.251462264Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:123)\n","stream":"stdout","time":"2016-07-05T20:57:26.251474711Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:76)\n","stream":"stdout","time":"2016-07-05T20:57:26.251487195Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1119)\n","stream":"stdout","time":"2016-07-05T20:57:26.251499122Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1365)\n","stream":"stdout","time":"2016-07-05T20:57:26.25151102Z"}
{"log":"\u0009... 4 more\n","stream":"stdout","time":"2016-07-05T20:57:26.251523225Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.251535417Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:139)\n","stream":"stdout","time":"2016-07-05T20:57:26.251548104Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)\n","stream":"stdout","time":"2016-07-05T20:57:26.251560111Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1430)\n","stream":"stdout","time":"2016-07-05T20:57:26.251572432Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1414)\n","stream":"stdout","time":"2016-07-05T20:57:26.251584408Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:928)\n","stream":"stdout","time":"2016-07-05T20:57:26.251596442Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:890)\n","stream":"stdout","time":"2016-07-05T20:57:26.251608984Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:225)\n","stream":"stdout","time":"2016-07-05T20:57:26.251621031Z"}
{"log":"\u0009... 9 more\n","stream":"stdout","time":"2016-07-05T20:57:26.251633349Z"}
{"log":"Caused by: java.io.EOFException\n","stream":"stdout","time":"2016-07-05T20:57:26.251644844Z"}
{"log":"\u0009at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)\n","stream":"stdout","time":"2016-07-05T20:57:26.251656861Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)\n","stream":"stdout","time":"2016-07-05T20:57:26.251668845Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)\n","stream":"stdout","time":"2016-07-05T20:57:26.251680411Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.\u003cinit\u003e(Checkpoint.java:52)\n","stream":"stdout","time":"2016-07-05T20:57:26.251692468Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:81)\n","stream":"stdout","time":"2016-07-05T20:57:26.251704403Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:238)\n","stream":"stdout","time":"2016-07-05T20:57:26.251719643Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.\u003cinit\u003e(Translog.java:180)\n","stream":"stdout","time":"2016-07-05T20:57:26.251731953Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.251744128Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:136)\n","stream":"stdout","time":"2016-07-05T20:57:26.25175681Z"}
{"log":"\u0009... 15 more\n","stream":"stdout","time":"2016-07-05T20:57:26.251779075Z"}