Elasticsearch version (bin/elasticsearch --version):
# rpm -qa |grep elasticsearch
elasticsearch-5.6.2-1.noarch
Plugins installed:
discovery-ec2
repository-s3
x-pack
JVM version (java -version):
# java -version
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
OS version (uname -a if on a Unix-like system):
Fedora 26
Linux 4.12.14-300.fc26.x86_64 #1 SMP Wed Sep 20 16:28:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
We have about twenty indexes stuck in a red state after restoring a snapshot taken on Elasticsearch 5.4.1 into a brand new cluster running 5.6.2. For this issue I will focus on one index, logstash-2017.09.20.
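For reference, the restore was kicked off against the whole snapshot with the standard restore API, roughly like this (a sketch; the exact request options shouldn't matter for this issue):
$ curl -XPOST 'localhost:9200/_snapshot/my_cool_backup/snapshot_0/_restore?pretty'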
You can see here that the index is in a red state:
# curl -XGET 'localhost:9200/_cluster/health/logstash-2017.09.20?level=shards&pretty'
{
"cluster_name" : "redacted",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 11,
"number_of_data_nodes" : 5,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.60064585575888,
"indices" : {
"logstash-2017.09.20" : {
"status" : "red",
"number_of_shards" : 5,
"number_of_replicas" : 0,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"shards" : {
"0" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"1" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"2" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"3" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"4" : {
"status" : "red",
"primary_active" : false,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1
}
}
}
}
}
You can see that the snapshot itself reports a state of SUCCESS:
# curl -XGET 'localhost:9200/_snapshot/my_cool_backup/snapshot_0?pretty'
{
"snapshots" : [
{
"snapshot" : "snapshot_0",
"uuid" : "e_wavyGfTD-SwXC-imkF0g",
"version_id" : 5040199,
"version" : "5.4.1",
"indices" : [
** SNIP **
],
"state" : "SUCCESS",
"start_time" : "2017-09-27T07:00:01.807Z",
"start_time_in_millis" : 1506495601807,
"end_time" : "2017-09-27T08:44:35.377Z",
"end_time_in_millis" : 1506501875377,
"duration_in_millis" : 6273570,
"failures" : [ ],
"shards" : {
"total" : 929,
"failed" : 0,
"successful" : 929
}
}
]
}
Looking at the snapshot status in detail for the example index, every shard is reported in the DONE stage.
$ curl -XGET 'localhost:9200/_snapshot/my_cool_backup/snapshot_0/_status?pretty'
"snapshots" : [
{
"snapshot" : "snapshot_0",
"repository" : "my_cool_backup",
"uuid" : "e_wavyGfTD-SwXC-imkF0g",
"state" : "SUCCESS",
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 929,
"failed" : 0,
"total" : 929
},
"stats" : {
"number_of_files" : 2364,
"processed_files" : 2364,
"total_size_in_bytes" : 15393945691,
"processed_size_in_bytes" : 15393945691,
"start_time_in_millis" : 1506495618226,
"time_in_millis" : 6252967
},
"indices" : {
"logstash-2017.09.20" : {
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 5,
"failed" : 0,
"total" : 5
},
"stats" : {
"number_of_files" : 31,
"processed_files" : 31,
"total_size_in_bytes" : 168664,
"processed_size_in_bytes" : 168664,
"start_time_in_millis" : 1506495678150,
"time_in_millis" : 2401656
},
"shards" : {
"0" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 7,
"processed_files" : 7,
"total_size_in_bytes" : 118135,
"processed_size_in_bytes" : 118135,
"start_time_in_millis" : 1506495720316,
"time_in_millis" : 1949
}
},
"1" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 16,
"processed_files" : 16,
"total_size_in_bytes" : 33918,
"processed_size_in_bytes" : 33918,
"start_time_in_millis" : 1506495722992,
"time_in_millis" : 2804
}
},
"2" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 0,
"processed_files" : 0,
"total_size_in_bytes" : 0,
"processed_size_in_bytes" : 0,
"start_time_in_millis" : 1506498067865,
"time_in_millis" : 11941
}
},
"3" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 4,
"processed_files" : 4,
"total_size_in_bytes" : 8434,
"processed_size_in_bytes" : 8434,
"start_time_in_millis" : 1506495678150,
"time_in_millis" : 1206
}
},
"4" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 4,
"processed_files" : 4,
"total_size_in_bytes" : 8177,
"processed_size_in_bytes" : 8177,
"start_time_in_millis" : 1506495684287,
"time_in_millis" : 1164
}
}
}
}
Looking at /_cat/recovery, the recoveries it lists are done too (only shards 0 through 3 appear; the failed shard 4 is not listed):
# curl -XGET localhost:9200/_cat/recovery|grep logstash-2017.09.20
logstash-2017.09.20 0 7.9s snapshot done n/a n/a redacted data-03 my_cool_backup snapshot_0 1 1 100.0% 109 1699 1699 100.0% 2911728303 0 0 100.0%
logstash-2017.09.20 1 14.5m snapshot done n/a n/a redacted data-04 my_cool_backup snapshot_0 136 136 100.0% 136 2842065772 2842065772 100.0% 2842065772 0 0 100.0%
logstash-2017.09.20 2 1.7s snapshot done n/a n/a redacted data-00 my_cool_backup snapshot_0 1 1 100.0% 109 1699 1699 100.0% 2889504028 0 0 100.0%
logstash-2017.09.20 3 13.9m snapshot done n/a n/a redacted data-02 my_cool_backup snapshot_0 127 127 100.0% 127 2929823683 2929823683 100.0% 2929823683 0 0 100.0%
But if you try to close the index, it says that it is still being restored:
$ curl -XPOST 'localhost:9200/logstash-2017.09.20/_close?pretty'
{
"error" : {
"root_cause" : [
{
"type" : "remote_transport_exception",
"reason" : "[master-01][redacted:9300][indices:admin/close]"
}
],
"type" : "illegal_argument_exception",
"reason" : "Cannot close indices that are being restored: [[logstash-2017.09.20/crXjrjtwTEqkK6_ITG1HVQ]]"
},
"status" : 400
}
Looking in the logs, the shard failed to recover because a file it was restoring already exists:
[2017-10-02T19:50:28,790][WARN ][o.e.c.a.s.ShardStateAction] [master-01] [logstash-2017.09.20][4] received shard failed for shard id [[logstash-2017.09.20][4]], allocation id [lW_4BSVGSc6phnI1vLEPWg], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-2017.09.20][4]: Recovery failed on {data-02}{Af43AKvBRf6r-PTr2s9KRg}{O1R6sKwAQK2FyYYmdFLjPA}{redacted}{redacted:9300}{aws_availability_zone=us-west-2c, ml.max_open_jobs=10, ml.enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snapshot_0/e_wavyGfTD-SwXC-imkF0g]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: FileAlreadyExistsException[/var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-2017.09.20][4]: Recovery failed on {data-02}{Af43AKvBRf6r-PTr2s9KRg}{O1R6sKwAQK2FyYYmdFLjPA}{redacted}{redacted:9300}{aws_availability_zone=us-west-2c, ml.max_open_jobs=10, ml.enabled=true}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1511) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.6.2.jar:5.6.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_141]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:299) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:405) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [snapshot_0/e_wavyGfTD-SwXC-imkF0g]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:993) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1679) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:991) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: java.nio.file.FileAlreadyExistsException: /var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) ~[?:?]
at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434) ~[?:1.8.0_141]
at java.nio.file.Files.newOutputStream(Files.java:216) ~[?:1.8.0_141]
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:413) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:409) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:253) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.RateLimitedFSDirectory.createOutput(RateLimitedFSDirectory.java:40) ~[elasticsearch-5.6.2.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:73) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.elasticsearch.index.store.Store.createVerifyingOutput(Store.java:463) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restoreFile(BlobStoreRepository.java:1734) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1676) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:991) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
And the file that the exception says already exists is not actually present on the data node:
# ll /var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si
ls: cannot access '/var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si': No such file or directory
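For anyone digging into this, the allocation explain API for that shard should show more detail on why it stays unassigned; I'm including the call only as a sketch (output not captured here):
$ curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -d '
{
  "index": "logstash-2017.09.20",
  "shard": 4,
  "primary": true
}'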
The only way I have been able to get the cluster out of this hung state is to do a full cluster shutdown and start it back up again. From there I am able to close the red indexes and retry the restore (sketched below). When I first encountered this issue, I had ~20 indexes that failed to restore; after retrying them with the process above, all but seven restored successfully. The remaining failures are stuck in the same state.
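For completeness, the retry after the full cluster restart looks roughly like this for the example index (a sketch; repository and snapshot names as above):
$ curl -XPOST 'localhost:9200/logstash-2017.09.20/_close?pretty'
$ curl -XPOST 'localhost:9200/_snapshot/my_cool_backup/snapshot_0/_restore?pretty' -d '
{
  "indices": "logstash-2017.09.20"
}'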