Restoring a snapshot from S3 to 5.6.2 results in a hung and incomplete restore.  #26865

@jdoss

Description

Elasticsearch version (bin/elasticsearch --version):

# rpm -qa |grep elasticsearch
elasticsearch-5.6.2-1.noarch

Plugins installed:

discovery-ec2
repository-s3
x-pack

JVM version (java -version):

# java -version
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)

OS version (uname -a if on a Unix-like system):

Fedora 26
Linux 4.12.14-300.fc26.x86_64 #1 SMP Wed Sep 20 16:28:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We have about twenty indices stuck in a red state after restoring a snapshot taken on an Elasticsearch 5.4.1 cluster to a brand new cluster running 5.6.2. For this issue, I will focus on one index, logstash-2017.09.20.
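For context, the restore was started with the standard snapshot restore API against the repository and snapshot shown below. Roughly like this (illustrative sketch; the body and index pattern are an assumption, not the verbatim request we ran):

$ curl -XPOST 'localhost:9200/_snapshot/my_cool_backup/snapshot_0/_restore?pretty' -d'
{
  "indices": "logstash-*",
  "ignore_unavailable": true
}'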

You can see here that the index is in a red state:

# curl -XGET 'localhost:9200/_cluster/health/logstash-2017.09.20?level=shards&pretty'
{
  "cluster_name" : "redacted",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.60064585575888,
  "indices" : {
    "logstash-2017.09.20" : {
      "status" : "red",
      "number_of_shards" : 5,
      "number_of_replicas" : 0,
      "active_primary_shards" : 4,
      "active_shards" : 4,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 1,
      "shards" : {
        "0" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "1" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "2" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "3" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "4" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 1
        }
      }
    }
  }
}

You can see that the snapshot itself reports a state of SUCCESS:

# curl -XGET 'localhost:9200/_snapshot/my_cool_backup/snapshot_0?pretty'
{
  "snapshots" : [
    {
      "snapshot" : "snapshot_0",
      "uuid" : "e_wavyGfTD-SwXC-imkF0g",
      "version_id" : 5040199,
      "version" : "5.4.1",
      "indices" : [
        ** SNIP **
      ],
      "state" : "SUCCESS",
      "start_time" : "2017-09-27T07:00:01.807Z",
      "start_time_in_millis" : 1506495601807,
      "end_time" : "2017-09-27T08:44:35.377Z",
      "end_time_in_millis" : 1506501875377,
      "duration_in_millis" : 6273570,
      "failures" : [ ],
      "shards" : {
        "total" : 929,
        "failed" : 0,
        "successful" : 929
      }
    }
  ]
}

Looking at the snapshot status in detail for the example index, every shard is reported as DONE:

$ curl -XGET 'localhost:9200/_snapshot/my_cool_backup/snapshot_0/_status?pretty'
"snapshots" : [
    {
      "snapshot" : "snapshot_0",
      "repository" : "my_cool_backup",
      "uuid" : "e_wavyGfTD-SwXC-imkF0g",
      "state" : "SUCCESS",
      "shards_stats" : {
        "initializing" : 0,
        "started" : 0,
        "finalizing" : 0,
        "done" : 929,
        "failed" : 0,
        "total" : 929
      },
      "stats" : {
        "number_of_files" : 2364,
        "processed_files" : 2364,
        "total_size_in_bytes" : 15393945691,
        "processed_size_in_bytes" : 15393945691,
        "start_time_in_millis" : 1506495618226,
        "time_in_millis" : 6252967
      },
      "indices" : {
        "logstash-2017.09.20" : {
                  "shards_stats" : {
                    "initializing" : 0,
                    "started" : 0,
                    "finalizing" : 0,
                    "done" : 5,
                    "failed" : 0,
                    "total" : 5
                  },
                  "stats" : {
                    "number_of_files" : 31,
                    "processed_files" : 31,
                    "total_size_in_bytes" : 168664,
                    "processed_size_in_bytes" : 168664,
                    "start_time_in_millis" : 1506495678150,
                    "time_in_millis" : 2401656
                  },
                  "shards" : {
                    "0" : {
                      "stage" : "DONE",
                      "stats" : {
                        "number_of_files" : 7,
                        "processed_files" : 7,
                        "total_size_in_bytes" : 118135,
                        "processed_size_in_bytes" : 118135,
                        "start_time_in_millis" : 1506495720316,
                        "time_in_millis" : 1949
                      }
                    },
                    "1" : {
                      "stage" : "DONE",
                      "stats" : {
                        "number_of_files" : 16,
                        "processed_files" : 16,
                        "total_size_in_bytes" : 33918,
                        "processed_size_in_bytes" : 33918,
                        "start_time_in_millis" : 1506495722992,
                        "time_in_millis" : 2804
                      }
                    },
                    "2" : {
                      "stage" : "DONE",
                      "stats" : {
                        "number_of_files" : 0,
                        "processed_files" : 0,
                        "total_size_in_bytes" : 0,
                        "processed_size_in_bytes" : 0,
                        "start_time_in_millis" : 1506498067865,
                        "time_in_millis" : 11941
                      }
                    },
                    "3" : {
                      "stage" : "DONE",
                      "stats" : {
                        "number_of_files" : 4,
                        "processed_files" : 4,
                        "total_size_in_bytes" : 8434,
                        "processed_size_in_bytes" : 8434,
                        "start_time_in_millis" : 1506495678150,
                        "time_in_millis" : 1206
                      }
                    },
                    "4" : {
                      "stage" : "DONE",
                      "stats" : {
                        "number_of_files" : 4,
                        "processed_files" : 4,
                        "total_size_in_bytes" : 8177,
                        "processed_size_in_bytes" : 8177,
                        "start_time_in_millis" : 1506495684287,
                        "time_in_millis" : 1164
                      }
                    }
                  }
                }

Looking at /_cat/recovery, the restore is reported as done there too:

# curl -XGET localhost:9200/_cat/recovery|grep logstash-2017.09.20

logstash-2017.09.20         0 7.9s  snapshot       done n/a n/a redacted data-03 my_cool_backup snapshot_0 1   1   100.0% 109 1699       1699       100.0% 2911728303 0 0 100.0%
logstash-2017.09.20         1 14.5m snapshot       done n/a n/a redacted  data-04 my_cool_backup snapshot_0 136 136 100.0% 136 2842065772 2842065772 100.0% 2842065772 0 0 100.0%
logstash-2017.09.20         2 1.7s  snapshot       done n/a n/a redacted data-00 my_cool_backup snapshot_0 1   1   100.0% 109 1699       1699       100.0% 2889504028 0 0 100.0%
logstash-2017.09.20         3 13.9m snapshot       done n/a n/a redacted data-02 my_cool_backup snapshot_0 127 127 100.0% 127 2929823683 2929823683 100.0% 2929823683 0 0 100.0%

But if you try to close the index, it says it is still being restored:

$ curl -XPOST 'localhost:9200/logstash-2017.09.20/_close?pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[master-01][redacted:9300][indices:admin/close]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Cannot close indices that are being restored: [[logstash-2017.09.20/crXjrjtwTEqkK6_ITG1HVQ]]"
  },
  "status" : 400
}
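Presumably the cluster state still holds a restore entry for this index. Dumping the full cluster state and grepping around the restore section should show the stuck entry (assuming it appears under a restore key in the cluster state output; I have not captured that output here):

$ curl -s -XGET 'localhost:9200/_cluster/state?pretty' | grep -B 2 -A 30 '"restore'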

Looking in the logs, the shard failed to recover because a file already exists:

[2017-10-02T19:50:28,790][WARN ][o.e.c.a.s.ShardStateAction] [master-01] [logstash-2017.09.20][4] received shard failed for shard id [[logstash-2017.09.20][4]], allocation id [lW_4BSVGSc6phnI1vLEPWg], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-2017.09.20][4]: Recovery failed on {data-02}{Af43AKvBRf6r-PTr2s9KRg}{O1R6sKwAQK2FyYYmdFLjPA}{redacted}{redacted:9300}{aws_availability_zone=us-west-2c, ml.max_open_jobs=10, ml.enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snapshot_0/e_wavyGfTD-SwXC-imkF0g]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: FileAlreadyExistsException[/var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-2017.09.20][4]: Recovery failed on {data-02}{Af43AKvBRf6r-PTr2s9KRg}{O1R6sKwAQK2FyYYmdFLjPA}{redacted}{redacted:9300}{aws_availability_zone=us-west-2c, ml.max_open_jobs=10, ml.enabled=true}
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1511) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.6.2.jar:5.6.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:299) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
        ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
        at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:405) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
        ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [snapshot_0/e_wavyGfTD-SwXC-imkF0g]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:993) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
        ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1679) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:991) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
        ... 4 more
Caused by: java.nio.file.FileAlreadyExistsException: /var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) ~[?:?]
        at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434) ~[?:1.8.0_141]
        at java.nio.file.Files.newOutputStream(Files.java:216) ~[?:1.8.0_141]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:413) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:409) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:253) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
        at org.apache.lucene.store.RateLimitedFSDirectory.createOutput(RateLimitedFSDirectory.java:40) ~[elasticsearch-5.6.2.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
        at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:73) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
        at org.elasticsearch.index.store.Store.createVerifyingOutput(Store.java:463) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restoreFile(BlobStoreRepository.java:1734) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1676) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:991) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
        ... 4 more

But the file it claims already exists is not actually present on the data node:

# ll /var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si
ls: cannot access '/var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si': No such file or directory

The only way I have been able to get the cluster out of this hung state is to do a full cluster shutdown and start it back up again. From there I am able to close these red indices and retry the restore. When I first encountered this issue, I had ~20 indices that failed to restore. After retrying the restore for these failures with the process above, I was able to get all but seven of them restored. The remaining failures are in the same state.
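For reference, the per-index retry after the full restart looks roughly like this, using the example index from above (the restore body is illustrative):

$ curl -XPOST 'localhost:9200/logstash-2017.09.20/_close?pretty'

$ curl -XPOST 'localhost:9200/_snapshot/my_cool_backup/snapshot_0/_restore?pretty' -d'
{
  "indices": "logstash-2017.09.20"
}'

$ curl -XGET 'localhost:9200/_cat/recovery?active_only=true' | grep logstash-2017.09.20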
