Hot swappable path.data disks #18279

@PhaedrusTheGreek

When path.data spans multiple physical disks and one of those disks is removed, it seems the system should recover automatically. Currently, search and indexing requests that hit the missing shards throw exceptions, and no allocation or recovery occurs. The only way to bring the data back online is to restart the node or to reinsert the original disk with its existing data.

It would be great if Elasticsearch could:

  • Automatically recover when disks are removed
  • Automatically make use of a newly returned empty disk
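
As a rough illustration of the first point, the missing disk can already be detected from outside the node. This is only a sketch of an external watchdog script, not anything Elasticsearch does today; `check_data_paths` is a hypothetical helper name, and the paths are the ones from my test setup:

```python
import os
from typing import Dict, List


def check_data_paths(paths: List[str]) -> Dict[str, str]:
    """Return 'ok', 'missing', or 'unreadable' for each configured data path."""
    status = {}
    for path in paths:
        if not os.path.isdir(path):
            # The mount point is gone (e.g. the disk was pulled).
            status[path] = "missing"
            continue
        try:
            # Listing the directory forces an actual read from the device.
            os.listdir(path)
            status[path] = "ok"
        except OSError:
            status[path] = "unreadable"
    return status


if __name__ == "__main__":
    # Hypothetical usage: the same entries as in path.data.
    data_paths = ["/Volumes/KINGSTON", "/Volumes/SDCARD"]
    for path, state in check_data_paths(data_paths).items():
        print(path, state)
```

A check like this only says that a path disappeared; deciding what to do next (fail the shards, reallocate, reuse an empty disk) is the part that would have to live in Elasticsearch.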

Steps to Test / Reproduce:

  1. Set up path.data across 2 disks, and start 2 Elasticsearch nodes locally
path.data: ["/Volumes/KINGSTON", "/Volumes/SDCARD"]
  1. Index some data over 5 shards.
index    shard prirep state   docs  store ip        node
test1003 4     r      STARTED    2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 4     p      STARTED    2 10.1kb 127.0.0.1 Vindicator
test1003 3     r      STARTED    6 24.4kb 127.0.0.1 Jacqueline Falsworth
test1003 3     p      STARTED    6 24.5kb 127.0.0.1 Vindicator
test1003 1     r      STARTED   10 40.6kb 127.0.0.1 Jacqueline Falsworth
test1003 1     p      STARTED   10 45.5kb 127.0.0.1 Vindicator
test1003 2     r      STARTED    2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 2     p      STARTED    2 10.1kb 127.0.0.1 Vindicator
test1003 0     r      STARTED    3 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 0     p      STARTED    3 10.1kb 127.0.0.1 Vindicator
  1. Remove the disk that contains most/all of the data

Exceptions start to show in logs

[2016-05-11 11:50:18,961][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
    at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
    at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
    at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
    at org.elasticsearch.index.store.Store.stats(Store.java:321)
    at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
    at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
    at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
    at java.nio.file.Files.newDirectoryStream(Files.java:457)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
    at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
    at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
    at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
    at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
    at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
    ... 18 more
[2016-05-11 11:50:26,796][WARN ][monitor.fs               ] [Vindicator] Failed to fetch fs stats - returning empty instance

However, _cat/shards still reports every shard as STARTED (only the docs and store columns are now empty)

index    shard prirep state   docs store ip        node
test1003 4     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 4     p      STARTED            127.0.0.1 Vindicator
test1003 3     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 3     p      STARTED            127.0.0.1 Vindicator
test1003 1     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 1     p      STARTED            127.0.0.1 Vindicator
test1003 2     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 2     p      STARTED            127.0.0.1 Vindicator
test1003 0     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 0     p      STARTED            127.0.0.1 Vindicator
  1. Post a _refresh

No change

  1. Index some data
{
   "error": {
      "root_cause": [
         {
            "type": "index_failed_engine_exception",
            "reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
            "index_uuid": "01ABN7pTQDCoTa80WMdAvg",
            "shard": "1",
            "index": "test1003"
         }
      ],
      "type": "index_failed_engine_exception",
      "reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
      "index_uuid": "01ABN7pTQDCoTa80WMdAvg",
      "shard": "1",
      "index": "test1003",
      "caused_by": {
         "type": "i_o_exception",
         "reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_a.cfs\") [slice=_a_Lucene50_0.tim]",
         "caused_by": {
            "type": "i_o_exception",
            "reason": "Input/output error"
         }
      }
   },
   "status": 500
}

Logs show an exception

[2016-05-11 11:52:26,911][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
    at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
    at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
    at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
    at org.elasticsearch.index.store.Store.stats(Store.java:321)
    at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
    at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
    at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
    at java.nio.file.Files.newDirectoryStream(Files.java:457)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
    at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
    at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
    at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
    at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
    at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
    ... 18 more

_cat/shards still shows all shards STARTED

index    shard prirep state   docs store ip        node
test1003 4     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 4     p      STARTED            127.0.0.1 Vindicator
test1003 3     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 3     p      STARTED            127.0.0.1 Vindicator
test1003 1     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 1     p      STARTED            127.0.0.1 Vindicator
test1003 2     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 2     p      STARTED            127.0.0.1 Vindicator
test1003 0     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 0     p      STARTED            127.0.0.1 Vindicator
  1. Wait 5 minutes, then search some data:

No change

{
   "took": 15,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 3,
      "failed": 2,
      "failures": [
         {
            "shard": 0,
            "index": "test1003",
            "node": "AMr_NWrVSFCuNV-YCOfsVg",
            "reason": {
               "type": "i_o_exception",
               "reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index/_0.cfs\") [slice=_0.fdt]",
               "caused_by": {
                  "type": "i_o_exception",
                  "reason": "Input/output error"
               }
            }
         },
         {
            "shard": 1,
            "index": "test1003",
            "node": "wK5mnEIaT82Wz3wdTAjv6Q",
            "reason": {
               "type": "i_o_exception",
               "reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/1/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_2.cfs\") [slice=_2.fdt]",
               "caused_by": {
                  "type": "i_o_exception",
                  "reason": "Input/output error"
               }
            }
         }
      ]
   },
   "hits": {
      "total": 23,
      "max_score": 1,
      "hits": []
   }
}

index    shard prirep state   docs store ip        node
test1003 4     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 4     p      STARTED            127.0.0.1 Vindicator
test1003 3     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 3     p      STARTED            127.0.0.1 Vindicator
test1003 1     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 1     p      STARTED            127.0.0.1 Vindicator
test1003 2     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 2     p      STARTED            127.0.0.1 Vindicator
test1003 0     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 0     p      STARTED            127.0.0.1 Vindicator
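
The mismatch in the last step (shards listed as STARTED while searches against them fail) can be checked mechanically. A minimal sketch, assuming the `_cat/shards` text and the search response body have already been fetched; the function names are hypothetical, and the parsing relies only on the first four columns shown above:

```python
from typing import Dict, List, Set, Tuple

ShardKey = Tuple[str, int]  # (index name, shard number)


def started_shards(cat_output: str) -> Set[ShardKey]:
    """Collect (index, shard) pairs that _cat/shards lists as STARTED."""
    shards = set()
    for line in cat_output.splitlines():
        fields = line.split()
        # Expected columns: index, shard, prirep, state, ...
        # The header row is skipped because its shard column is not a number.
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        if fields[3] == "STARTED":
            shards.add((fields[0], int(fields[1])))
    return shards


def failed_shards(search_response: Dict) -> Set[ShardKey]:
    """Collect (index, shard) pairs from _shards.failures in a search response."""
    failures = search_response.get("_shards", {}).get("failures", [])
    return {(f["index"], int(f["shard"])) for f in failures}


def inconsistent_shards(cat_output: str, search_response: Dict) -> List[ShardKey]:
    """Shards reported as STARTED that nevertheless failed during the search."""
    return sorted(started_shards(cat_output) & failed_shards(search_response))


if __name__ == "__main__":
    cat_output = """\
index    shard prirep state   docs store ip        node
test1003 1     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 1     p      STARTED            127.0.0.1 Vindicator
test1003 0     r      STARTED            127.0.0.1 Jacqueline Falsworth
test1003 0     p      STARTED            127.0.0.1 Vindicator
"""
    response = {
        "_shards": {
            "total": 5,
            "successful": 3,
            "failed": 2,
            "failures": [
                {"shard": 0, "index": "test1003"},
                {"shard": 1, "index": "test1003"},
            ],
        }
    }
    print(inconsistent_shards(cat_output, response))
```

With the outputs above, shards 0 and 1 of test1003 are flagged: both copies are reported as STARTED even though the search against them returned an I/O error.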
