Status: Closed
Labels: :Core/Infra/Resiliency, >enhancement, resiliency
Description
When path.data spans multiple physical disks and one disk is removed, the system should recover automatically. Currently, search and indexing requests that touch the missing shards throw exceptions, and no re-allocation or recovery occurs. The only way to bring the data back online is to restart the node, or to reinsert the original disk with its existing data.
It would be great if Elasticsearch could:
- Automatically recover when disks are removed
- Automatically make use of a newly returned empty disk
Steps to Test / Reproduce:
- Set up path.data over 2 disks, and start 2 Elasticsearch nodes locally:
  path.data: ["/Volumes/KINGSTON", "/Volumes/SDCARD"]
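A minimal local setup for this step might look like the following sketch. The paths are the ones from this reproduction; the install directory and the way the second node is started are assumptions and will differ per machine:

```shell
# Stripe the data path over the two removable disks
# (exact values from the reproduction above).
cat >> config/elasticsearch.yml <<'EOF'
path.data: ["/Volumes/KINGSTON", "/Volumes/SDCARD"]
EOF

# Start two local nodes from the same installation;
# the second node picks the next free node slot automatically.
./bin/elasticsearch -d
./bin/elasticsearch -d
```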
- Index some data over 5 shards.
index shard prirep state docs store ip node
test1003 4 r STARTED 2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 2 10.1kb 127.0.0.1 Vindicator
test1003 3 r STARTED 6 24.4kb 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 6 24.5kb 127.0.0.1 Vindicator
test1003 1 r STARTED 10 40.6kb 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 10 45.5kb 127.0.0.1 Vindicator
test1003 2 r STARTED 2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 2 10.1kb 127.0.0.1 Vindicator
test1003 0 r STARTED 3 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 3 10.1kb 127.0.0.1 Vindicator
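For reference, the index and the test documents can be created with commands along these lines. The index name test1003 comes from the output above; the document type name doc and the document bodies are illustrative, not from the original report:

```shell
# Create the index with 5 primary shards and 1 replica each.
curl -XPUT 'localhost:9200/test1003' -d '{
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}'

# Index a handful of documents so every shard receives some data.
for i in $(seq 1 23); do
  curl -XPOST 'localhost:9200/test1003/doc' -d "{\"field\": \"value $i\"}"
done

# Verify shard allocation and per-shard doc counts.
curl 'localhost:9200/_cat/shards?v'
```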
- Remove the disk that contains most/all of the data
Exceptions start to appear in the logs:
[2016-05-11 11:50:18,961][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
at org.elasticsearch.index.store.Store.stats(Store.java:321)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
... 18 more
[2016-05-11 11:50:26,796][WARN ][monitor.fs ] [Vindicator] Failed to fetch fs stats - returning empty instance
However, _cat/shards reports all shards as STARTED (note that the docs and store columns are now empty):
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
- Post a _refresh: no change.
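The refresh in the step above was issued as a plain POST with no body:

```shell
# Force a refresh of the test index; on a healthy cluster this
# makes recently indexed documents visible to search.
curl -XPOST 'localhost:9200/test1003/_refresh'
```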
- Index some data; the request fails:
{
"error": {
"root_cause": [
{
"type": "index_failed_engine_exception",
"reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
"index_uuid": "01ABN7pTQDCoTa80WMdAvg",
"shard": "1",
"index": "test1003"
}
],
"type": "index_failed_engine_exception",
"reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
"index_uuid": "01ABN7pTQDCoTa80WMdAvg",
"shard": "1",
"index": "test1003",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_a.cfs\") [slice=_a_Lucene50_0.tim]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
},
"status": 500
}
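The failing indexing call above can be reproduced with an ordinary document POST. The type name doc is illustrative; any document routed to a shard on the missing disk triggers the error shown above:

```shell
# Index a single document into test1003; if it routes to a shard
# whose files lived on the removed disk, the engine fails the write.
curl -XPOST 'localhost:9200/test1003/doc' -d '{"field": "value"}'
```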
The logs show the same store-stats exception:
[2016-05-11 11:52:26,911][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
at org.elasticsearch.index.store.Store.stats(Store.java:321)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
... 18 more
_cat/shards still shows all shards STARTED
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
- Wait 5 minutes, then search some data: still no change in allocation, and the search fails on two shards:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 3,
"failed": 2,
"failures": [
{
"shard": 0,
"index": "test1003",
"node": "AMr_NWrVSFCuNV-YCOfsVg",
"reason": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index/_0.cfs\") [slice=_0.fdt]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
},
{
"shard": 1,
"index": "test1003",
"node": "wK5mnEIaT82Wz3wdTAjv6Q",
"reason": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/1/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_2.cfs\") [slice=_2.fdt]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
}
]
},
"hits": {
"total": 23,
"max_score": 1,
"hits": []
}
}
_cat/shards still reports all shards as STARTED:
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
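The partial failure above comes from a simple match_all search; any query that touches the shards on the missing disk behaves the same way:

```shell
# Run a match_all query over the test index; shards whose files are
# gone report i_o_exception in the _shards.failures section.
curl 'localhost:9200/test1003/_search' -d '{"query": {"match_all": {}}}'
```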