Description
Describe the feature: Document cluster behavior when a file system crashes but node remains operational
Elasticsearch version: Generic
Plugins installed: [] n/a
JVM version (java -version): n/a
OS version (uname -a if on a Unix-like system): generic
Description of the problem including expected versus actual behavior:
The Elasticsearch documentation currently describes cluster behavior when a node fails. It does not describe what happens when a node's file system fails but the node itself remains operational. Such failures can and will happen, especially for customers using 3rd-party high-performance disk systems (SSD, RAID, etc.) that are loosely coupled with the OS. It is also common for customers to mount their data directories on a high-performance disk system while keeping their log data on the system drive, so a data-disk failure does not necessarily take the node down.
General issues that need to be addressed:
- cluster actions when primary shards are lost due to disk failure (according to my testing, replica shards on other nodes are promoted to primary)
- cluster actions when replica shards are lost due to disk failure (according to my testing, new replica shards are created on surviving nodes)
- parameters affecting shard management when a disk failure occurs
- cluster response when the disk failure is resolved and the disk system is brought back online (according to my testing, nothing happens until the entire cluster is restarted)
- response of the node and the cluster to queries and CRUD requests addressed to the node with the failed file system
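For anyone reproducing these observations, a minimal Python sketch along the following lines could help record shard state before and after a disk is pulled. It assumes an unauthenticated cluster at `http://localhost:9200` (an assumption, adjust as needed) and reads the `_cat/shards` API with `format=json`. The function names and `SAMPLE_ROWS` are hypothetical and hand-written for illustration; they are not captured output from a real cluster.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_shards(base_url="http://localhost:9200"):
    """Fetch shard rows from the _cat/shards API as a list of dicts.

    base_url is an assumption; point it at any node that is still reachable.
    """
    with urlopen(f"{base_url}/_cat/shards?format=json") as resp:
        return json.load(resp)


def summarize_shards(rows):
    """Count shard copies by (kind, state), where kind is 'primary' or 'replica'."""
    counts = Counter()
    for row in rows:
        kind = "primary" if row["prirep"] == "p" else "replica"
        counts[(kind, row["state"])] += 1
    return counts


def unassigned_copies(rows):
    """Return (index, shard, prirep) for every copy that is not allocated to a node."""
    return [
        (row["index"], row["shard"], row["prirep"])
        for row in rows
        if row["state"] == "UNASSIGNED"
    ]


# Hand-written illustrative rows (NOT real cluster output): one replica has
# lost its copy and has not yet been reallocated.
SAMPLE_ROWS = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-2"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-3"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "STARTED", "node": "node-2"},
]
```

Calling `summarize_shards(fetch_shards())` once before the failure and again afterwards would show whether primaries were promoted elsewhere and whether replicas were rebuilt, and `unassigned_copies` lists anything the cluster has not yet recovered.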
Relevant Discussions
"Expected behavior" during disk crashes has changed significantly between Elasticsearch versions, and several significant open issues address this question, including:
The cluster's response to failed-disk conditions specifically should be documented so that users can design their systems and plan for recovery.