
[DOCS] Document cluster behavior when a file system crashes but node remains operational #25591

@MorrieAtElastic

Describe the feature: Document cluster behavior when a file system crashes but node remains operational

Elasticsearch version: Generic

Plugins installed: [] n/a

JVM version (java -version): n/a

OS version (uname -a if on a Unix-like system): generic

Description of the problem including expected versus actual behavior:

The Elasticsearch documentation currently describes cluster behavior when a node fails. It does not describe behavior when a node's file system fails but the node itself remains operational. Such failures can and will happen, especially for customers using third-party high-performance disk systems (SSD, RAID, etc.) that are loosely coupled with the OS. Additionally, customers commonly mount their data directories on high-performance disk systems while keeping their log data on the system drive.

General issues that need to be addressed:

  • cluster actions when primary shards are lost due to disk failure (according to my testing, replica shards are promoted on other nodes)
  • cluster actions when replica shards are lost due to disk failure (new replica shards are created on surviving nodes)
  • parameters affecting shard management when a disk failure occurs
  • cluster response when the disk failure is resolved and the disk system is brought back online (according to my testing, nothing happens until the entire cluster is restarted)
  • response of the node and the cluster to queries and CRUD requests addressed to the node with the failed file system
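
For anyone reproducing these tests, here is a minimal sketch of one way to watch shard state before and after the disk failure. It tallies the rows returned by the `_cat/shards` API by node, primary/replica role, and state. The base URL, the helper names `fetch_shards` and `summarize_shards`, and the `(unassigned)` placeholder are assumptions made for illustration and are not part of Elasticsearch.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_shards(base_url="http://localhost:9200"):
    """Fetch shard rows from the _cat/shards API as a list of dicts.

    base_url is an assumption; point it at any node that is still reachable.
    """
    url = base_url + "/_cat/shards?format=json&h=index,shard,prirep,state,node"
    with urlopen(url) as resp:
        return json.load(resp)


def summarize_shards(rows):
    """Count shards by (node, role, state) so two snapshots can be compared."""
    summary = Counter()
    for row in rows:
        # Unassigned shards have no node; use a placeholder key for them.
        node = row.get("node") or "(unassigned)"
        role = "primary" if row.get("prirep") == "p" else "replica"
        summary[(node, role, row.get("state"))] += 1
    return summary
```

Calling `summarize_shards(fetch_shards())` before the failure, after the failure, and after the disk is brought back online gives three snapshots that can be compared to see whether replicas were promoted or reallocated.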

Relevant Discussions

"Expected behavior" during disk crashes has changed significantly between Elasticsearch versions, and several significant open issues address this question, including:

#18417
#18467
#19789

Cluster response to failed disk conditions specifically should be documented to support users' system design and recovery planning.


Labels

  • :Core/Infra/Core (Core issues without another label)
  • :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic and on the nodes)
  • :Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection)
  • >docs (General docs changes)
  • Team:Core/Infra (Meta label for core/infra team)
  • Team:Distributed (Obsolete) (Meta label for distributed team, obsolete; replaced by Distributed Indexing/Coordination)
  • resiliency
