@masseyke
Member

@masseyke masseyke commented Jun 7, 2022

This PR builds on #86524 by supporting two additional conditions, both of which happen when there has been
no elected master for more than 30 seconds (from the queried node's point of view), and both of which return
a RED status:

  1. There are no master-eligible nodes found in the cluster
  2. The node being queried sees a master-eligible node that has been elected the master, but cannot join it

This PR does NOT cover two additional conditions when there has been no elected master for more than
30 seconds that will be handled in subsequent PRs:

  1. There are master-eligible nodes but none is elected master and the queried node is not master-eligible
  2. There are master-eligible nodes but none is elected master and the queried node is master-eligible
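
As a rough illustration of how these four conditions partition the "no elected master for more than 30 seconds" case, here is a minimal Java sketch. The class, enum, and parameter names are hypothetical and are not taken from this PR's code; the sketch only mirrors the lists above and assumes the 30-second check has already been made by the caller.

```java
import java.util.List;

// Hypothetical sketch: none of these names come from the Elasticsearch code base.
final class NoMasterBranchSketch {

    enum Branch {
        NO_MASTER_ELIGIBLE_NODES,              // covered by this PR, RED
        CANNOT_JOIN_ELECTED_MASTER,            // covered by this PR, RED
        NO_ELECTION_QUERIED_NODE_NOT_ELIGIBLE, // left for a subsequent PR
        NO_ELECTION_QUERIED_NODE_ELIGIBLE      // left for a subsequent PR
    }

    /**
     * Assumes the caller has already established that the queried node has seen
     * no elected master for more than 30 seconds.
     *
     * @param discoveredMasterEligibleNodes master-eligible nodes the queried node can currently see
     * @param electedMasterSeen             name of a node the queried node believes was elected, or null
     * @param queriedNodeIsMasterEligible   whether the queried node is itself master-eligible
     */
    static Branch classify(
        List<String> discoveredMasterEligibleNodes,
        String electedMasterSeen,
        boolean queriedNodeIsMasterEligible
    ) {
        if (discoveredMasterEligibleNodes.isEmpty()) {
            return Branch.NO_MASTER_ELIGIBLE_NODES;
        }
        if (electedMasterSeen != null) {
            // A master has been elected, yet the queried node has had no master for 30+ seconds.
            return Branch.CANNOT_JOIN_ELECTED_MASTER;
        }
        return queriedNodeIsMasterEligible
            ? Branch.NO_ELECTION_QUERIED_NODE_ELIGIBLE
            : Branch.NO_ELECTION_QUERIED_NODE_NOT_ELIGIBLE;
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of(), null, false));
        System.out.println(classify(List.of("node_t0"), "node_t0", false));
    }
}
```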

Here is an example response when no master-eligible nodes are found (case 1 of the conditions covered by this PR):

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[-7055790250029418587]-HASH=[CCFA36BEC69DB]-cluster",
    "components": {
        "cluster_coordination": {
            "status": "red",
            "indicators": {
                "master_is_stable": {
                    "status": "red",
                    "summary": "No master eligible nodes found in the cluster",
                    "help_url": "https://ela.st/fix-master",
                    "details": {
                        "recent_masters": [
                            {
                                "node_id": "FhnTO8lgQJeTnL2y_TqjOQ",
                                "name": "node_t0"
                            }
                        ],
                        "cluster_formation": {
                            "description": "master not discovered yet: have discovered [{node_t1}{ZK-WxrjtSQ2FVSS5DOJo_A}{GJ_d8eTwR5yHfJQ3iVmNeQ}{node_t1}{127.0.0.1}{127.0.0.1:49785}{d}]; discovery will continue using [127.0.0.1:49782] from hosts providers and [{node_t0}{FhnTO8lgQJeTnL2y_TqjOQ}{HKoK0SzXSDqqolptGB2TKg}{node_t0}{127.0.0.1}{127.0.0.1:49782}{m}] from last-known cluster state; node term 1, last-accepted version 4 in term 1"
                        }
                    },
                    "impacts": [
                        {
                            "severity": 1,
                            "description": "The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.",
                            "impact_areas": [
                                "ingest"
                            ]
                        },
                        {
                            "severity": 1,
                            "description": "Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.",
                            "impact_areas": [
                                "deployment_management"
                            ]
                        },
                        {
                            "severity": 3,
                            "description": "Snapshot and restore will not work. Searchable snapshots cannot be mounted.",
                            "impact_areas": [
                                "backup"
                            ]
                        }
                    ],
                    "user_actions": [
                        {
                            "message": "The Elasticsearch cluster does not have a stable master node. Please contact Elastic Support (https://support.elastic.co) to discuss available options."
                        }
                    ]
                }
            }
        },
        "data": {
            "status": "unknown",
            "indicators": {
                "shards_availability": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        },
        "snapshot": {
            "status": "unknown",
            "indicators": {
                "repository_integrity": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        }
    }
}

And here is an example response when the queried node sees that a master has been elected but cannot join it (case 2 of the conditions covered by this PR). The exact cluster_formation description will likely be different in practice:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[-7408532121582643433]-HASH=[CCFDA72C6324B]-cluster",
    "components": {
        "cluster_coordination": {
            "status": "red",
            "indicators": {
                "master_is_stable": {
                    "status": "red",
                    "summary": "{node_t0}{5vpPatP2Qu6E2IXEPnVLhg}{u7JzobuISzm30ShCQrCxbA}{node_t0}{127.0.0.1}{127.0.0.1:50899}{m} has been elected master, but the node being queried, {node_t1}{1l0QyjoCS9Oc0eBwcanhUA}{_OMQcT7XRfSRnf6CdHPNIw}{node_t1}{127.0.0.1}{127.0.0.1:50902}{d}, is unable to join it",
                    "help_url": "https://ela.st/fix-master",
                    "details": {
                        "recent_masters": [
                            {
                                "node_id": "5vpPatP2Qu6E2IXEPnVLhg",
                                "name": "node_t0"
                            }
                        ],
                        "cluster_formation": {
                            "description": "master not discovered yet: have discovered [{node_t1}{1l0QyjoCS9Oc0eBwcanhUA}{_OMQcT7XRfSRnf6CdHPNIw}{node_t1}{127.0.0.1}{127.0.0.1:50902}{d}]; discovery will continue using [127.0.0.1:50899] from hosts providers and [{node_t0}{5vpPatP2Qu6E2IXEPnVLhg}{u7JzobuISzm30ShCQrCxbA}{node_t0}{127.0.0.1}{127.0.0.1:50899}{m}] from last-known cluster state; node term 1, last-accepted version 4 in term 1"
                        }
                    },
                    "impacts": [
                        {
                            "severity": 1,
                            "description": "The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.",
                            "impact_areas": [
                                "ingest"
                            ]
                        },
                        {
                            "severity": 1,
                            "description": "Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.",
                            "impact_areas": [
                                "deployment_management"
                            ]
                        },
                        {
                            "severity": 3,
                            "description": "Snapshot and restore will not work. Searchable snapshots cannot be mounted.",
                            "impact_areas": [
                                "backup"
                            ]
                        }
                    ],
                    "user_actions": [
                        {
                            "message": "The Elasticsearch cluster does not have a stable master node. Please contact Elastic Support (https://support.elastic.co) to discuss available options."
                        }
                    ]
                }
            }
        },
        "data": {
            "status": "unknown",
            "indicators": {
                "shards_availability": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        },
        "snapshot": {
            "status": "unknown",
            "indicators": {
                "repository_integrity": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        }
    }
}
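
In both responses the data and snapshot indicators report "unknown" and list "master_is_stable": "red" under reasons. One way to read that behaviour is sketched below in Java. This is an assumption drawn only from the example output, not from the implementation, and every name in it is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and are not Elasticsearch APIs.
final class DependentIndicatorSketch {

    enum Status { GREEN, YELLOW, RED, UNKNOWN }

    /** Collects the prerequisite indicators whose status is RED, keyed by indicator name. */
    static Map<String, String> blockingReasons(Map<String, Status> prerequisites) {
        Map<String, String> reasons = new LinkedHashMap<>();
        for (Map.Entry<String, Status> entry : prerequisites.entrySet()) {
            if (entry.getValue() == Status.RED) {
                reasons.put(entry.getKey(), "red");
            }
        }
        return reasons;
    }

    /** Reports UNKNOWN when any prerequisite is RED, otherwise the indicator's own status. */
    static Status effectiveStatus(Status ownStatus, Map<String, Status> prerequisites) {
        return blockingReasons(prerequisites).isEmpty() ? ownStatus : Status.UNKNOWN;
    }
}
```

Under this reading, an indicator such as shards_availability does not report its own status while master_is_stable is red; it reports "unknown" and names the blocking indicator in its details.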

@elasticsearchmachine
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke marked this pull request as ready for review June 14, 2022 14:30
@elasticmachine elasticmachine added the Team:Data Management label (Meta label for data/management team) Jun 14, 2022
@masseyke masseyke requested a review from andreidan June 14, 2022 14:30
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke
Member Author

2022-04-18 Coordination health
The attached file shows the full master stability health indicator flow. This PR covers only the 1.2.2.1 and 1.2.2.2 branches.

Contributor

@andreidan andreidan left a comment

Thanks for working on this Keith.

This looks great - left a few minor questions

* @param explain If true, details are returned
* @return A HealthIndicatorResult with a RED status
*/
private HealthIndicatorResult calculateOnNoMasterEligibleNodes(MasterHistory localMasterHistory, boolean explain) {
Contributor

nit: This calculate method is a bit different from the rest in this service, as it only creates the indicator result. Should we name it something that reflects that? e.g. getIndicatorResultOnNoMasterEligibleNodes

Same for calculateOnCannotJoinLeader

@masseyke masseyke requested a review from andreidan June 20, 2022 14:24
@masseyke
Member Author

@elasticmachine update branch

Contributor

@andreidan andreidan left a comment

Thanks for iterating on this Keith.

Left a few more questions

}
});
if (clusterCoordinationMessage != null) {
builder.field("cluster_coordination", clusterCoordinationMessage);
Contributor

Should we name this cluster_formation (together with the Java fields)?

*/
private HealthIndicatorResult getIndicatorResultOnNoMasterEligibleNodes(MasterHistory localMasterHistory, boolean explain) {
String summary = "No master eligible nodes found in the cluster";
HealthIndicatorDetails details = getDetails(explain, localMasterHistory, coordinator.getClusterFormationState().getDescription());
Contributor

Do we need more structure in the cluster formation details section here?

i.e. we'd like to report a discovery problem: the configured addresses and the results of contacting those addresses.
Should these come as part of PeerFinder and PeerFinder#probeConnectionResult?
If so, do we want an established structure in the API as opposed to the ClusterFormationState description?

Member Author

Do we need more structure in the response? Isn't the goal to get something human-readable here? My assumption was that someone has already put a good bit of thought into making an informative, human-readable description here, so it's best to use that. The additional structure is useful for making decisions in the indicator, but not necessarily for the end user, right? Or did you have something in mind that would be useful to the end user?

Contributor

The details section is the more technical field of the API. I think we will indeed need some structure, but we don't yet know what that would look like: support, SREs, admins, and telemetry(?) will want to parse/register this information in a more structured way.

Until we get more requirements, let's keep it free text and we'll add the structure once we get some more information on how we'll use it.

Shall we prepare the cluster_formation field for future structure by making it an object with a description field?

Member Author

Makes sense. I've just made that change.
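
For readers following the thread, here is a minimal sketch of the shape being discussed: cluster_formation as an object with a single description field, emitted only when a description is available. It uses plain Java maps rather than the real response builder, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; it models the response shape with plain maps, not the real builder.
final class ClusterFormationDetailsSketch {

    /** Builds a details map whose cluster_formation entry is an object with a description field. */
    static Map<String, Object> details(String clusterFormationDescription) {
        Map<String, Object> details = new LinkedHashMap<>();
        if (clusterFormationDescription != null) {
            Map<String, Object> clusterFormation = new LinkedHashMap<>();
            clusterFormation.put("description", clusterFormationDescription);
            details.put("cluster_formation", clusterFormation);
        }
        return details;
    }

    public static void main(String[] args) {
        System.out.println(details("master not discovered yet"));
    }
}
```

Because cluster_formation is an object rather than a bare string, structured fields could later be added next to description without changing the type of the existing key.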

Contributor

@andreidan andreidan Jun 22, 2022

Can you please update the PR description with the latest format?

currentMaster,
clusterService.localNode()
);
HealthIndicatorDetails details = getDetails(explain, localMasterHistory, coordinator.getClusterFormationState().getDescription());
Contributor

Similarly, do we want an established structure in the API details? i.e. display ClusterFormationState#inFlightJoinStatuses in a structured way?

masseyke added 2 commits June 21, 2022 08:49
…-or-discovery-problem' of github.com:masseyke/elasticsearch into feature/health-api-master-stability-indicator-no-master-or-discovery-problem
@masseyke masseyke requested a review from andreidan June 21, 2022 14:07
Contributor

@andreidan andreidan left a comment

LGTM (pending test fixes :) ), thanks for iterating on this Keith


4 participants