Adding additional capability to the master_is_stable health indicator service #87482

masseyke · 2022-06-07T20:27:28Z

This PR builds on #86524 by supporting two additional conditions, both of which happen when there has been
no elected master for more than 30 seconds (from the queried node's point of view), and both of which return
a RED status:

1. There are no master-eligible nodes found in the cluster
2. The node being queried sees a master-eligible node that has been elected the master, but cannot join it

This PR does NOT cover two further conditions that can also occur when there has been no elected master
for more than 30 seconds. Both will be handled in subsequent PRs:

There are master-eligible nodes but none is elected master and the queried node is not master-eligible
There are master-eligible nodes but none is elected master and the queried node is master-eligible
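
The four conditions above can be read as a small decision table. The following is a minimal sketch of that reading, not the code in StableMasterHealthIndicatorService: the class, enum, method, and parameter names are all hypothetical, and the order of the checks is an assumption.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;

// Hypothetical names throughout; this only illustrates the conditions listed in the description.
class MasterStabilitySketch {

    enum Outcome {
        NOT_APPLICABLE,                          // no master for 30 seconds or less
        NO_MASTER_ELIGIBLE_NODES,                // RED, covered by this PR
        CANNOT_JOIN_ELECTED_MASTER,              // RED, covered by this PR
        NONE_ELECTED_QUERIED_NODE_NOT_ELIGIBLE,  // deferred to a subsequent PR
        NONE_ELECTED_QUERIED_NODE_ELIGIBLE       // deferred to a subsequent PR
    }

    // Threshold taken from the description: "more than 30 seconds".
    static final Duration NO_MASTER_THRESHOLD = Duration.ofSeconds(30);

    static Outcome classify(
        Duration timeWithoutMaster,
        List<String> masterEligibleNodes,
        Optional<String> electedMasterSeenByQueriedNode,
        boolean queriedNodeIsMasterEligible
    ) {
        if (timeWithoutMaster.compareTo(NO_MASTER_THRESHOLD) <= 0) {
            return Outcome.NOT_APPLICABLE;
        }
        if (masterEligibleNodes.isEmpty()) {
            return Outcome.NO_MASTER_ELIGIBLE_NODES;
        }
        // The queried node has had no master for over 30 seconds, yet it can see
        // that some node has been elected: it is failing to join that master.
        if (electedMasterSeenByQueriedNode.isPresent()) {
            return Outcome.CANNOT_JOIN_ELECTED_MASTER;
        }
        return queriedNodeIsMasterEligible
            ? Outcome.NONE_ELECTED_QUERIED_NODE_ELIGIBLE
            : Outcome.NONE_ELECTED_QUERIED_NODE_NOT_ELIGIBLE;
    }

    // True only for the two conditions this PR reports as RED.
    static boolean coveredByThisPr(Outcome outcome) {
        return outcome == Outcome.NO_MASTER_ELIGIBLE_NODES || outcome == Outcome.CANNOT_JOIN_ELECTED_MASTER;
    }

    public static void main(String[] args) {
        Outcome outcome = classify(Duration.ofSeconds(45), List.of(), Optional.empty(), false);
        System.out.println(outcome + " coveredByThisPr=" + coveredByThisPr(outcome));
    }
}
```

Because an elected master is itself master-eligible, the first two RED conditions cannot both hold at once, so the order of those two checks does not change the result in this sketch.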

Here is an example response in the case that there are no master-eligible nodes (case 1 above):

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[-7055790250029418587]-HASH=[CCFA36BEC69DB]-cluster",
    "components": {
        "cluster_coordination": {
            "status": "red",
            "indicators": {
                "master_is_stable": {
                    "status": "red",
                    "summary": "No master eligible nodes found in the cluster",
                    "help_url": "https://ela.st/fix-master",
                    "details": {
                        "recent_masters": [
                            {
                                "node_id": "FhnTO8lgQJeTnL2y_TqjOQ",
                                "name": "node_t0"
                            }
                        ],
                        "cluster_formation": {
                            "description": "master not discovered yet: have discovered [{node_t1}{ZK-WxrjtSQ2FVSS5DOJo_A}{GJ_d8eTwR5yHfJQ3iVmNeQ}{node_t1}{127.0.0.1}{127.0.0.1:49785}{d}]; discovery will continue using [127.0.0.1:49782] from hosts providers and [{node_t0}{FhnTO8lgQJeTnL2y_TqjOQ}{HKoK0SzXSDqqolptGB2TKg}{node_t0}{127.0.0.1}{127.0.0.1:49782}{m}] from last-known cluster state; node term 1, last-accepted version 4 in term 1"
                        }
                    },
                    "impacts": [
                        {
                            "severity": 1,
                            "description": "The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.",
                            "impact_areas": [
                                "ingest"
                            ]
                        },
                        {
                            "severity": 1,
                            "description": "Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.",
                            "impact_areas": [
                                "deployment_management"
                            ]
                        },
                        {
                            "severity": 3,
                            "description": "Snapshot and restore will not work. Searchable snapshots cannot be mounted.",
                            "impact_areas": [
                                "backup"
                            ]
                        }
                    ],
                    "user_actions": [
                        {
                            "message": "The Elasticsearch cluster does not have a stable master node. Please contact Elastic Support (https://support.elastic.co) to discuss available options."
                        }
                    ]
                }
            }
        },
        "data": {
            "status": "unknown",
            "indicators": {
                "shards_availability": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        },
        "snapshot": {
            "status": "unknown",
            "indicators": {
                "repository_integrity": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        }
    }
}

And here is an example response when the node sees that a master node has been elected but cannot join it (case 2 above). The exact cluster_formation description will likely be different in practice:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[-7408532121582643433]-HASH=[CCFDA72C6324B]-cluster",
    "components": {
        "cluster_coordination": {
            "status": "red",
            "indicators": {
                "master_is_stable": {
                    "status": "red",
                    "summary": "{node_t0}{5vpPatP2Qu6E2IXEPnVLhg}{u7JzobuISzm30ShCQrCxbA}{node_t0}{127.0.0.1}{127.0.0.1:50899}{m} has been elected master, but the node being queried, {node_t1}{1l0QyjoCS9Oc0eBwcanhUA}{_OMQcT7XRfSRnf6CdHPNIw}{node_t1}{127.0.0.1}{127.0.0.1:50902}{d}, is unable to join it",
                    "help_url": "https://ela.st/fix-master",
                    "details": {
                        "recent_masters": [
                            {
                                "node_id": "5vpPatP2Qu6E2IXEPnVLhg",
                                "name": "node_t0"
                            }
                        ],
                        "cluster_formation": {
                            "description": "master not discovered yet: have discovered [{node_t1}{1l0QyjoCS9Oc0eBwcanhUA}{_OMQcT7XRfSRnf6CdHPNIw}{node_t1}{127.0.0.1}{127.0.0.1:50902}{d}]; discovery will continue using [127.0.0.1:50899] from hosts providers and [{node_t0}{5vpPatP2Qu6E2IXEPnVLhg}{u7JzobuISzm30ShCQrCxbA}{node_t0}{127.0.0.1}{127.0.0.1:50899}{m}] from last-known cluster state; node term 1, last-accepted version 4 in term 1"
                        }
                    },
                    "impacts": [
                        {
                            "severity": 1,
                            "description": "The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.",
                            "impact_areas": [
                                "ingest"
                            ]
                        },
                        {
                            "severity": 1,
                            "description": "Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.",
                            "impact_areas": [
                                "deployment_management"
                            ]
                        },
                        {
                            "severity": 3,
                            "description": "Snapshot and restore will not work. Searchable snapshots cannot be mounted.",
                            "impact_areas": [
                                "backup"
                            ]
                        }
                    ],
                    "user_actions": [
                        {
                            "message": "The Elasticsearch cluster does not have a stable master node. Please contact Elastic Support (https://support.elastic.co) to discuss available options."
                        }
                    ]
                }
            }
        },
        "data": {
            "status": "unknown",
            "indicators": {
                "shards_availability": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        },
        "snapshot": {
            "status": "unknown",
            "indicators": {
                "repository_integrity": {
                    "status": "unknown",
                    "summary": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
                    "details": {
                        "reasons": {
                            "master_is_stable": "red"
                        }
                    }
                }
            }
        }
    }
}

elasticsearchmachine · 2022-06-07T20:27:52Z

Hi @masseyke, I've created a changelog YAML for you.

…tor-no-master-or-discovery-problem

elasticmachine · 2022-06-14T14:30:54Z

Pinging @elastic/es-data-management (Team:Data Management)

masseyke · 2022-06-14T14:38:42Z

The attached file shows the full master stability health indicator flow. This PR covers only the 1.2.2.1 and 1.2.2.2 branches.

andreidan

Thanks for working on this, Keith.

This looks great. I left a few minor questions.

andreidan · 2022-06-17T16:12:30Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+     * @param explain If true, details are returned
+     * @return A HealthIndicatorResult with a RED status
+     */
+    private HealthIndicatorResult calculateOnNoMasterEligibleNodes(MasterHistory localMasterHistory, boolean explain) {


nit: This calculate method is a bit different from the rest in this service, as it just creates the indicator result. Should we name it something that reflects that, e.g. getIndicatorResultOnNoMasterEligibleNodes?

Same for calculateOnCannotJoinLeader

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

masseyke · 2022-06-20T14:26:26Z

@elasticmachine update branch

…tor-no-master-or-discovery-problem

andreidan

Thanks for iterating on this, Keith.

I left a few more questions.

andreidan · 2022-06-20T15:57:30Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

                }
            });
+            if (clusterCoordinationMessage != null) {
+                builder.field("cluster_coordination", clusterCoordinationMessage);


Should we name this cluster_formation (together with the Java fields)?

andreidan · 2022-06-20T16:16:23Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+     */
+    private HealthIndicatorResult getIndicatorResultOnNoMasterEligibleNodes(MasterHistory localMasterHistory, boolean explain) {
+        String summary = "No master eligible nodes found in the cluster";
+        HealthIndicatorDetails details = getDetails(explain, localMasterHistory, coordinator.getClusterFormationState().getDescription());


Do we need more structure in the cluster formation details section here?

I.e., we'd like to report a discovery problem (the configured addresses and the results of contacting those addresses).
Should these come as part of PeerFinder and PeerFinder#probeConnectionResult?
If so, do we want an established structure in the API as opposed to the ClusterFormationState description?

masseyke · 2022-06-20

Do we need more structure in the response? Isn't the goal to get something human-readable here? My assumption was that someone has already put a good bit of thought into making an informative human-readable description here, so it is best to use that. The additional structure is useful for making decisions in the indicator, but not necessarily to the end user, right? Or did you have something in mind that would be useful to the end user?

andreidan · 2022-06-21

The details section is the more technical field of the API. I think we will indeed need some structure, but we don't yet know what that would look like: support, SREs, admins, telemetry(?) will want to parse/register this information in a more structured way.

Until we get more requirements, let's keep it free text and we'll add the structure once we get some more information on how we'll use it.

Shall we prepare the cluster_formation field for future structure by making it an object with a description field?

masseyke · 2022-06-21

Makes sense. I've just made that change.

andreidan · 2022-06-22

Can you please update the PR description with the latest format?
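
The shape discussed in this thread (cluster_formation as an object holding a free-text description, emitted only when details are requested and a message exists) can be sketched with plain maps. This is an illustrative stand-in, not the PR's code: the class and method names are hypothetical, and maps are used here in place of whatever builder the service actually uses.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the details section of the master_is_stable indicator.
class ClusterFormationDetailsSketch {

    static Map<String, Object> buildDetails(
        boolean explain,
        List<Map<String, String>> recentMasters,
        String clusterFormationDescription
    ) {
        Map<String, Object> details = new LinkedHashMap<>();
        if (explain == false) {
            // Mirrors the javadoc in the diff: details are returned only if explain is true.
            return details;
        }
        details.put("recent_masters", recentMasters);
        if (clusterFormationDescription != null) {
            // An object rather than a bare string, so structured fields can be added
            // later without changing the type of cluster_formation.
            Map<String, Object> clusterFormation = new LinkedHashMap<>();
            clusterFormation.put("description", clusterFormationDescription);
            details.put("cluster_formation", clusterFormation);
        }
        return details;
    }

    public static void main(String[] args) {
        Map<String, String> master = new LinkedHashMap<>();
        master.put("node_id", "FhnTO8lgQJeTnL2y_TqjOQ");
        master.put("name", "node_t0");
        System.out.println(buildDetails(true, List.of(master), "master not discovered yet"));
    }
}
```

Keeping the human-readable text under a description key leaves room for the structured discovery information raised earlier in the thread to be added as sibling fields once the requirements are known.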

andreidan · 2022-06-20T16:17:57Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+            currentMaster,
+            clusterService.localNode()
        );
+        HealthIndicatorDetails details = getDetails(explain, localMasterHistory, coordinator.getClusterFormationState().getDescription());


Similarly, do we want an established structure in the API details, e.g. displaying ClusterFormationState#inFlightJoinStatuses in a structured way?

…-or-discovery-problem' of github.com:masseyke/elasticsearch into feature/health-api-master-stability-indicator-no-master-or-discovery-problem

andreidan

LGTM (pending test fixes :) ), thanks for iterating on this Keith

…tor-no-master-or-discovery-problem

Initial commit

c2fbaca

masseyke added >enhancement :Data Management/Health v8.4.0 labels Jun 7, 2022

Update docs/changelog/87482.yaml

dbed053

masseyke added 8 commits June 7, 2022 17:29

Adding an integration test

3b85fd6

adding integration tests

bf9211d

Merge branch 'master' into feature/health-api-master-stability-indica…

8e2c975

…tor-no-master-or-discovery-problem

Fixing StableMasterHealthIndicatorServiceTests

aa27f3c

fixing integration tests

c8f415f

adding to unit tests

e42480c

adding javadocs

b1f9bb5

Merge branch 'master' into feature/health-api-master-stability-indica…

0cb9425

…tor-no-master-or-discovery-problem

masseyke marked this pull request as ready for review June 14, 2022 14:30

elasticmachine added the Team:Data Management label Jun 14, 2022

masseyke requested a review from andreidan June 14, 2022 14:30

This was referenced Jun 14, 2022

Cluster coordination indicator - report if the master is stable and an impact/troubleshoot guide otherwise #85624

Closed

Move the master stability logic into its own service separate from the HealthIndicatorService #87672

Merged

andreidan reviewed Jun 17, 2022

masseyke added 2 commits June 20, 2022 08:57

code review feedback

9850fcf

code review feedback

8b1abff

masseyke requested a review from andreidan June 20, 2022 14:24

Merge branch 'master' into feature/health-api-master-stability-indica…

4e15b0a

…tor-no-master-or-discovery-problem

andreidan reviewed Jun 20, 2022

masseyke added 2 commits June 21, 2022 08:49

merging master

f6d8e60

Merge branch 'feature/health-api-master-stability-indicator-no-master…

d431bda

…-or-discovery-problem' of github.com:masseyke/elasticsearch into feature/health-api-master-stability-indicator-no-master-or-discovery-problem

masseyke added 2 commits June 21, 2022 08:55

code review feedback

5e7b346

turning cluster_formation into an object in details

010c5d8

masseyke requested a review from andreidan June 21, 2022 14:07

masseyke added 3 commits June 21, 2022 18:00

merging master

bf8029d

cleaning up

35a58fc

spotlessApply

b05383f

andreidan approved these changes Jun 22, 2022

masseyke added 2 commits June 22, 2022 11:28

fixing unit tests

e5a2aa6

Merge branch 'master' into feature/health-api-master-stability-indica…

d5a1c3c

…tor-no-master-or-discovery-problem

masseyke merged commit b34e1bf into elastic:master Jun 22, 2022

masseyke deleted the feature/health-api-master-stability-indicator-no-master-or-discovery-problem branch June 22, 2022 18:32

masseyke mentioned this pull request Jun 23, 2022

Adding more master_is_stable details #87977

Merged

masseyke mentioned this pull request Jun 29, 2022

Adding logic to master_is_stable indicator to check for discovery problems #88020

Merged

masseyke mentioned this pull request Aug 9, 2022

Adding a check to the master stability health API when there is no master and the current node is not master eligible #89219

Merged
