-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Adding additional capability to the master_is_stable health indicator service #87482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding additional capability to the master_is_stable health indicator service #87482
Conversation
|
Hi @masseyke, I've created a changelog YAML for you. |
…tor-no-master-or-discovery-problem
…tor-no-master-or-discovery-problem
|
Pinging @elastic/es-data-management (Team:Data Management) |
andreidan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Keith.
This looks great - left a few minor questions
| * @param explain If true, details are returned | ||
| * @return A HealthIndicatorResult with a RED status | ||
| */ | ||
| private HealthIndicatorResult calculateOnNoMasterEligibleNodes(MasterHistory localMasterHistory, boolean explain) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This calculate method is a bit different than the rest in this service as it's just creating the indicator result. Should we name it something to reflect that? ie. getIndicatorResultOnNoMasterEligibleNodes
Same for calculateOnCannotJoinLeader
...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
|
@elasticmachine update branch |
…tor-no-master-or-discovery-problem
andreidan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this Keith.
Left a few more questions
| } | ||
| }); | ||
| if (clusterCoordinationMessage != null) { | ||
| builder.field("cluster_coordination", clusterCoordinationMessage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we name this cluster_formation? (together with the java fields)
| */ | ||
| private HealthIndicatorResult getIndicatorResultOnNoMasterEligibleNodes(MasterHistory localMasterHistory, boolean explain) { | ||
| String summary = "No master eligible nodes found in the cluster"; | ||
| HealthIndicatorDetails details = getDetails(explain, localMasterHistory, coordinator.getClusterFormationState().getDescription()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need more structure in the cluster formation details section here?
ie. we'd like to report a discovery problem - configured addresses and results of contacting those addresses
Should these come as part of PeerFinder and PeerFinder#probeConnectionResult ?
If so, do we want an established structure in the API as opposed to the ClusterFormationState description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need more structure in the response? Isn't the goal to get something human-readable here? My assumption was that someone has already put a good bit of thought into making an information human-readable description here, so best to use that. The additional structure is useful for making decisions in the indicator, but not necessarily to the end user, right? Or did you have something in mind that would be useful to the end user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The details section is the more technical field of the API. I think we will indeed need some structure but we don't know now what that would look like - support, SREs, admins, telemetry(?) will want to parse/register this information in a more structured way.
Until we get more requirements, let's keep it free text and we'll add the structure once we get some more information on how we'll use it.
Shall we prepare the cluster_formation field for future structure by making it an object with a description field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I've just made that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please update the PR description with the latest format?
| currentMaster, | ||
| clusterService.localNode() | ||
| ); | ||
| HealthIndicatorDetails details = getDetails(explain, localMasterHistory, coordinator.getClusterFormationState().getDescription()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, do we want an established structure in the api details? ie. display ClusterFormationState#inFlightJoinStatuses in a structured way?
…-or-discovery-problem' of github.com:masseyke/elasticsearch into feature/health-api-master-stability-indicator-no-master-or-discovery-problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (pending test fixes :) ), thanks for iterating on this Keith
…tor-no-master-or-discovery-problem

This PR builds on #86524 by supporting two additional conditions, both of which happen when there has been
no elected master for more than 30 seconds (from the queried node's point of view), and both of which return
a RED status:
This PR does NOT cover two additional conditions when there has been no elected master for more than
30 seconds that will be handled in subsequent PRs:
Here is an example response in the case that there are no master-eligible nodes (case 1 above):
And here is an example response when the node sees that a master node has been elected but cannot join it (case 2 above). The exact
cluster_coordinationmessage will likely be different in practice: