-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Move the master stability logic into its own service separate from the HealthIndicatorService #87672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move the master stability logic into its own service separate from the HealthIndicatorService #87672
Conversation
|
Hi @masseyke, I've created a changelog YAML for you. |
|
Pinging @elastic/es-data-management (Team:Data Management) |
andreidan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Keith.
I think this looks good, just left a few minor questions and suggestions about having the terminology more targeted towards a standalone service (that could be used outside of the health API context) and modularity
| * Since this service needs to be able to run when there is no master at all, it does not depend on the dedicated health node (which | ||
| * requires the existence of a master). | ||
| */ | ||
| public class StableMasterService implements ClusterStateListener { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this would now be a standalone service, should we name this something a bit more descriptive (this service will look into discovery and cluster formation - and maybe expose some of those analyses via different public methods at some point, even though I do appreciate now it only has one public method)
Would ClusterAnalysisService/ CoordinationAnalysisService / CoordinationDiagnosticsService be a bit more descriptive?
With a method analyseMasterStability/ diagnoseMasterStability instead of calculate ?
In the same context, would StableMasterDetails be better named something along the lines of CoordinationAnalysisDetails?
These are all off-the-cuff suggestions (naming is, well, hard :) )
cc @DaveCTurner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I haven't thought of any great names. MasterStabilityAndCoordinationService?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference would be towards CoordinationDiagnosticsService as "CoordinationService" and the likes sort of point to this service being the one that handles the cluster coordination (which is what Coordinator does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds as good as anything. I'll go with that unless someone comes up with something better.
| import org.elasticsearch.common.settings.Setting; | ||
| import org.elasticsearch.core.Nullable; | ||
| import org.elasticsearch.core.TimeValue; | ||
| import org.elasticsearch.health.HealthStatus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates a circular dependency - the cluster.coordination package depends on the healthpackage and health depends on cluster.coordination.
I think we should either have a status entity in cluster.coordination and interpret it when this service is consumed, or don't have a status at all but a different way to express the result of this service - a diagnostic /analysis of sorts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good catch. I'll fix that.
| StableMasterService.IDENTITY_CHANGES_THRESHOLD_SETTING, | ||
| StableMasterService.NO_MASTER_TRANSITIONS_THRESHOLD_SETTING, | ||
| StableMasterService.NODE_HAS_MASTER_LOOKUP_TIMEFRAME_SETTING, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we document these settings? (separate body of work - was just wondering if we should work on documenting these as expert-level settings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we ought to. I've been trying to figure out where would be a good place to document these without cluttering more important documentation. Have any ideas? Maybe a settings page for the whole health API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we have a page under https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html that's called something around what this service will end up being called?
ie. cluster diagnostics settings?
These settings will need some big warnings like here
I've added an item on #85624 for this work
andreidan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working on this Keith 🚀
| * @param explain Whether to calculate and include the details in the result | ||
| * @return The StableMasterResult for the given localMasterHistory | ||
| */ | ||
| private StableMasterResult diagnoseOnHaveNotSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we name this Diagnostic or something along those lines?
| RED; | ||
| } | ||
|
|
||
| public record StableMasterDetails( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we name this DiagnosticDetails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yeah. I thought I'd caught all of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh for some reason intellij refactoring is silently failing on these! So I only thought I'd changed them!
| return name().toLowerCase(Locale.ROOT); | ||
| } | ||
|
|
||
| public static HealthStatus fromStableMasterStatus( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we name this fromDiagnosticStatus ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yeah. I thought I'd caught all of these.
| * @param remoteHistoryException An exception that was found when retrieving the remote master history. Can be null | ||
| * @return The StableMasterDetails | ||
| */ | ||
| private StableMasterDetails getStableMasterDetailsOnMasterHasFlappedNull( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would "getDetailsOnMasterHasFlappedNull` be a bit more concise?
jbaiera
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending the last few name changes left
| } | ||
| } | ||
|
|
||
| public record StableMasterResult(CoordinationDiagnosticsStatus status, String summary, StableMasterDetails details) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also be CoordinationDiagonosticResult?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, intellij refactoring failed silently. Search and replace worked though!
…87984) This exposes the CoordinationDiagnosticsService (#87672) through a transport action so that it can be called remotely as part of the health API in the event that: (1) there has been no master recently, (2) there are master-eligible nodes in the cluster, (3) none are elected, and (4) the current node is not master eligible.
For various reasons, we want the logic that decides whether we have had a stable master node separated from the logic that generates the HealthIndicatorResult (which is specific to the health API). This commit separates the logic of the master stability code from the HealthIndicatorService code. Most pressingly, we need to be able to call the stable master logic remotely from within StableMasterHistoryIndicatorService (when there has been no master recently, there is at least one master eligible node, none of those nodes are elected master, and the current node is not master-eligible). This commit splits StableMasterHistoryIndicatorService into StableMasterHistoryIndicatorService and StableMasterService. StableMasterService has a calculate method much like StableMasterHistoryIndicatorService's, but it returns a custom result with only stable-master-specific information. The job of StableMasterHistoryIndicatorService is just to transform this result into a HealthIndicatorResult.
The goal of this change is just to split the logic between StableMasterHistoryIndicatorService and StableMasterService. No externally-visible functionality is being added. A follow-up PR will make StableMasterService available through a transport action (which will include making the output of StableMasterService.calculate() serializable). And a follow-up PR to that will make use of that transport action (in the 1.2.2.3 branch of the diagram at #87482 (comment)).
Relates #85624