Skip to content

Conversation

@masseyke
Copy link
Member

@masseyke masseyke commented Jun 14, 2022

For various reasons, we want the logic that decides whether we have had a stable master node separated from the logic that generates the HealthIndicatorResult (which is specific to the health API). This commit separates the logic of the master stability code from the HealthIndicatorService code. Most pressingly, we need to be able to call the stable master logic remotely from within StableMasterHistoryIndicatorService (when there has been no master recently, there is at least one master eligible node, none of those nodes are elected master, and the current node is not master-eligible). This commit splits StableMasterHistoryIndicatorService into StableMasterHistoryIndicatorService and StableMasterService. StableMasterService has a calculate method much like StableMasterHistoryIndicatorService's, but it returns a custom result with only stable-master-specific information. The job of StableMasterHistoryIndicatorService is just to transform this result into a HealthIndicatorResult.

The goal of this change is just to split the logic between StableMasterHistoryIndicatorService and StableMasterService. No externally-visible functionality is being added. A follow-up PR will make StableMasterService available through a transport action (which will include making the output of StableMasterService.calculate() serializable). And a follow-up PR to that will make use of that transport action (in the 1.2.2.3 branch of the diagram at #87482 (comment)).
Relates #85624

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke requested review from andreidan and jbaiera June 16, 2022 18:54
@masseyke masseyke marked this pull request as ready for review June 16, 2022 18:55
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jun 16, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith.

I think this looks good, just left a few minor questions and suggestions about having the terminology more targeted towards a standalone service (that could be used outside of the health API context) and modularity

* Since this service needs to be able to run when there is no master at all, it does not depend on the dedicated health node (which
* requires the existence of a master).
*/
public class StableMasterService implements ClusterStateListener {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this would now be a standalone service, should we name this something a bit more descriptive (this service will look into discovery and cluster formation - and maybe expose some of those analyses via different public methods at some point, even though I do appreciate now it only has one public method)

Would ClusterAnalysisService/ CoordinationAnalysisService / CoordinationDiagnosticsService be a bit more descriptive?
With a method analyseMasterStability/ diagnoseMasterStability instead of calculate ?
In the same context, would StableMasterDetails be better named something along the lines of CoordinationAnalysisDetails?

These are all off-the-cuff suggestions (naming is, well, hard :) )

cc @DaveCTurner

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I haven't thought of any great names. MasterStabilityAndCoordinationService?

Copy link
Contributor

@andreidan andreidan Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be towards CoordinationDiagnosticsService as "CoordinationService" and the likes sort of point to this service being the one that handles the cluster coordination (which is what Coordinator does)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds as good as anything. I'll go with that unless someone comes up with something better.

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.Nullable;
import org.elasticsearch.core.TimeValue;
import org.elasticsearch.health.HealthStatus;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a circular dependency - the cluster.coordination package depends on the healthpackage and health depends on cluster.coordination.

I think we should either have a status entity in cluster.coordination and interpret it when this service is consumed, or don't have a status at all but a different way to express the result of this service - a diagnostic /analysis of sorts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good catch. I'll fix that.

Comment on lines 517 to 519
StableMasterService.IDENTITY_CHANGES_THRESHOLD_SETTING,
StableMasterService.NO_MASTER_TRANSITIONS_THRESHOLD_SETTING,
StableMasterService.NODE_HAS_MASTER_LOOKUP_TIMEFRAME_SETTING,
Copy link
Contributor

@andreidan andreidan Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document these settings? (separate body of work - was just wondering if we should work on documenting these as expert-level settings)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we ought to. I've been trying to figure out where would be a good place to document these without cluttering more important documentation. Have any ideas? Maybe a settings page for the whole health API?

Copy link
Contributor

@andreidan andreidan Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we have a page under https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html that's called something around what this service will end up being called?
ie. cluster diagnostics settings?

These settings will need some big warnings like here

I've added an item on #85624 for this work

@masseyke masseyke requested a review from andreidan June 21, 2022 14:50
Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this Keith 🚀

* @param explain Whether to calculate and include the details in the result
* @return The StableMasterResult for the given localMasterHistory
*/
private StableMasterResult diagnoseOnHaveNotSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we name this Diagnostic or something along those lines?

RED;
}

public record StableMasterDetails(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we name this DiagnosticDetails?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yeah. I thought I'd caught all of these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh for some reason intellij refactoring is silently failing on these! So I only thought I'd changed them!

return name().toLowerCase(Locale.ROOT);
}

public static HealthStatus fromStableMasterStatus(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we name this fromDiagnosticStatus ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yeah. I thought I'd caught all of these.

* @param remoteHistoryException An exception that was found when retrieving the remote master history. Can be null
* @return The StableMasterDetails
*/
private StableMasterDetails getStableMasterDetailsOnMasterHasFlappedNull(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would "getDetailsOnMasterHasFlappedNull` be a bit more concise?

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending the last few name changes left

}
}

public record StableMasterResult(CoordinationDiagnosticsStatus status, String summary, StableMasterDetails details) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be CoordinationDiagonosticResult?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, intellij refactoring failed silently. Search and replace worked though!

@masseyke masseyke merged commit 214bd60 into elastic:master Jun 21, 2022
@masseyke masseyke deleted the feature/make-stable-master-service branch June 21, 2022 20:12
masseyke added a commit that referenced this pull request Jun 29, 2022
…87984)

This exposes the CoordinationDiagnosticsService (#87672) through a transport action so that it can
be called remotely as part of the health API in the event that: (1) there has been no master recently,
(2) there are master-eligible nodes in the cluster, (3) none are elected, and (4) the current node is
not master eligible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants