@HiDAl HiDAl commented Mar 20, 2023

Introduces a new Health Indicator to check the cluster's health from the shards' capacity perspective.

It calculates how much room is left for new shards in the data and frozen node groups and reports a status according to the following rules:

if data or frozen nodes have room for fewer than 5 more shards -> RED
if data or frozen nodes have room for fewer than 10 more shards -> YELLOW
otherwise -> GREEN
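
The three rules above can be sketched as a small Java function. This is only an illustration and not the code in this PR: the class name `ShardsCapacitySketch`, the `Status` enum, and the method `statusFor` are hypothetical, and the sketch assumes that the available room is the configured maximum minus the shards currently in use.

```java
// Hypothetical sketch of the threshold rules; not the implementation in this PR.
final class ShardsCapacitySketch {

    enum Status { GREEN, YELLOW, RED }

    // Assumption: available room = configured maximum - shards currently in use.
    static Status statusFor(int maxShardsInCluster, int currentUsedShards) {
        int room = maxShardsInCluster - currentUsedShards;
        if (room < 5) {
            return Status.RED;      // room for fewer than 5 more shards
        }
        if (room < 10) {
            return Status.YELLOW;   // room for fewer than 10 more shards
        }
        return Status.GREEN;
    }
}
```

Under these assumptions, a group with a maximum of 14 shards and 10 in use has room for 4 more, so `statusFor(14, 10)` evaluates to `RED`.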

This is the output in case the cluster is unhealthy:

GET _health_report/shards_capacity
{
  "cluster_name": "runTask",
  "indicators": {
    "shards_capacity": {
      "status": "red",
      "symptom": "Cluster is close to reaching the configured maximum number of shards for data nodes.",
      "details": {
        "data": {
          "max_shards_in_cluster": 14,
          "current_used_shards": 10
        },
        "frozen": {
          "max_shards_in_cluster": 10,
          "current_used_shards": 0
        }
      },
      "impacts": [
        {
          "id": "elasticsearch:health:shards_capacity:impact:upgrade_blocked",
          "severity": 1,
          "description": "The cluster has too many used shards to be able to upgrade.",
          "impact_areas": [
            "deployment_management"
          ]
        },
        {
          "id": "elasticsearch:health:shards_capacity:impact:creation_of_new_indices_blocked",
          "severity": 1,
          "description": "The cluster is running low on room to add new shards. Adding data to new indices is at risk",
          "impact_areas": [
            "ingest"
          ]
        }
      ],
      "diagnosis": [
        {
          "id": "elasticsearch:health:shards_capacity:diagnosis:increase_max_shards_per_node",
          "cause": "Elasticsearch is about to reach the maximum number of shards it can host, based on your current settings.",
          "action": "Increase the value of [cluster.max_shards_per_node] cluster setting or remove data indices to clear up resources.",
          "help_url": "https://ela.st/fix-shards-capacity"
        }
      ]
    }
  }
}
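
As a worked reading of the `details` section above, the sketch below recomputes the status from the two groups' numbers. It is an illustration under stated assumptions, not the PR's code: the class and method names are hypothetical, available room is taken to be `max_shards_in_cluster` minus `current_used_shards`, and the overall status is assumed to be the least healthy status among the groups.

```java
// Hypothetical worked example for the response above; names are illustrative only.
final class ShardsCapacityReportSketch {

    // Declared from healthiest to least healthy so compareTo() orders by severity.
    enum Status { GREEN, YELLOW, RED }

    // Assumption: available room = max_shards_in_cluster - current_used_shards.
    static Status groupStatus(int maxShardsInCluster, int currentUsedShards) {
        int room = maxShardsInCluster - currentUsedShards;
        if (room < 5) {
            return Status.RED;
        }
        if (room < 10) {
            return Status.YELLOW;
        }
        return Status.GREEN;
    }

    // Assumption: the indicator reports the least healthy of the group statuses.
    static Status worst(Status a, Status b) {
        return a.compareTo(b) >= 0 ? a : b;
    }
}
```

With the values from the response, the data group has 14 - 10 = 4 shards of room (`RED`) and the frozen group has 10 - 0 = 10 (`GREEN`), so `worst(groupStatus(14, 10), groupStatus(10, 0))` gives `RED`, which matches the reported `"status": "red"`.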

Relates to #94079 and #91119

@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) and v8.8.0 labels Mar 20, 2023
@HiDAl HiDAl added the Team:Data Management (Meta label for data/management team), :Data Management/Health, and >feature labels and removed the needs:triage (Requires assignment of a team area label) label Mar 20, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @HiDAl, I've created a changelog YAML for you.

@HiDAl HiDAl marked this pull request as draft March 20, 2023 14:13
@HiDAl HiDAl marked this pull request as ready for review March 21, 2023 17:43
@HiDAl HiDAl requested a review from andreidan March 21, 2023 17:43
@HiDAl
Author

HiDAl commented Mar 21, 2023

@elasticsearchmachine run elasticsearch-ci/part-3

Contributor

@andreidan andreidan left a comment

Thanks for working on this Pablo

This generally looks great, I left a few rather minor comments

@tylerperk can you please go through the copy (as most of the output of the API is presented in the UI)
@shubhaat would you like to have a go through the copy?

@andreidan
Contributor

add tests for public methods
remove a method which could lead to confusion
this makes the method generic enough that the internal logic can be tested easily
@HiDAl HiDAl requested review from andreidan and tylerperk March 22, 2023 18:46
@HiDAl HiDAl changed the title from "Add new ShardLimits Health Indicator Service" to "Add new Shards Capacity Health Indicator" Mar 23, 2023
@HiDAl
Author

HiDAl commented Mar 23, 2023

@andreidan I renamed the indicator to ShardsCapacity.

  1. I didn't rename the class ShardLimitsValidator because more than 30 files use it, so this PR would easily become a mess. I'll rename it in a separate PR.
  2. I didn't rename the record ShardLimitsMetadata because it actually contains the configured limits.

@HiDAl
Author

HiDAl commented Mar 23, 2023

@elasticsearchmachine run elasticsearch-ci/part-3

@andreidan
Contributor

@HiDAl the ShardLimitsValidator and ShardLimitsMetadata can stay named as they are IMO (they're not user-facing and are extensively documented)

Can you please update the PR description to reflect the latest state?

Contributor

@andreidan andreidan left a comment

Thanks for iterating on this Pablo. This LGTM 🚀 - left a few very minor suggestions

@HiDAl
Author

HiDAl commented Mar 24, 2023

@andreidan I've applied all the recommended changes :)

@HiDAl
Author

HiDAl commented Mar 24, 2023

@elasticmachine update branch

@HiDAl
Author

HiDAl commented Mar 24, 2023

@elasticsearchmachine run elasticsearch-ci/part-1

@HiDAl HiDAl merged commit 5c353b0 into elastic:main Mar 24, 2023
@HiDAl HiDAl deleted the new-SL-indicator branch March 24, 2023 14:05
@HiDAl HiDAl added the cloud-deploy (Publish cloud docker image for Cloud-First-Testing) label Mar 24, 2023
HiDAl pushed a commit to HiDAl/elasticsearch that referenced this pull request Apr 12, 2023
elastic#94552 introduced a new Health Service which checks the shards
capacity of the cluster. This service replaces the old
`ClusterDeprecationChecks#checkShard` check used to validate the feasibility
of upgrading a cluster.
HiDAl pushed a commit that referenced this pull request Jun 27, 2023
#94552 introduced a new Health Service which checks the shards
capacity of the cluster. This service replaces the old
`ClusterDeprecationChecks#checkShard` check used to validate the feasibility
of upgrading a cluster.