Conversation

@hanbj
Contributor

@hanbj hanbj commented May 31, 2019

Our cluster encountered a bottleneck in metadata operations, so we optimized it.
This change has been running in our production environment for about half a year; compared with the previous behavior, performance has improved by dozens of times.

@matriv matriv added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label May 31, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch self-requested a review June 4, 2019 14:59
Contributor

@ywelsch ywelsch left a comment


This is an interesting optimization that we wanted to look into as well at some point. One difference I had in mind when it comes to the implementation in this PR was to not reroute based on a schedule, but do the delayed full reroute by submitting a lower priority cluster state update task, so that other tasks with higher priority can make progress. I also don't think it makes sense to separate out moveShards as a separate step from allocateUnassigned and rebalance, as moveShards can be triggered by more events than just setting changes. For example, disk watermarks going above a certain threshold will trigger a move of shards.
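A minimal sketch of that suggestion, assuming the `ClusterService`/`ClusterStateUpdateTask` API of that era (`clusterService`, `allocationService` and `logger` are assumed to be in scope; this is not the actual PR code):

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateUpdateTask;
import org.elasticsearch.common.Priority;

// Submit the delayed full reroute as a NORMAL-priority cluster state update
// task, so URGENT and HIGH tasks already in the master's queue run first.
void submitDeferredReroute() {
    clusterService.submitStateUpdateTask("deferred_reroute", new ClusterStateUpdateTask(Priority.NORMAL) {
        @Override
        public ClusterState execute(ClusterState currentState) {
            // Perform the full reroute only when nothing more urgent is pending.
            return allocationService.reroute(currentState, "deferred reroute");
        }

        @Override
        public void onFailure(String source, Exception e) {
            logger.warn("deferred reroute failed", e);
        }
    });
}
```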

Have you run our existing test suite on these changes?

@hanbj
Contributor Author

hanbj commented Jun 6, 2019

I think it makes sense to separate them. For example: do we need to rebalance immediately when creating or deleting an index? Do we need to execute the logic of the moveShards() method when creating or deleting an index? And so on. Each method traverses all indices and shards, which has no effect when the number of indices and shards is relatively small, but our cluster has more than 30,000 indices and more than 300,000 shards, so a single metadata change takes several minutes.

This caused hundreds of thousands of tasks to pile up in our cluster.

Submitting a lower-priority cluster state update task is right, so I set the priority to NORMAL so that other tasks with higher priority can make progress.
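For context, a paraphrased sketch of `BalancedShardsAllocator#allocate` from that era (details elided; `logger`, `weightFunction` and `threshold` are fields of the allocator). Every full reroute runs all three phases, and each phase iterates over every shard in the cluster:

```java
public void allocate(RoutingAllocation allocation) {
    final Balancer balancer = new Balancer(logger, allocation, weightFunction, threshold);
    balancer.allocateUnassigned(); // assign shards that currently have no node
    balancer.moveShards();         // move shards that can no longer remain where they are
    balancer.balance();            // rebalance shards across nodes by weight
}
```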

@hanbj
Contributor Author

hanbj commented Jun 6, 2019

When metadata change operations are performed frequently, printing the thread stacks shows threads stuck for a long time in balanceByWeights(), shardsWithState(), awaitAllNodes() and other methods.
[screenshot: thread stack traces]
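For reference, a paraphrased sketch of `RoutingNodes#shardsWithState`, one of the hot methods above (handling of unassigned shards elided). Each call scans every shard on every node, so with ~300,000 shards these repeated linear scans dominate the master's time:

```java
public List<ShardRouting> shardsWithState(ShardRoutingState state) {
    List<ShardRouting> shards = new ArrayList<>();
    for (RoutingNode routingNode : this) {                 // every data node
        shards.addAll(routingNode.shardsWithState(state)); // every matching shard on it
    }
    return shards;
}
```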

@ywelsch
Contributor

ywelsch commented Jul 1, 2019

@hanbj I've made some suggestions on how to evolve this PR and also left a specific question about the tests, on which you have not commented. As noted above, we are interested in improving the system along the lines you've suggested, but with some adjustments. Are you interested in exploring that solution and taking this PR forward? If not, I would prefer to close this PR and open an issue instead to track this as an open item.

> but our cluster has more than 30,000 indices and more than 300,000 shards

300,000 shards in a single cluster is perhaps a bit too much. Splitting this cluster up into multiple separate clusters will help with the general cluster health, improve fault-tolerance of the system and make operational tasks such as cluster restarts etc. much faster. With that high number of shards, you will also encounter other issues, not only related to shard balancing. Think for example about shard-level stats/metric collection etc.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 16, 2019
Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to elastic#42738 and elastic#42105.
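The mechanism in this commit can be sketched roughly as follows (illustrative names, not the exact Elasticsearch classes; a real implementation also has to batch reasons and handle failures more carefully):

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateUpdateTask;
import org.elasticsearch.cluster.routing.allocation.AllocationService;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.Priority;

// Shard-started events no longer reroute inline; they request a single
// deferred reroute at NORMAL priority, and concurrent requests collapse
// into one pending task.
class DeferredRerouteService {
    private final Object mutex = new Object();
    private boolean rerouteQueued; // guarded by mutex
    private final ClusterService clusterService;
    private final AllocationService allocationService;

    DeferredRerouteService(ClusterService clusterService, AllocationService allocationService) {
        this.clusterService = clusterService;
        this.allocationService = allocationService;
    }

    void requestReroute(String reason) {
        synchronized (mutex) {
            if (rerouteQueued) {
                return; // piggyback on the already-pending reroute
            }
            rerouteQueued = true;
        }
        clusterService.submitStateUpdateTask("deferred reroute (" + reason + ")",
            new ClusterStateUpdateTask(Priority.NORMAL) { // yields to URGENT shard-started batches
                @Override
                public ClusterState execute(ClusterState currentState) {
                    synchronized (mutex) {
                        rerouteQueued = false; // later events will queue a fresh reroute
                    }
                    return allocationService.reroute(currentState, reason);
                }

                @Override
                public void onFailure(String source, Exception e) {
                    synchronized (mutex) {
                        rerouteQueued = false;
                    }
                }
            });
    }
}
```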
DaveCTurner added a commit that referenced this pull request Jul 18, 2019
* Defer reroute when starting shards

Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to #42738 and #42105.

* Specific test case for followup priority setting

We cannot set the priority in all InternalTestClusters because the deprecation
warning makes some tests unhappy. This commit adds a specific test instead.

* Checkstyle

* Cluster state always changed here

* Assert consistency of routing nodes

* Restrict setting only to reasonable priorities
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2019
@hanbj
Contributor Author

hanbj commented Jul 31, 2019

@ywelsch It is my pleasure and honor to discuss the solution to this problem with you.
Sorry, I have been too busy to follow up on this PR recently. I am going to improve the code and add some tests. DaveCTurner's code is great, but I don't think it is the best way.

@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@ywelsch
Contributor

ywelsch commented May 27, 2020

Closing this due to inactivity

@ywelsch ywelsch closed this May 27, 2020
@hanbj
Contributor Author

hanbj commented May 29, 2020

@ywelsch Thank you. I found that this is still a problem. We are planning to implement it in another way, using namespaces, so that an ES cluster can be expanded without limit; that is the reason I have not updated this PR for so long. I am very sorry.
