Conversation

@hanbj
Contributor

@hanbj hanbj commented May 31, 2019

Our cluster encountered a bottleneck in metadata operations, so we optimized it.
This change has been running in our production environment for about half a year; compared with the previous behavior, performance has improved by dozens of times.

@matriv matriv added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label May 31, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch self-requested a review June 4, 2019 14:59
Contributor

@ywelsch ywelsch left a comment


This is an interesting optimization that we wanted to look into as well at some point. One difference I had in mind when it comes to the implementation in this PR was to not reroute based on a schedule, but do the delayed full reroute by submitting a lower priority cluster state update task, so that other tasks with higher priority can make progress. I also don't think it makes sense to separate out moveShards as a separate step from allocateUnassigned and rebalance, as moveShards can be triggered by more events than just setting changes. For example, disk watermarks going above a certain threshold will trigger a move of shards.
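A minimal sketch of that suggestion, assuming the `ClusterService`/`ClusterStateUpdateTask` API of that era (`clusterService`, `allocationService` and `logger` are assumed to be in scope; this is not the actual PR code):

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateUpdateTask;
import org.elasticsearch.common.Priority;

// Submit the delayed full reroute as a NORMAL-priority cluster state update
// task, so URGENT and HIGH tasks already in the master's queue run first.
void submitDeferredReroute() {
    clusterService.submitStateUpdateTask("deferred_reroute", new ClusterStateUpdateTask(Priority.NORMAL) {
        @Override
        public ClusterState execute(ClusterState currentState) {
            // Perform the full reroute only when nothing more urgent is pending.
            return allocationService.reroute(currentState, "deferred reroute");
        }

        @Override
        public void onFailure(String source, Exception e) {
            logger.warn("deferred reroute failed", e);
        }
    });
}
```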

Have you run our existing test suite on these changes?

@hanbj
Contributor Author

hanbj commented Jun 6, 2019

I think it makes sense to separate them. For example: do we need to rebalance immediately when creating or deleting an index? Do we need to execute the logic of the moveShards() method when creating or deleting an index? And so on. Each method traverses all indices and shards, which has no effect when the number of indices and shards is relatively small, but our cluster has more than 30,000 indices and more than 300,000 shards, so a single metadata change takes several minutes.

This caused hundreds of thousands of tasks to pile up in our cluster.

Submitting a lower-priority cluster state update task is right, so I set the priority to NORMAL so that other tasks with higher priority can make progress.
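For context, a paraphrased sketch of `BalancedShardsAllocator#allocate` from that era (details elided; `logger`, `weightFunction` and `threshold` are fields of the allocator). Every full reroute runs all three phases, and each phase iterates over every shard in the cluster:

```java
public void allocate(RoutingAllocation allocation) {
    final Balancer balancer = new Balancer(logger, allocation, weightFunction, threshold);
    balancer.allocateUnassigned(); // assign shards that currently have no node
    balancer.moveShards();         // move shards that can no longer remain where they are
    balancer.balance();            // rebalance shards across nodes by weight
}
```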

@hanbj
Contributor Author

hanbj commented Jun 6, 2019

When metadata change operations are performed frequently, printing the thread stacks shows threads stuck for a long time in balanceByWeights(), shardsWithState(), awaitAllNodes() and other methods.
[screenshot: thread stack traces]
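For reference, a paraphrased sketch of `RoutingNodes#shardsWithState`, one of the hot methods above (handling of unassigned shards elided). Each call scans every shard on every node, so with ~300,000 shards these repeated linear scans dominate the master's time:

```java
public List<ShardRouting> shardsWithState(ShardRoutingState state) {
    List<ShardRouting> shards = new ArrayList<>();
    for (RoutingNode routingNode : this) {                 // every data node
        shards.addAll(routingNode.shardsWithState(state)); // every matching shard on it
    }
    return shards;
}
```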

@ywelsch
Contributor

ywelsch commented Jul 1, 2019

@hanbj I've made some suggestions on how to evolve this PR and also left a specific question about the tests, on which you have not commented. As noted above, we are interested in improving the system along the lines you've suggested, but with some adjustments. Are you interested in exploring that solution and taking this PR forward? If not, I would prefer to close this PR and open an issue instead to track this as an open item.

> but our cluster has more than 30,000 indices and more than 300,000 shards

300,000 shards in a single cluster is perhaps a bit too much. Splitting this cluster up into multiple separate clusters will help with the general cluster health, improve fault-tolerance of the system and make operational tasks such as cluster restarts etc. much faster. With that high number of shards, you will also encounter other issues, not only related to shard balancing. Think for example about shard-level stats/metric collection etc.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 16, 2019
Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to elastic#42738 and elastic#42105.
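The mechanism in this commit can be sketched roughly as follows (illustrative names, not the exact Elasticsearch classes; a real implementation also has to batch reasons and handle failures more carefully):

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateUpdateTask;
import org.elasticsearch.cluster.routing.allocation.AllocationService;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.Priority;

// Shard-started events no longer reroute inline; they request a single
// deferred reroute at NORMAL priority, and concurrent requests collapse
// into one pending task.
class DeferredRerouteService {
    private final Object mutex = new Object();
    private boolean rerouteQueued; // guarded by mutex
    private final ClusterService clusterService;
    private final AllocationService allocationService;

    DeferredRerouteService(ClusterService clusterService, AllocationService allocationService) {
        this.clusterService = clusterService;
        this.allocationService = allocationService;
    }

    void requestReroute(String reason) {
        synchronized (mutex) {
            if (rerouteQueued) {
                return; // piggyback on the already-pending reroute
            }
            rerouteQueued = true;
        }
        clusterService.submitStateUpdateTask("deferred reroute (" + reason + ")",
            new ClusterStateUpdateTask(Priority.NORMAL) { // yields to URGENT shard-started batches
                @Override
                public ClusterState execute(ClusterState currentState) {
                    synchronized (mutex) {
                        rerouteQueued = false; // later events will queue a fresh reroute
                    }
                    return allocationService.reroute(currentState, reason);
                }

                @Override
                public void onFailure(String source, Exception e) {
                    synchronized (mutex) {
                        rerouteQueued = false;
                    }
                }
            });
    }
}
```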
DaveCTurner added a commit that referenced this pull request Jul 18, 2019
* Defer reroute when starting shards

Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to #42738 and #42105.

* Specific test case for followup priority setting

We cannot set the priority in all InternalTestClusters because the deprecation
warning makes some tests unhappy. This commit adds a specific test instead.

* Checkstyle

* Cluster state always changed here

* Assert consistency of routing nodes

* Restrict setting only to reasonable priorities
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2019
@hanbj
Contributor Author

hanbj commented Jul 31, 2019

@ywelsch It is my pleasure and honor to discuss the solution to this problem with you.
Sorry, I have been too busy to follow up on this PR recently. I am going to improve the code and add some tests. DaveCTurner's code is great, but I don't think it is the best way.

@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@ywelsch
Contributor

ywelsch commented May 27, 2020

Closing this due to inactivity

@ywelsch ywelsch closed this May 27, 2020
@hanbj
Contributor Author

hanbj commented May 29, 2020

@ywelsch Thank you. I found that this is still a problem. We are planning to implement it in another way, using namespaces, so that an ES cluster can be expanded without limit; that is the reason I have not updated this PR for so long. I am very sorry.
