Introduce desired-balance allocator #91343

DaveCTurner · 2022-11-07T10:09:47Z

Today when updating the routing table (i.e. within `AllocationService#reroute()`) Elasticsearch computes the desired balance of shards and then identifies some shard movements that work towards that goal. At the end of the computation it discards the computed desired allocation and recomputes it the next time round. Recomputing the desired allocation each time is inefficient, and it makes it hard to predict how long it will take until the goal is reached. The computation also happens on the critical path for cluster state updates.

With this commit we introduce a new allocator which keeps hold of the desired balance between iterations. It also computes the desired balance asynchronously, allowing other cluster state updates to happen while the computation is ongoing.

Relates #88647, #83777, and many more.
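
To illustrate the idea described above (retain the last result between iterations and compute off the calling thread), here is a minimal sketch. It is not the code in this PR: `LatestInputComputation`, `onNewInput` and `lastResult` are hypothetical names, and the sketch assumes inputs are non-null and that only the newest input matters.

```java
import java.util.Objects;
import java.util.concurrent.Executor;
import java.util.function.Function;

// Hypothetical sketch: class and method names are illustrative, not Elasticsearch API.
final class LatestInputComputation<I, R> {
    private final Executor executor;
    private final Function<I, R> compute;
    private I pendingInput;          // newest unprocessed input, guarded by this
    private boolean running;         // true while a worker is draining, guarded by this
    private volatile R lastResult;   // retained between iterations

    LatestInputComputation(Executor executor, Function<I, R> compute) {
        this.executor = Objects.requireNonNull(executor);
        this.compute = Objects.requireNonNull(compute);
    }

    /** Records the newest input and returns immediately; the computation runs on the executor. */
    void onNewInput(I input) {
        Objects.requireNonNull(input);
        synchronized (this) {
            pendingInput = input;    // overwrite: stale inputs are never computed
            if (running) {
                return;              // the active worker will pick this input up
            }
            running = true;
        }
        executor.execute(this::drain);
    }

    private void drain() {
        while (true) {
            final I input;
            synchronized (this) {
                input = pendingInput;
                pendingInput = null;
                if (input == null) {
                    running = false;
                    return;
                }
            }
            try {
                lastResult = compute.apply(input);
            } catch (RuntimeException e) {
                // Keep the previous result; a real implementation would log the failure.
            }
        }
    }

    R lastResult() {
        return lastResult;
    }

    public static void main(String[] args) {
        // Runnable::run executes inline, which keeps this demo deterministic.
        LatestInputComputation<Integer, Integer> c = new LatestInputComputation<>(Runnable::run, x -> x * 2);
        c.onNewInput(21);
        System.out.println(c.lastResult());
    }
}
```

With a real thread pool as the executor, callers of `onNewInput` never wait for the computation, and inputs that arrive while a computation is in flight are coalesced so that only the newest one is processed next.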


elasticsearchmachine · 2022-11-07T10:10:11Z

Hi @DaveCTurner, I've created a changelog YAML for you.

elasticsearchmachine · 2022-11-07T10:10:12Z

Pinging @elastic/es-distributed (Team:Distributed)

server/src/internalClusterTest/java/org/elasticsearch/index/store/CorruptedFileIT.java

.../src/internalClusterTest/java/org/elasticsearch/cluster/coordination/RareClusterStateIT.java

server/src/main/java/org/elasticsearch/cluster/ClusterModule.java

henningandersen

Focused on the primary changes in reconciler and computer, left a number of comments, most of which can be deferred to follow-ups.

...er/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AllocationDecider.java

.../main/java/org/elasticsearch/cluster/routing/allocation/decider/ResizeAllocationDecider.java

...rc/main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceInput.java

server/src/main/java/org/elasticsearch/cluster/ClusterInfoSimulator.java

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

henningandersen · 2022-11-07T13:52:58Z

...in/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceReconciler.java

+            if (o1.primary() ^ o2.primary()) {
+                return o1.primary() ? -1 : 1;
+            }
+            if (o1.getIndexName().compareTo(o2.getIndexName()) == 0) {


I think this is more specific:

Suggested change

-            if (o1.getIndexName().compareTo(o2.getIndexName()) == 0) {
+            if (o1.getIndex().equals(o2.getIndex())) {

This is copied from BalancedShardsAllocator so we'd want to change it in both places. Not doing that here, but tracking this in #91386.
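
The difference between the two checks can be sketched as follows. This is an illustration only: `SketchIndex` and `SketchShard` are hypothetical stand-ins, and the sketch assumes that an index is identified by a name plus a UUID, so that an equality check on the index object is stricter than a comparison of names.

```java
import java.util.Comparator;

// Hypothetical stand-ins for illustration; not the Elasticsearch classes.
record SketchIndex(String name, String uuid) {}

record SketchShard(SketchIndex index, boolean primary) {
    String getIndexName() {
        return index.name();
    }
}

final class IndexEqualitySketch {
    // Primaries sort before replicas, mirroring the first check in the quoted diff.
    static final Comparator<SketchShard> PRIMARY_FIRST = (o1, o2) -> {
        if (o1.primary() ^ o2.primary()) {
            return o1.primary() ? -1 : 1;
        }
        return 0;
    };

    // Name-only comparison: two indices with the same name but different UUIDs match.
    static boolean sameIndexByName(SketchShard a, SketchShard b) {
        return a.getIndexName().compareTo(b.getIndexName()) == 0;
    }

    // Record equality compares both name and UUID, so it is the more specific check.
    static boolean sameIndexByIdentity(SketchShard a, SketchShard b) {
        return a.index().equals(b.index());
    }

    public static void main(String[] args) {
        SketchShard a = new SketchShard(new SketchIndex("logs", "uuid-1"), true);
        SketchShard b = new SketchShard(new SketchIndex("logs", "uuid-2"), false);
        System.out.println(sameIndexByName(a, b));
        System.out.println(sameIndexByIdentity(a, b));
    }
}
```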

...in/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceReconciler.java

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/NodeAllocationOrdering.java

.../main/java/org/elasticsearch/cluster/routing/allocation/allocator/PendingListenersQueue.java

henningandersen · 2022-11-07T16:20:15Z

.../main/java/org/elasticsearch/cluster/routing/allocation/allocator/PendingListenersQueue.java

+    }
+
+    public void completeAllAsNotMaster() {
+        completedIndex = -1;


This looks unsafe to me: if advance and completeAllAsNotMaster run on different threads, do we not risk advance setting completedIndex to a later index after we set it to -1 here?

I am not exactly sure why we reset the indexGenerator in DesiredBalanceShardsAllocator. Could it not continue where it left off in case the node becomes master again? That would avoid the reset here, which I think would simplify things.

Tracked in #91386. @idegtiarenko could you take a look at this?

It is important to complete the listeners so that we do not have stuck requests if the node is no longer master.
I also think it is worth resetting the desired balance to the empty/initial state, as there could be a lot of changes to the routing table by the time the node is elected as master again. I guess it is fine not to reset the index here (the one in the desired balance allocator should not be reset either).
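
One way to picture the outcome of this thread is a queue that fails every pending listener when the node stops being master but never moves the completed index backwards. This is a sketch of that idea only, not the follow-up tracked in #91386: `PendingQueueSketch`, its `Consumer<Boolean>` listener type (true for completed, false for not-master) and the assumption that indices are added in increasing order are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical sketch; names and listener type are illustrative, not the PR's classes.
final class PendingQueueSketch {
    private record Entry(long index, Consumer<Boolean> listener) {}

    private final Queue<Entry> pending = new ArrayDeque<>(); // guarded by this
    private long completedIndex = -1;                        // guarded by this, never decreases

    /** Registers a listener; assumes callers add indices in increasing order. */
    void add(long index, Consumer<Boolean> listener) {
        final boolean alreadyCompleted;
        synchronized (this) {
            alreadyCompleted = index <= completedIndex;
            if (alreadyCompleted == false) {
                pending.add(new Entry(index, listener));
            }
        }
        if (alreadyCompleted) {
            listener.accept(true);
        }
    }

    /** Completes every listener whose index is at or below upToIndex. */
    void complete(long upToIndex) {
        final List<Entry> done = new ArrayList<>();
        synchronized (this) {
            completedIndex = Math.max(completedIndex, upToIndex);
            while (pending.isEmpty() == false && pending.peek().index() <= completedIndex) {
                done.add(pending.poll());
            }
        }
        // Listeners run outside the lock so they cannot block other callers.
        done.forEach(e -> e.listener().accept(true));
    }

    /** Fails all pending listeners without touching completedIndex, so there is nothing to race on. */
    void completeAllAsNotMaster() {
        final List<Entry> failed;
        synchronized (this) {
            failed = new ArrayList<>(pending);
            pending.clear();
        }
        failed.forEach(e -> e.listener().accept(false));
    }

    public static void main(String[] args) {
        PendingQueueSketch queue = new PendingQueueSketch();
        queue.add(1, ok -> System.out.println("listener 1 completed: " + ok));
        queue.add(2, ok -> System.out.println("listener 2 completed: " + ok));
        queue.complete(1);
        queue.completeAllAsNotMaster();
    }
}
```

Because `completedIndex` only ever increases, a concurrent `complete` and `completeAllAsNotMaster` can interleave in either order without leaving the index in an inconsistent state, and no request is left waiting once the node is no longer master.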

henningandersen

A few more comments, all optional with respect to merging this (but I would like to see them addressed in follow-ups, though not necessarily immediately).

.../main/java/org/elasticsearch/cluster/routing/allocation/allocator/ContinuousComputation.java

...va/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceShardsAllocator.java

server/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalance.java

...in/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceReconciler.java

fcofdez

I read all the production code and this makes sense to me. This was mostly to familiarize myself with the changes as I didn't have enough time to review it thoroughly.

henningandersen

LGTM.

DaveCTurner · 2022-11-08T12:54:39Z

@elasticmachine please run elasticsearch-ci/bwc

DaveCTurner added the >enhancement, :Distributed Coordination/Allocation, and v8.6.0 labels Nov 7, 2022

DaveCTurner requested review from arteam, fcofdez, henningandersen and idegtiarenko November 7, 2022 10:09

DaveCTurner mentioned this pull request Nov 7, 2022

[WIP] Desired balance allocator #83777

Closed

elasticsearchmachine added the Team:Distributed (Obsolete) label Nov 7, 2022

Update docs/changelog/91343.yaml

734a9b7

DaveCTurner mentioned this pull request Nov 7, 2022

Ignore test #91345

Merged

idegtiarenko reviewed Nov 7, 2022

server/src/internalClusterTest/java/org/elasticsearch/index/store/CorruptedFileIT.java (outdated, resolved)

idegtiarenko reviewed Nov 7, 2022

.../src/internalClusterTest/java/org/elasticsearch/cluster/coordination/RareClusterStateIT.java (outdated, resolved)

idegtiarenko reviewed Nov 7, 2022

server/src/main/java/org/elasticsearch/cluster/ClusterModule.java (outdated, resolved)

henningandersen reviewed Nov 7, 2022

henningandersen reviewed Nov 8, 2022

fcofdez reviewed Nov 8, 2022

DaveCTurner added 10 commits November 8, 2022 09:32

Merge branch 'main' into 2022-11-07-desired-balance-allocator

72c360e

Re-enable fixed tests

4e4a7b6

New default!

1b67865

Option -> Optional

2abe6ff

Javadoc for DesiredBalanceInput#index

0037ee4

Rename simulate() to simulateShardStarted()

7846fea

Comment

6924aa9

Rename variable

01beed0

Occasional INFO message about excessive iterations

528e3c0

Rename ShardAssignment#of

69ded3c

DaveCTurner added 2 commits November 8, 2022 10:11

Rename arg to PendingListenersQueue#complete() (and inline advance())

2034ac3

Make constant final

e34225b

DaveCTurner mentioned this pull request Nov 8, 2022

Follow-up work for desired balance allocator #91386

Open

33 tasks

DaveCTurner added 2 commits November 8, 2022 10:30

Javadoc on NodeAllocationOrdering

c599478

Use TreeMap rather than broken TreeSet

0e0684f

henningandersen approved these changes Nov 8, 2022

DaveCTurner added the auto-merge-without-approval label Nov 8, 2022

elasticsearchmachine merged commit 07056f5 into elastic:main Nov 8, 2022

DaveCTurner deleted the 2022-11-07-desired-balance-allocator branch November 8, 2022 13:23

mark-vieira mentioned this pull request Nov 10, 2022

[CI] :qa:full-cluster-restart:v7.0.1#upgradedClusterTest failing #91470

Closed

henningandersen added the release highlight label Nov 14, 2022

DaveCTurner mentioned this pull request Dec 9, 2022

[CI] CoordinatorTests testStateRecoveryResetAfterPreviousLeadership failing #91449

Closed

VimCommando mentioned this pull request Jun 24, 2023

Increasing cluster.routing.allocation.cluster_concurrent_rebalance causes redundant shard movements #87279

Open

masseyke mentioned this pull request Apr 8, 2024

Automatic index creation fails with Elasticsearch >= 8.6.0 elastic/elasticsearch-hadoop#2214

Closed

2 tasks

Rassyan mentioned this pull request May 12, 2025

Desired Balance Allocator Fails to Converge with High Shard Counts (8.x) #128021

Closed

5 participants