
Conversation

@idegtiarenko (Contributor) commented Apr 21, 2023

When reconciling a balance with a lot of shards on undesired nodes, there is a possibility of causing node hot spots due to the use of nodeInterleavedShardIterator. This iterator orders shards based on the nodes they are located on, and orders those nodes by hash map iteration order. As a result, reconciliation tends to repeatedly pick the shards returned first by the iterator.

Depends on #96025

Related to #91386

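To illustrate the intended fix (an OrderedShardsIterator that applies a custom shard order based on allocation recency, as described in the merged commit message), here is a minimal, hypothetical sketch, not the actual Elasticsearch class: shards are drained node by node, with the least recently allocated-to node first. All names below are illustrative.

import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch only, not the Elasticsearch implementation: drain shards node by
// node, visiting the least recently allocated-to node first, instead of interleaving
// nodes in hash-map order.
class NodeOrderedShardIterator<S> implements Iterator<S> {

    private final ArrayDeque<Iterator<S>> perNodeShards = new ArrayDeque<>();
    private Iterator<S> current;

    NodeOrderedShardIterator(Map<String, List<S>> shardsByNode, Map<String, Long> lastAllocationTime) {
        shardsByNode.entrySet().stream()
            // nodes that were allocated to least recently come first
            .sorted(Comparator.comparingLong(
                (Map.Entry<String, List<S>> e) -> lastAllocationTime.getOrDefault(e.getKey(), Long.MIN_VALUE)))
            .forEach(e -> perNodeShards.add(e.getValue().iterator()));
    }

    @Override
    public boolean hasNext() {
        while ((current == null || current.hasNext() == false) && perNodeShards.isEmpty() == false) {
            current = perNodeShards.poll(); // move on to the next node's shards
        }
        return current != null && current.hasNext();
    }

    @Override
    public S next() {
        if (hasNext() == false) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}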
@idegtiarenko added the >bug, :Distributed Coordination/Allocation, Team:Distributed (Obsolete), and v8.8.0 labels Apr 21, 2023
@DaveCTurner (Contributor) left a comment


I wonder if we should stop with the interleaving altogether and just try all the shards from the least-recently-touched node until we find one to move, then move on to the next-least-recently-touched node and try all its shards, and so on.

case YES -> {
if (logger.isTraceEnabled()) {
logger.trace("Assigned shard [{}] to [{}]", shard, desiredNodeId);
if (logger.isDebugEnabled()) {
Contributor:

nit: no real need for these checks, they're the first thing that logger.debug() does anyway
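For context on this nit, a small hedged sketch of the suggested simplification (using the Log4j API that Elasticsearch's logging builds on; the surrounding class is hypothetical): the level check already happens inside logger.trace()/logger.debug(), so the explicit guard only helps when computing the arguments is expensive.

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class TraceLoggingSketch {
    private static final Logger logger = LogManager.getLogger(TraceLoggingSketch.class);

    void logAssignment(Object shard, String desiredNodeId) {
        // Before: explicit guard around the call.
        if (logger.isTraceEnabled()) {
            logger.trace("Assigned shard [{}] to [{}]", shard, desiredNodeId);
        }

        // After: logger.trace() performs the same level check internally, and the
        // parameterized message avoids building the string eagerly, so the guard is redundant.
        logger.trace("Assigned shard [{}] to [{}]", shard, desiredNodeId);
    }
}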

@idegtiarenko (Contributor, author):

I wonder if we should stop with the interleaving altogether and just try all the shards from the least-recently-touched node until we find one to move, then move on to the next-least-recently-touched node and try all its shards, and so on.

We need to handle the case where it is allowed to move more shards than there are nodes in the cluster (or somehow return to the first node again)

@DaveCTurner (Contributor):

We need to handle the case where it is allowed to move more shards than there are nodes in the cluster (or somehow return to the first node again)

Yes, we'd keep iterating until we've considered moving every shard on every node.

@gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@idegtiarenko marked this pull request as ready for review May 10, 2023 15:01
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine (Collaborator):

Hi @idegtiarenko, I've created a changelog YAML for you.

return new OrderedNodesShardsIterator();
}

private class OrderedNodesShardsIterator implements Iterator<ShardRouting> {
Contributor (author):

I am planning to move this to a top-level class and add some unit tests for this

return nextShard;
}

public void dePrioritizeNode(String nodeId) {
Contributor:

nit: IMO we should update moveOrdering here rather than relying on the caller calling both dePrioritizeNode and recordAllocation.
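A hedged sketch of the shape this nit asks for, using stand-in types rather than the real NodeAllocationOrdering/OrderedShardsIterator: dePrioritizeNode records the allocation itself, so callers only need one call.

import java.util.ArrayDeque;
import java.util.LinkedHashSet;

// Hypothetical stand-in, not the real classes: dePrioritizeNode updates the move
// ordering itself instead of relying on the caller to also invoke recordAllocation.
class MoveOrderingSketch {

    // recency order: the most recently recorded node sits at the end
    private final LinkedHashSet<String> moveOrdering = new LinkedHashSet<>();
    // per-node iteration queue: least recently touched node at the front
    private final ArrayDeque<String> nodeQueue = new ArrayDeque<>();

    void recordAllocation(String nodeId) {
        moveOrdering.remove(nodeId);
        moveOrdering.add(nodeId); // nodeId is now the most recently touched node
    }

    void dePrioritizeNode(String nodeId) {
        // push the node's remaining shards to the back of the iteration order ...
        if (nodeQueue.remove(nodeId)) {
            nodeQueue.addLast(nodeId);
        }
        // ... and update the ordering here, so a single call is enough for callers
        recordAllocation(nodeId);
    }
}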

@idegtiarenko requested a review from DaveCTurner May 11, 2023 14:42
@henningandersen (Contributor) left a comment

This looks good to me. I left a number of minor comments.

}

private void moveShards() {
// Iterate over the started shards interleaving between nodes, and check if they can remain. In the presence of throttling
Contributor:

I think this comment needs updating.

return;
}

// Iterate over the started shards interleaving between nodes, and try to move any which are on undesired nodes. In the presence of
Contributor:

Also update comment here?

import java.util.NoSuchElementException;
import java.util.Objects;

public class OrderedShardsIterator implements Iterator<ShardRouting> {
Contributor:

Can we add javadoc here explaining the intended order?
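One possible wording for such a javadoc, sketched from this discussion and not the text that was actually merged:

/**
 * Iterates the started shards of the routing nodes one node at a time: all shards of the
 * least recently allocated-to node are returned first, then all shards of the next node in
 * the ordering, and so on. Unlike the interleaved iterator this replaces, it does not
 * alternate between nodes. A node may be de-prioritized mid-iteration (for example when its
 * moves are throttled), which pushes its remaining shards to the back of the order.
 */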

}
allocationOrdering.retainNodes(getNodeIds(allocation.routingNodes()));
recordTime(cumulativeReconciliationTime, new DesiredBalanceReconciler(desiredBalance, allocation, allocationOrdering)::run);
allocationOrdering.retainNodes(allocation.routingNodes().getAllNodeIds());
Contributor:

I think we should also do retainNodes on moveOrdering?

}

var summary = totalOutgoingMoves.values().stream().mapToInt(AtomicInteger::get).summaryStatistics();
assertThat(
Contributor:

The text says "similar", whereas the assertion is more precise (max 1 difference). I wonder if you can add a comment explaining how we are sure of a max difference of 1 here when it works?

My intuition says that if the initially picked random currentNodeId is unfortunate enough that only one node needs any shard movements, we could get a distance of two here?

Contributor:

I agree - I think we should remove the fully-reconciled nodes at the top of the loop, rather than doing it after the first assertion.

Contributor (author):

one node needs any shard movements, we could get a distance of two here

Fixed by adding only nodes that require reconciliation to totalOutgoingMoves

var ordering = new NodeAllocationOrdering();
ordering.recordAllocation("node-1");

var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
Contributor:

nit: let us use immutable instead, it seems more in line with the read-only nature of the iterator:

Suggested change:
- var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
+ var iterator = OrderedShardsIterator.create(RoutingNodes.immutable(routing, nodes), ordering);

Contributor (author):

While the iterator does not perform any changes, I wanted to test it with mutable RoutingNodes as this is what we supply to it in the real code

Contributor:

By testing it with immutable, we also validate that the iterator does not mutate the routing nodes. We can compromise on randomizing it maybe ;-).

var ordering = new NodeAllocationOrdering();
ordering.recordAllocation("node-1");

var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
Contributor:

nit: let us use immutable instead, it seems more in line with the read-only nature of the iterator:

Suggested change:
- var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
+ var iterator = OrderedShardsIterator.create(RoutingNodes.immutable(routing, nodes), ordering);

ordering.recordAllocation("node-3");
ordering.recordAllocation("node-2");

var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
Contributor:

Suggested change:
- var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
+ var iterator = OrderedShardsIterator.create(RoutingNodes.immutable(routing, nodes), ordering);

var routing = RoutingTable.builder()
    .add(index("index-1a", "node-1"))
    .add(index("index-1b", "node-1"))
    .add(index("index-2", "node-2"))
Contributor:

Can we add an extra index for node-2 to see that that extra index comes out before the shards on node-1? I.e., that it does not return one shard from each node like the interleaved one.
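A rough sketch of that extra case, reusing the helpers visible in the excerpts above; the nodes fixture, the index(...) helper, and the assumption that a node with no recorded allocation is ordered before a recorded one are all taken from the surrounding test and not shown or verified here.

// Hypothetical sketch: with two indices on node-2 and node-1 recorded as the most recently
// allocated-to node, all node-2 shards should come out before any node-1 shard,
// i.e. no interleaving between nodes.
var routing = RoutingTable.builder()
    .add(index("index-1a", "node-1"))
    .add(index("index-1b", "node-1"))
    .add(index("index-2a", "node-2"))
    .add(index("index-2b", "node-2"))
    .build();

var ordering = new NodeAllocationOrdering();
ordering.recordAllocation("node-1");

var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);

assertThat(iterator.next().currentNodeId(), equalTo("node-2"));
assertThat(iterator.next().currentNodeId(), equalTo("node-2"));
assertThat(iterator.next().currentNodeId(), equalTo("node-1"));
assertThat(iterator.next().currentNodeId(), equalTo("node-1"));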

@henningandersen (Contributor) left a comment

LGTM.

// ensure that we do not cause hotspots by round-robin unreconciled source nodes when picking next rebalance
// (already reconciled nodes are excluded as they are no longer causing new moves)
assertThat(
"Every node expect to have similar amount of outgoing rebalances: " + totalOutgoingMoves,
Contributor:

Suggested change:
- "Every node expect to have similar amount of outgoing rebalances: " + totalOutgoingMoves,
+ "Reconciling nodes should all have same amount (max 1 delta) of moves, since we allow only 2 outgoing recoveries by default: " + totalOutgoingMoves,

Contributor:

No action needed, it's only an assertion message, but I don't think this is right:

since we allow only 2 outgoing recoveries by default

The max 1 delta is because we round-robin through the nodes.

Contributor (author):

Sounds good, I updated the message to "Reconciling nodes should all have same amount (max 1 delta) of moves"

Contributor:

I think that if we allowed 3 outgoing recoveries, we risked one node having just one shard left to move and another having 3 shards left to move. In that case, we'd have a delta of 2 even when we round-robin. Did I misunderstand something, or does the check here not rely on doing at most 2 outgoing recoveries at a time?

Contributor:

In that case we'd move one shard from each node and then remove the completed node from consideration, so from then on the delta would always be zero.

Contributor:

If we allowed 3 outgoing recoveries, I think we would risk one node having one outgoing relocation and the other node having 3 outgoing relocations, if these were the very last pending relocations needed. This would happen in one DesiredBalanceReconciler.run round, i.e., we would not remove from totalOutgoingMoves until after incrementing the counts and validating - and I think that would make the test fail?

I should probably try this out....

public void testRebalanceDoesNotCauseHotSpots() {

    int numberOfNodes = randomIntBetween(5, 9);
    int shardsPerNode = randomIntBetween(4, 15);
Contributor:

Is there any reason not to go to 1 here? I.e.:

Suggested change:
- int shardsPerNode = randomIntBetween(4, 15);
+ int shardsPerNode = randomIntBetween(1, 15);

@DaveCTurner (Contributor) left a comment

LGTM2

Did we also agree to backport this to 8.8? I think we should.

@idegtiarenko (Contributor, author) commented May 16, 2023

Did we also agree to backport this to 8.8? I think we should.

I think it is worth backporting, as we might look for other allocator improvements.
I will open a backport PR, but will not merge it automatically until we make a final decision.

@idegtiarenko added the auto-backport and v8.8.0 labels May 16, 2023
@idegtiarenko (Contributor, author):

@elasticsearchmachine please run elasticsearch-ci/bwc

@idegtiarenko merged commit 6ecd74d into elastic:main May 16, 2023
@idegtiarenko deleted the balance_priorities_during_reconciliation branch May 16, 2023 10:36
@elasticsearchmachine (Collaborator):

💔 Backport failed

Branch 8.8: commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 95454

idegtiarenko added a commit to idegtiarenko/elasticsearch that referenced this pull request May 16, 2023
When reconciling a balance with a lot of shards on undesired nodes, there is a possibility of causing node hot spots due to the use of nodeInterleavedShardIterator. This iterator orders shards based on the nodes they are located on, and orders those nodes by hash map iteration order. This means it tends to pick the shards returned first by the iterator. This change uses OrderedShardsIterator, which applies a custom shard order based on allocation recency.

(cherry picked from commit 6ecd74d)
idegtiarenko added a commit that referenced this pull request May 16, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 31, 2023
elasticsearchmachine pushed a commit that referenced this pull request May 31, 2023
DaveCTurner added a commit that referenced this pull request May 31, 2023
DaveCTurner added a commit that referenced this pull request May 31, 2023
DaveCTurner added a commit that referenced this pull request May 31, 2023