Balance priorities during reconciliation #95454
Conversation
When reconciling a balance with a lot of shards on undesired nodes, there is a possibility of causing node hot spots due to the usage of nodeInterleavedShardIterator. This iterator interleaves shards based on the nodes they are located on, and orders the nodes by hash map iteration order. This means reconciliation tends to repeatedly pick the shards returned first by the iterator, concentrating moves on the same source nodes.
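The fix described later in this thread replaces that interleaving with an iteration that visits nodes in order of allocation recency. A minimal sketch of the idea, with illustrative names only (the PR's actual OrderedShardsIterator and NodeAllocationOrdering differ in detail):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Sketch: yield every shard of the least-recently-used node before moving on to the
// next node, instead of interleaving nodes in hash-map iteration order.
class RecencyOrderedShards<S> {

    // head = least recently touched node, tail = most recently touched
    private final Deque<String> nodeOrder = new ArrayDeque<>();

    void recordAllocation(String nodeId) {
        nodeOrder.remove(nodeId);  // O(n), fine for a sketch
        nodeOrder.addLast(nodeId);
    }

    List<S> shardsInOrder(Map<String, List<S>> shardsByNode) {
        List<S> ordered = new ArrayList<>();
        // nodes we have never allocated to count as least recently used
        for (var entry : shardsByNode.entrySet()) {
            if (nodeOrder.contains(entry.getKey()) == false) {
                ordered.addAll(entry.getValue());
            }
        }
        for (String nodeId : nodeOrder) {
            ordered.addAll(shardsByNode.getOrDefault(nodeId, List.of()));
        }
        return ordered;
    }
}
```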
DaveCTurner
left a comment
I wonder if we should stop with the interleaving altogether and just try all the shards from the least-recently-touched node until we find one to move, then move on to the next-least-recently-touched node and try all its shards, and so on.
case YES -> {
    if (logger.isTraceEnabled()) {
        logger.trace("Assigned shard [{}] to [{}]", shard, desiredNodeId);
...
if (logger.isDebugEnabled()) {
nit: no real need for these checks, they're the first thing that logger.debug() does anyway
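In other words, the guard adds nothing for a parameterized message, since the logger checks the level itself before formatting, e.g.:

```java
// before
if (logger.isTraceEnabled()) {
    logger.trace("Assigned shard [{}] to [{}]", shard, desiredNodeId);
}

// after: logger.trace() already checks whether TRACE is enabled
logger.trace("Assigned shard [{}] to [{}]", shard, desiredNodeId);
```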
We need to handle the case when it is allowed to move more shards than there are nodes in the cluster (or somehow return to the first node again).

Yes, we'd keep iterating until we've considered moving every shard on every node.
# Conflicts:
# server/src/test/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceReconcilerTests.java

Pinging @elastic/es-distributed (Team:Distributed)

Hi @idegtiarenko, I've created a changelog YAML for you.
    return new OrderedNodesShardsIterator();
}

private class OrderedNodesShardsIterator implements Iterator<ShardRouting> {
I am planning to move this to a top-level class and add some unit tests for this
    return nextShard;
}

public void dePrioritizeNode(String nodeId) {
nit: IMO we should update moveOrdering here rather than relying on the caller calling both dePrioritizeNode and recordAllocation.
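A sketch of what that could look like, assuming ordering is the NodeAllocationOrdering passed into the iterator (the exact field name and surrounding code are assumptions, not the PR's actual implementation):

```java
public void dePrioritizeNode(String nodeId) {
    // ... existing re-queueing of nodeId within this iterator ...
    ordering.recordAllocation(nodeId); // also push nodeId to the back of the shared ordering
}
```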
...va/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceReconcilerTests.java
henningandersen
left a comment
This looks good to me. I left a number of minor comments.
}

private void moveShards() {
    // Iterate over the started shards interleaving between nodes, and check if they can remain. In the presence of throttling
I think this comment needs updating.
    return;
}

// Iterate over the started shards interleaving between nodes, and try to move any which are on undesired nodes. In the presence of
Also update comment here?
import java.util.NoSuchElementException;
import java.util.Objects;

public class OrderedShardsIterator implements Iterator<ShardRouting> {
Can we add javadoc here explaining the intended order?
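For instance, something along these lines (the wording is a suggestion, not the javadoc that was actually added):

```java
/**
 * An iterator over all started shards of the cluster that visits nodes in order of how
 * recently they were the source of a shard allocation or movement: every shard on the
 * least recently used node is returned before any shard on the next node, and so on.
 * This avoids repeatedly picking shards from the same nodes (as the hash-map-ordered
 * interleaving iterator tends to do) and thereby creating hot spots during reconciliation.
 */
public class OrderedShardsIterator implements Iterator<ShardRouting> {
```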
}
allocationOrdering.retainNodes(getNodeIds(allocation.routingNodes()));
recordTime(cumulativeReconciliationTime, new DesiredBalanceReconciler(desiredBalance, allocation, allocationOrdering)::run);
allocationOrdering.retainNodes(allocation.routingNodes().getAllNodeIds());
I think we should also do retainNodes on moveOrdering?
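I.e. something like (a sketch, assuming moveOrdering sits alongside allocationOrdering here):

```java
allocationOrdering.retainNodes(allocation.routingNodes().getAllNodeIds());
moveOrdering.retainNodes(allocation.routingNodes().getAllNodeIds()); // also forget nodes that have left the cluster
```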
}

var summary = totalOutgoingMoves.values().stream().mapToInt(AtomicInteger::get).summaryStatistics();
assertThat(
The text says "similar", whereas the assertion is more precise (max 1 difference). I wonder if you can add a comment explaining how we are sure of max 1 difference here when it works?
My intuition says that if the currentNodeId that is picked randomly originally is unfortunate enough to ensure only one node needs any shard movements, we could get a distance of two here?
I agree - I think we should remove the fully-reconciled nodes at the top of the loop, rather than doing it after the first assertion.
> one node needs any shard movements, we could get a distance of two here

Fixing by adding only nodes that require reconciliation to totalOutgoingMoves.
var ordering = new NodeAllocationOrdering();
ordering.recordAllocation("node-1");

var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
nit: let us use immutable instead, it seems more in line with the read-only nature of the iterator:
- var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
+ var iterator = OrderedShardsIterator.create(RoutingNodes.immutable(routing, nodes), ordering);
While the iterator is not performing any changes, I wanted to test it with mutable RoutingNodes as this is what we supply in the real code.
By testing it with immutable, we also validate that the iterator does not mutate the routing nodes. We can compromise on randomizing it maybe ;-).
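A possible middle ground, randomizing between the two in the test (a sketch only):

```java
var routingNodes = randomBoolean()
    ? RoutingNodes.mutable(routing, nodes)     // what the production code passes in
    : RoutingNodes.immutable(routing, nodes);  // also proves the iterator does not mutate them
var iterator = OrderedShardsIterator.create(routingNodes, ordering);
```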
var ordering = new NodeAllocationOrdering();
ordering.recordAllocation("node-1");

var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
nit: let us use immutable instead, it seems more in line with the read-only nature of the iterator:
- var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
+ var iterator = OrderedShardsIterator.create(RoutingNodes.immutable(routing, nodes), ordering);
ordering.recordAllocation("node-3");
ordering.recordAllocation("node-2");

var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
- var iterator = OrderedShardsIterator.create(RoutingNodes.mutable(routing, nodes), ordering);
+ var iterator = OrderedShardsIterator.create(RoutingNodes.immutable(routing, nodes), ordering);
var routing = RoutingTable.builder()
    .add(index("index-1a", "node-1"))
    .add(index("index-1b", "node-1"))
    .add(index("index-2", "node-2"))
Can we add an extra index for node-2 to see that the extra index comes out before the shards on node-1? I.e., that it does not return one shard from each node like the interleaved iterator does.
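For example (a sketch of the extended setup, reusing the existing index(...) test helper; the extra index name and the .build() call are illustrative):

```java
var routing = RoutingTable.builder()
    .add(index("index-1a", "node-1"))
    .add(index("index-1b", "node-1"))
    .add(index("index-2a", "node-2"))
    .add(index("index-2b", "node-2")) // extra index on node-2
    .build();

// with node-2 ordered first, both index-2a and index-2b should be returned before any shard
// of node-1, rather than alternating between the nodes as the interleaved iterator would
```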
henningandersen
left a comment
LGTM.
// ensure that we do not cause hotspots by round-robin unreconciled source nodes when picking next rebalance
// (already reconciled nodes are excluded as they are no longer causing new moves)
assertThat(
    "Every node expect to have similar amount of outgoing rebalances: " + totalOutgoingMoves,
| "Every node expect to have similar amount of outgoing rebalances: " + totalOutgoingMoves, | |
| "Reconciling nodes should all have same amount (max 1 delta) of moves, since we allow only 2 outgoing recoveries by default: " + totalOutgoingMoves, |
No action needed, it's only an assertion message, but I don't think this is right:
since we allow only 2 outgoing recoveries by default
The max 1 delta is because we round-robin through the nodes.
Sounds good, I updated the message to Reconciling nodes should all have same amount (max 1 delta) of moves
I think that if we allowed 3 outgoing recoveries, we would risk one node having just one shard left to move and another having 3 shards left to move. In that case, we'd have a delta of 2 even when we round-robin. Did I misunderstand something, or does the check here not rely on doing at most 2 outgoing recoveries at a time?
In that case we'd move one shard from each node and then remove the completed node from consideration, so from then on the delta would always be zero.
If we allowed 3 outgoing recoveries I think we would risk one node having one outgoing relocation and the other node having 3 outgoing relocations, if these were the very last pending relocations needed. This would happen in one DesiredBalanceReconciler.run round, i.e., we would not remove from totalOutgoingMoves until after incrementing the counts and validating - and I think that would make the test fail?
I should probably try this out....
public void testRebalanceDoesNotCauseHotSpots() {

    int numberOfNodes = randomIntBetween(5, 9);
    int shardsPerNode = randomIntBetween(4, 15);
Is there any reason not to go to 1 here? I.e.:
- int shardsPerNode = randomIntBetween(4, 15);
+ int shardsPerNode = randomIntBetween(1, 15);
DaveCTurner
left a comment
LGTM2
Did we also agree to backport this to 8.8? I think we should.
I think it is worth backporting while we might look for other allocator improvements.

@elasticsearchmachine please run elasticsearch-ci/bwc
💔 Backport failed
You can use sqren/backport to manually backport by running
When reconciling a balance with a lot of shards on undesired nodes, there is a possibility of causing node hot spots due to the usage of nodeInterleavedShardIterator. This iterator interleaves shards based on the nodes they are located on, and orders the nodes by hash map iteration order. This means reconciliation tends to repeatedly pick the shards returned first by the iterator. This change uses OrderedShardsIterator, which applies a custom shard order based on allocation recency. (cherry picked from commit 6ecd74d)
Depends on #96025
Related to #91386