Allows failing shards without marking as stale #28054

Conversation
Currently, when failing a shard, we also remove its allocation ID from the in-sync set if the unassigned reason is not NODE_LEFT. This commit adds an option to explicitly mark a failing shard as stale or not. This is a preparatory change for the resync PR.
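For context, a minimal sketch of the first iteration of this change, in which the observer callback takes an explicit markAsStale flag (names taken from the diffs below; the surrounding interface is reduced to the relevant method):

```java
// Sketch only: shardFailed gains an explicit markAsStale flag instead of
// deriving staleness from the unassigned reason.
interface RoutingChangesObserver {
    /**
     * Called when a shard fails. If markAsStale is true, the failed shard's
     * allocation ID is also removed from the in-sync set; if false, the shard
     * is failed but its allocation ID stays in the in-sync set.
     */
    void shardFailed(ShardRouting failedShard, UnassignedInfo unassignedInfo, boolean markAsStale);
}
```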
ywelsch left a comment:
There are many places where you can set markAsStale to false (because the shard is not active). I wonder if it's nicer to have a separate removeInSyncId method on the RoutingAllocation class (delegating to IndexMetaDataUpdater) instead of adding this additional parameter to the shardFailed method. There are only two places where we would need to call the removeInSyncId method (at the moment, in CancelAllocationCommand (could go away later) and in AllocationService.applyFailedShards). WDYT?
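A sketch of the suggested shape, using the removeAllocationId name that the later diffs settle on (the comment above calls it removeInSyncId); the wiring of the IndexMetaDataUpdater field is assumed:

```java
// Sketch: RoutingAllocation exposes an explicit "remove from in-sync set"
// operation that delegates to the IndexMetaDataUpdater it already owns,
// so callers of shardFailed no longer need to pass a markAsStale flag.
public class RoutingAllocation {
    private final IndexMetaDataUpdater indexMetaDataUpdater = new IndexMetaDataUpdater();

    /** Removes the allocation ID of the provided shard from the in-sync set. */
    public void removeAllocationId(ShardRouting shardRouting) {
        indexMetaDataUpdater.removeAllocationId(shardRouting);
    }
}
```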
```diff
      remove(routing);
      routingChangesObserver.shardFailed(routing,
-         new UnassignedInfo(UnassignedInfo.Reason.REINITIALIZED, "primary changed"));
+         new UnassignedInfo(UnassignedInfo.Reason.REINITIALIZED, "primary changed"), true);
```
the value here does not matter (as the shard is initializing)
| "primary failed while replica initializing", null, 0, unassignedInfo.getUnassignedTimeInNanos(), | ||
| unassignedInfo.getUnassignedTimeInMillis(), false, AllocationStatus.NO_ATTEMPT); | ||
| failShard(logger, replicaShard, primaryFailedUnassignedInfo, indexMetaData, routingChangesObserver); | ||
| failShard(logger, replicaShard, primaryFailedUnassignedInfo, true, indexMetaData, routingChangesObserver); |
an initializing shard, so marking as stale does not matter.
```diff
  // cancel and remove target shard
  remove(targetShard);
- routingChangesObserver.shardFailed(targetShard, unassignedInfo);
+ routingChangesObserver.shardFailed(targetShard, unassignedInfo, markAsStale);
```
an initializing shard, so marking as stale does not matter.
```diff
      remove(failedShard);
  }
- routingChangesObserver.shardFailed(failedShard, unassignedInfo);
+ routingChangesObserver.shardFailed(failedShard, unassignedInfo, markAsStale);
```
an initializing shard, so marking as stale does not matter.
```diff
      null, 0, allocation.getCurrentNanoTime(), System.currentTimeMillis(), false, UnassignedInfo.AllocationStatus.NO_ATTEMPT);
  // don't cancel shard in the loop as it will cause a ConcurrentModificationException
- shardCancellationActions.add(() -> routingNodes.failShard(logger, shard, unassignedInfo, metaData.getIndexSafe(shard.index()), allocation.changes()));
+ shardCancellationActions.add(() -> routingNodes.failShard(logger, shard, unassignedInfo, true, metaData.getIndexSafe(shard.index()), allocation.changes()));
```
an initializing shard, so marking as stale does not matter.
@ywelsch, I've replaced the markAsStale argument with a separate method. It's nicer than the previous approach. Would you please have another look? Thank you.
ywelsch left a comment:
I like this better than the previous version. I've left a few more suggestions. I wonder how we can better test that we're marking the correct shards as stale so that when we add failShards calls in the future, we're sure to also correctly call markAsStale.
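One possible shape for such a test, assuming the applyFailedShards(state, failedShards, staleShards) overload from this PR and a hypothetical cluster state with index "idx" holding a started replica: fail the replica without listing it as stale, then assert its allocation ID survives in the in-sync set.

```java
// Sketch: failing a shard without a matching stale-shard entry should keep
// its allocation ID in the in-sync set of the index metadata.
ClusterState newState = allocationService.applyFailedShards(state,
    Collections.singletonList(new FailedShard(replica, "resync failed", null)),
    Collections.emptyList()); // no stale shards => allocation ID must stay in-sync
assertThat(newState.metaData().index("idx").inSyncAllocationIds(replica.shardId().id()),
    hasItem(replica.allocationId().getId()));
```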
```diff
   */
- private void removeAllocationId(ShardRouting shardRouting) {
+ void removeAllocationId(ShardRouting shardRouting) {
      changes(shardRouting.shardId()).removedAllocationIds.add(shardRouting.allocationId().getId());
```
We could add another method that removes the allocation ID of a shard that's not in the routing table, and then call that one in AllocationService.removeStaleIdsWithoutRoutingChanges instead of the complex logic there.
If it's not a straightforward change, that could be a follow-up.
I would prefer to do this in a follow-up.
```diff
  }
  routingNodes.failShard(Loggers.getLogger(CancelAllocationCommand.class), shardRouting,
      new UnassignedInfo(UnassignedInfo.Reason.REROUTE_CANCELLED, null), indexMetaData, allocation.changes());
+ allocation.removeAllocationId(shardRouting);
```
can you add a TODO here that this could go away in future versions?
```diff
  if (failure != null) {
      components.add("failure [" + ExceptionsHelper.detailedMessage(failure) + "]");
  }
+ components.add("markAsStale [" + markAsStale + "]");
```
it's a bit odd that shard started now also has this component. Maybe a good time to separate ShardEntry into StartedShardEntry and FailedShardEntry?
While doing that, you can also convert that class to use final fields by not implementing readFrom but instead creating a constructor that takes a StreamInput as parameter.
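A sketch of the suggested split, with the field set inferred from the diffs in this PR and the transport plumbing elided; FailedShardEntry carries markAsStale and uses final fields initialized from a StreamInput constructor rather than a mutable readFrom override:

```java
// Sketch: a dedicated FailedShardEntry so that started-shard messages no
// longer carry failure-only state such as markAsStale.
static class FailedShardEntry {
    final ShardId shardId;
    final String allocationId;
    final String message;
    final Exception failure;
    final boolean markAsStale;

    // Final fields are possible because deserialization happens in the
    // constructor rather than in a readFrom override on a mutable instance.
    FailedShardEntry(StreamInput in) throws IOException {
        shardId = ShardId.readShardId(in);
        allocationId = in.readString();
        message = in.readString();
        failure = in.readException();
        markAsStale = in.readBoolean();
    }

    void writeTo(StreamOutput out) throws IOException {
        shardId.writeTo(out);
        out.writeString(allocationId);
        out.writeString(message);
        out.writeException(failure);
        out.writeBoolean(markAsStale);
    }
}
```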
@ywelsch I've split the ShardEntry into started and failed entries; I will make a follow-up for the method.
# Conflicts:
#	server/src/test/java/org/elasticsearch/cluster/action/shard/ShardStateActionTests.java
@bleskes, I talked to Yannick; he would like you to have a look at this.
ywelsch left a comment:
I've left two more asks. Please also update the PR description to say that this allows shards to be failed without marking them as stale.
```diff
      waitForNewMasterAndRetry(actionName, observer, request, listener, changePredicate);
  } else {
-     logger.warn((Supplier<?>) () -> new ParameterizedMessage("{} unexpected failure while sending request [{}] to [{}] for shard entry [{}]", shardEntry.shardId, actionName, masterNode, shardEntry), exp);
+     logger.warn("unexpected failure while sending request [{}] to [{}] for shard entry [{}]", actionName, masterNode, request);
```
why not log the exception here?
Done
```java
    Updates updates = changes(shardRouting.shardId());
    if (updates.firstFailedPrimary == null) {
        // more than one primary can be failed (because of batching, primary can be failed, replica promoted and then failed...)
        updates.firstFailedPrimary = shardRouting;
```
I think we should leave the "firstFailedPrimary" logic in the shardFailed method for now. I think that this logic can go away later.
Yep, I put it back in the shardFailed method.
Thanks @ywelsch for your helpful review. I pushed another commit to address your last comments.
Currently, when failing a shard, we also mark it as stale (i.e. remove its allocation ID from the in-sync set). However, in some cases we need to be able to fail shards but keep them in the in-sync set. This commit adds that capability. This is a preparatory change to make the primary-replica resync less lenient. Relates #24841
Today, failures from the primary-replica resync are ignored, as a best effort to avoid marking shards as stale during a cluster restart. However, this can be problematic if replicas fail to execute the resync operations but do just fine in subsequent write operations. When this happens, a replica will miss some operations from the new primary. If the local checkpoint on the replica cannot advance because of the missing operations, there are several implications:

1. The global checkpoint won't advance, which causes both the primary and replicas to retain many index commits.
2. The engine on the replica won't flush periodically, because the uncommitted stats are calculated based on the local checkpoint.
3. The replica may use a large number of bitsets to keep track of operation seqnos.

We can prevent this issue while preserving the best-effort behavior by failing replicas that fail to execute resync operations without marking them as stale. We prepared the required infrastructure in #28049 and #28054 for this change. Relates #24841