Conversation

@dnhatn (Member) commented Jan 2, 2018

Currently, when failing a shard we also mark it as stale (i.e. remove its
allocationId from the in-sync set). However, in some cases we need
to be able to fail shards but keep them in the in-sync set. This commit adds
such a capability. This is a preparatory change to make the primary-replica
resync less lenient.

Relates #24841

Currently, when failing a shard, we also remove its allocationId from the
in-sync set if the unassigned reason is not NODE_LEFT. This commit adds an
option to explicitly mark a failing shard as stale or not. This is a
preparatory change for the resync PR.
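
As a rough illustration of the shape this first iteration takes (based on the diffs discussed below), the failure notification carries an explicit boolean flag. Everything beyond the method names visible in those diffs is a sketch, not the exact upstream code:

```java
// Sketch only: the observer callback extended with an explicit markAsStale flag,
// mirroring the routingChangesObserver.shardFailed(..., markAsStale) calls shown
// in the review diffs below. Exact upstream signatures may differ.
public interface RoutingChangesObserver {
    /**
     * Called when a shard fails. If markAsStale is true, the shard's allocationId is
     * also removed from the in-sync set; if false, the failed copy stays in-sync.
     */
    void shardFailed(ShardRouting failedShard, UnassignedInfo unassignedInfo, boolean markAsStale);
}
```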
@dnhatn dnhatn requested review from jasontedor and ywelsch January 2, 2018 22:54
@dnhatn dnhatn changed the title add mark as stale option when failing shard Add mark as stale option when failing shard Jan 2, 2018
@ywelsch (Contributor) left a comment

There are many places where you can set markAsStale to false (because the shard is not active). I wonder if it's nicer to have a separate removeInSyncId method on the RoutingAllocation class (delegating to IndexMetaDataUpdater) instead of adding this additional parameter to the shardFailed method. There are only two places where we would need to call the removeInSyncId method: at the moment, CancelAllocationCommand (which could go away later) and AllocationService.applyFailedShards. WDYT?
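
A minimal sketch of what that suggested delegation might look like (names follow the later diffs in this thread; the wiring is illustrative, not the exact upstream code):

```java
// Sketch: instead of threading markAsStale through every shardFailed call, expose a
// dedicated method on RoutingAllocation that records the removal of an in-sync
// allocation id, delegating to IndexMetaDataUpdater. Wiring is illustrative.
public class RoutingAllocation {
    private final IndexMetaDataUpdater indexMetaDataUpdater; // assumed to be held here

    public RoutingAllocation(IndexMetaDataUpdater indexMetaDataUpdater) {
        this.indexMetaDataUpdater = indexMetaDataUpdater;
    }

    /** Marks a failed shard as stale by removing its allocation id from the in-sync set. */
    public void removeAllocationId(ShardRouting shardRouting) {
        indexMetaDataUpdater.removeAllocationId(shardRouting);
    }
}
```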

 remove(routing);
 routingChangesObserver.shardFailed(routing,
-    new UnassignedInfo(UnassignedInfo.Reason.REINITIALIZED, "primary changed"));
+    new UnassignedInfo(UnassignedInfo.Reason.REINITIALIZED, "primary changed"), true);
Contributor

the value here does not matter (as the shard is initializing)

"primary failed while replica initializing", null, 0, unassignedInfo.getUnassignedTimeInNanos(),
unassignedInfo.getUnassignedTimeInMillis(), false, AllocationStatus.NO_ATTEMPT);
failShard(logger, replicaShard, primaryFailedUnassignedInfo, indexMetaData, routingChangesObserver);
failShard(logger, replicaShard, primaryFailedUnassignedInfo, true, indexMetaData, routingChangesObserver);
Contributor

an initializing shard, so marking as stale does not matter.

 // cancel and remove target shard
 remove(targetShard);
-routingChangesObserver.shardFailed(targetShard, unassignedInfo);
+routingChangesObserver.shardFailed(targetShard, unassignedInfo, markAsStale);
Contributor

an initializing shard, so marking as stale does not matter.

     remove(failedShard);
 }
-routingChangesObserver.shardFailed(failedShard, unassignedInfo);
+routingChangesObserver.shardFailed(failedShard, unassignedInfo, markAsStale);
Contributor

an initializing shard, so marking as stale does not matter.

 null, 0, allocation.getCurrentNanoTime(), System.currentTimeMillis(), false, UnassignedInfo.AllocationStatus.NO_ATTEMPT);
 // don't cancel shard in the loop as it will cause a ConcurrentModificationException
-shardCancellationActions.add(() -> routingNodes.failShard(logger, shard, unassignedInfo, metaData.getIndexSafe(shard.index()), allocation.changes()));
+shardCancellationActions.add(() -> routingNodes.failShard(logger, shard, unassignedInfo, true, metaData.getIndexSafe(shard.index()), allocation.changes()));
Contributor

an initializing shard, so marking as stale does not matter.

@dnhatn (Member, Author) commented Jan 3, 2018

@ywelsch, I've replaced the markAsStale argument with a separate method. It's nicer than the previous approach. Would you please have another look? Thank you.

@ywelsch (Contributor) left a comment

I like this better than the previous version. I've left a few more suggestions. I wonder how we can better test that we're marking the correct shards as stale so that when we add failShards calls in the future, we're sure to also correctly call markAsStale.

 */
-private void removeAllocationId(ShardRouting shardRouting) {
+void removeAllocationId(ShardRouting shardRouting) {
     changes(shardRouting.shardId()).removedAllocationIds.add(shardRouting.allocationId().getId());
Contributor

We could add another method that removes the allocation id of a shard that's not in the routing table, and then call that one in AllocationService.removeStaleIdsWithoutRoutingChanges instead of the complex logic there.
If it's not a straightforward change, then that could be a follow-up.
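
A possible shape for such a method (hypothetical follow-up, keyed by shard id and allocation id since there is no ShardRouting at hand):

```java
// Hypothetical follow-up sketch: record the removal of an allocation id for a shard
// copy that is no longer present in the routing table, so that
// AllocationService.removeStaleIdsWithoutRoutingChanges could reuse the same bookkeeping.
void removeAllocationId(ShardId shardId, String allocationId) {
    changes(shardId).removedAllocationIds.add(allocationId);
}
```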

Member Author

I would prefer to do this in a follow-up.

 }
 routingNodes.failShard(Loggers.getLogger(CancelAllocationCommand.class), shardRouting,
     new UnassignedInfo(UnassignedInfo.Reason.REROUTE_CANCELLED, null), indexMetaData, allocation.changes());
+allocation.removeAllocationId(shardRouting);
Contributor

can you add a TODO here that this could go away in future versions?

 if (failure != null) {
     components.add("failure [" + ExceptionsHelper.detailedMessage(failure) + "]");
 }
+components.add("markAsStale [" + markAsStale + "]");
Contributor

it's a bit odd that shard started now also has this component. Maybe a good time to separate ShardEntry into StartedShardEntry and FailedShardEntry?

Contributor

while doing that, you can also convert that class to use final fields by not implementing readFrom but instead creating a constructor that takes a StreamInput as a parameter.
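
For illustration, a stripped-down version of what such a FailedShardEntry could look like (field set and serialization details are illustrative, not the exact class that ended up in the PR):

```java
import java.io.IOException;

import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

// Sketch of the suggested split: a dedicated FailedShardEntry with final fields,
// deserialized via a StreamInput constructor instead of a mutable readFrom().
// Field set is illustrative; the real class also carries the shard id, failure, etc.
class FailedShardEntry implements Writeable {
    final String allocationId;
    final long primaryTerm;
    final String message;
    final boolean markAsStale;

    FailedShardEntry(StreamInput in) throws IOException {
        allocationId = in.readString();
        primaryTerm = in.readVLong();
        message = in.readString();
        markAsStale = in.readBoolean();
    }

    FailedShardEntry(String allocationId, long primaryTerm, String message, boolean markAsStale) {
        this.allocationId = allocationId;
        this.primaryTerm = primaryTerm;
        this.message = message;
        this.markAsStale = markAsStale;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(allocationId);
        out.writeVLong(primaryTerm);
        out.writeString(message);
        out.writeBoolean(markAsStale);
    }
}
```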

@dnhatn (Member, Author) commented Jan 8, 2018

@ywelsch I've split ShardEntry into started and failed entries; I will make a follow-up for the removeStaleIdsWithoutRouting method. Would you please take another look? Thank you.

@dnhatn dnhatn added v6.3.0 and removed v6.2.0 labels Jan 18, 2018
@dnhatn dnhatn closed this Jan 21, 2018
@dnhatn dnhatn deleted the mark-shard-stale branch January 21, 2018 15:08
@dnhatn dnhatn restored the mark-shard-stale branch January 21, 2018 15:09
@dnhatn dnhatn reopened this Jan 21, 2018
@dnhatn dnhatn requested a review from bleskes January 26, 2018 14:39
@dnhatn (Member, Author) commented Jan 26, 2018

@bleskes, I talked to Yannick; he would like you to have a look at this.

@ywelsch (Contributor) left a comment

I've left two more asks. Please also update the PR description to say that this allows shards to be failed without marking them as stale.

 waitForNewMasterAndRetry(actionName, observer, request, listener, changePredicate);
 } else {
-    logger.warn((Supplier<?>) () -> new ParameterizedMessage("{} unexpected failure while sending request [{}] to [{}] for shard entry [{}]", shardEntry.shardId, actionName, masterNode, shardEntry), exp);
+    logger.warn("unexpected failure while sending request [{}] to [{}] for shard entry [{}]", actionName, masterNode, request);
Contributor

why not log the exception here?
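
For reference, keeping the exception would mean mirroring the ParameterizedMessage pattern from the removed line above, roughly:

```java
// Roughly: pass the exception as the last argument so the stack trace is logged too.
// Message text mirrors the removed line above; the cast resolves the log4j overload.
logger.warn((Supplier<?>) () -> new ParameterizedMessage(
        "unexpected failure while sending request [{}] to [{}] for shard entry [{}]",
        actionName, masterNode, request), exp);
```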

Member Author

Done

Updates updates = changes(shardRouting.shardId());
if (updates.firstFailedPrimary == null) {
    // more than one primary can be failed (because of batching, primary can be failed, replica promoted and then failed...)
    updates.firstFailedPrimary = shardRouting;
Contributor

I think we should leave the "firstFailedPrimary" logic in the shardFailed method for now. I think that this logic can go away later.

Member Author

Yep, I put it back to the shardFailed method.
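
Roughly, that keeps the bookkeeping inside the observer's shardFailed callback, along the lines of the following (sketch based on the snippet quoted above, not the exact upstream code):

```java
// Sketch: keep the firstFailedPrimary tracking inside shardFailed rather than in the
// new staleness path. Based on the snippet quoted above; surrounding code elided.
@Override
public void shardFailed(ShardRouting failedShard, UnassignedInfo unassignedInfo) {
    if (failedShard.active() && failedShard.primary()) {
        Updates updates = changes(failedShard.shardId());
        if (updates.firstFailedPrimary == null) {
            // more than one primary can be failed (because of batching, primary can be
            // failed, replica promoted and then failed...)
            updates.firstFailedPrimary = failedShard;
        }
    }
    // ... rest of the failure handling
}
```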

@dnhatn dnhatn changed the title Add mark as stale option when failing shard Allows failing shards without marking as stale Feb 2, 2018
@dnhatn (Member, Author) commented Feb 2, 2018

Thanks @ywelsch for your helpful review. I pushed another commit to address your last comments.

@dnhatn dnhatn merged commit 965efa5 into elastic:master Feb 3, 2018
@dnhatn dnhatn deleted the mark-shard-stale branch February 3, 2018 14:41
dnhatn added a commit that referenced this pull request Feb 3, 2018
Currently, when failing a shard we also mark it as stale (i.e. remove its
allocationId from the in-sync set). However, in some cases we need
to be able to fail shards but keep them in the in-sync set. This commit adds
such a capability. This is a preparatory change to make the primary-replica
resync less lenient.

Relates #24841
dnhatn added a commit that referenced this pull request Feb 3, 2018
dnhatn added a commit that referenced this pull request Feb 3, 2018
@lcawl lcawl added :Search/Search Search-related issues that do not fall into other categories and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. and removed :Search/Search Search-related issues that do not fall into other categories labels Feb 13, 2018
dnhatn added a commit that referenced this pull request Mar 9, 2018
Today, failures from the primary-replica resync are ignored as a best
effort to not mark shards as stale during a cluster restart. However,
this can be problematic if replicas fail to execute resync operations
but succeed in subsequent write operations. When this happens, the
replica will miss some operations from the new primary. There are
several implications if the local checkpoint on the replica can't
advance because of the missing operations.

1. The global checkpoint won't advance - this causes both the primary and
replicas to keep many index commits

2. The engine on the replica won't flush periodically because uncommitted stats
are calculated based on the local checkpoint

3. The replica can use a large number of bitsets to keep track of operation seqnos

However, we can prevent this issue while preserving the best-effort
behavior by failing replicas that fail to execute resync operations
without marking them as stale. We have prepared the required
infrastructure for this change in #28049 and #28054.

Relates #24841
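
At the allocation level, "fail but don't mark as stale" boils down to running the routing-level failure without touching the in-sync set, roughly as follows (sketch combining the calls shown in this PR's diffs; the markAsStale flag is illustrative):

```java
// Sketch of the caller-side pattern enabled by this PR: routing-level failure always
// runs, but the allocation id is removed from the in-sync set only when requested.
routingNodes.failShard(logger, failedShard, unassignedInfo, indexMetaData, allocation.changes());
if (markAsStale) {
    // only an explicit request demotes the copy from the in-sync set
    allocation.removeAllocationId(failedShard);
}
```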
dnhatn added a commit that referenced this pull request Mar 10, 2018
sebasjm pushed a commit to sebasjm/elasticsearch that referenced this pull request Mar 10, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019