Allows failing shards without marking as stale #28054

Conversation
Currently, when failing a shard, we also remove its allocation ID from the in-sync set if the unassigned reason is not NODE_LEFT. This commit adds an option to explicitly mark a failing shard as stale or not. This is a preparatory change for the resync PR.
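For context, a minimal sketch of the first iteration of this change, in which the observer callback takes an explicit markAsStale flag (names taken from the diffs below; the surrounding interface is reduced to the relevant method):

```java
// Sketch only: shardFailed gains an explicit markAsStale flag instead of
// deriving staleness from the unassigned reason.
interface RoutingChangesObserver {
    /**
     * Called when a shard fails. If markAsStale is true, the failed shard's
     * allocation ID is also removed from the in-sync set; if false, the shard
     * is failed but its allocation ID stays in the in-sync set.
     */
    void shardFailed(ShardRouting failedShard, UnassignedInfo unassignedInfo, boolean markAsStale);
}
```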
ywelsch left a comment:
There are many places where you can set markAsStale to false (because the shard is not active). I wonder if it's nicer to have a separate removeInSyncId method on the RoutingAllocation class (delegating to IndexMetaDataUpdater) instead of adding this additional parameter to the shardFailed method. There are only two places where we would need to call the removeInSyncId method (at the moment, in CancelAllocationCommand (could go away later) and in AllocationService.applyFailedShards). WDYT?
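A sketch of the suggested shape, using the removeAllocationId name that the later diffs settle on (the comment above calls it removeInSyncId); the wiring of the IndexMetaDataUpdater field is assumed:

```java
// Sketch: RoutingAllocation exposes an explicit "remove from in-sync set"
// operation that delegates to the IndexMetaDataUpdater it already owns,
// so callers of shardFailed no longer need to pass a markAsStale flag.
public class RoutingAllocation {
    private final IndexMetaDataUpdater indexMetaDataUpdater = new IndexMetaDataUpdater();

    /** Removes the allocation ID of the provided shard from the in-sync set. */
    public void removeAllocationId(ShardRouting shardRouting) {
        indexMetaDataUpdater.removeAllocationId(shardRouting);
    }
}
```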
```diff
      remove(routing);
      routingChangesObserver.shardFailed(routing,
-         new UnassignedInfo(UnassignedInfo.Reason.REINITIALIZED, "primary changed"));
+         new UnassignedInfo(UnassignedInfo.Reason.REINITIALIZED, "primary changed"), true);
```
the value here does not matter (as the shard is initializing)
| "primary failed while replica initializing", null, 0, unassignedInfo.getUnassignedTimeInNanos(), | ||
| unassignedInfo.getUnassignedTimeInMillis(), false, AllocationStatus.NO_ATTEMPT); | ||
| failShard(logger, replicaShard, primaryFailedUnassignedInfo, indexMetaData, routingChangesObserver); | ||
| failShard(logger, replicaShard, primaryFailedUnassignedInfo, true, indexMetaData, routingChangesObserver); |
an initializing shard, so marking as stale does not matter.
```diff
  // cancel and remove target shard
  remove(targetShard);
- routingChangesObserver.shardFailed(targetShard, unassignedInfo);
+ routingChangesObserver.shardFailed(targetShard, unassignedInfo, markAsStale);
```
an initializing shard, so marking as stale does not matter.
```diff
      remove(failedShard);
  }
- routingChangesObserver.shardFailed(failedShard, unassignedInfo);
+ routingChangesObserver.shardFailed(failedShard, unassignedInfo, markAsStale);
```
an initializing shard, so marking as stale does not matter.
```diff
      null, 0, allocation.getCurrentNanoTime(), System.currentTimeMillis(), false, UnassignedInfo.AllocationStatus.NO_ATTEMPT);
  // don't cancel shard in the loop as it will cause a ConcurrentModificationException
- shardCancellationActions.add(() -> routingNodes.failShard(logger, shard, unassignedInfo, metaData.getIndexSafe(shard.index()), allocation.changes()));
+ shardCancellationActions.add(() -> routingNodes.failShard(logger, shard, unassignedInfo, true, metaData.getIndexSafe(shard.index()), allocation.changes()));
```
an initializing shard, so marking as stale does not matter.
@ywelsch, I've replaced the markAsStale argument with a separate method. It's nicer than the previous approach. Would you please have another look? Thank you.
ywelsch left a comment:
I like this better than the previous version. I've left a few more suggestions. I wonder how we can better test that we're marking the correct shards as stale so that when we add failShards calls in the future, we're sure to also correctly call markAsStale.
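One possible shape for such a test, assuming the applyFailedShards(state, failedShards, staleShards) overload from this PR and a hypothetical cluster state with index "idx" holding a started replica: fail the replica without listing it as stale, then assert its allocation ID survives in the in-sync set.

```java
// Sketch: failing a shard without a matching stale-shard entry should keep
// its allocation ID in the in-sync set of the index metadata.
ClusterState newState = allocationService.applyFailedShards(state,
    Collections.singletonList(new FailedShard(replica, "resync failed", null)),
    Collections.emptyList()); // no stale shards => allocation ID must stay in-sync
assertThat(newState.metaData().index("idx").inSyncAllocationIds(replica.shardId().id()),
    hasItem(replica.allocationId().getId()));
```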
```diff
   */
- private void removeAllocationId(ShardRouting shardRouting) {
+ void removeAllocationId(ShardRouting shardRouting) {
      changes(shardRouting.shardId()).removedAllocationIds.add(shardRouting.allocationId().getId());
```
We could add another method that removes the allocation ID of a shard that's not in the routing table, and then call that one in AllocationService.removeStaleIdsWithoutRoutingChanges instead of the complex logic there.
If it's not a straightforward change, that could be a follow-up.
I would prefer to do this in a follow-up.
```diff
  }
  routingNodes.failShard(Loggers.getLogger(CancelAllocationCommand.class), shardRouting,
      new UnassignedInfo(UnassignedInfo.Reason.REROUTE_CANCELLED, null), indexMetaData, allocation.changes());
+ allocation.removeAllocationId(shardRouting);
```
can you add a TODO here that this could go away in future versions?
```diff
  if (failure != null) {
      components.add("failure [" + ExceptionsHelper.detailedMessage(failure) + "]");
  }
+ components.add("markAsStale [" + markAsStale + "]");
```
it's a bit odd that shard started now also has this component. Maybe a good time to separate ShardEntry into StartedShardEntry and FailedShardEntry?
While doing that, you can also convert that class to use final fields by not implementing readFrom but instead creating a constructor that takes a StreamInput as parameter.
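A sketch of the suggested split, with the field set inferred from the diffs in this PR and the transport plumbing elided; FailedShardEntry carries markAsStale and uses final fields initialized from a StreamInput constructor rather than a mutable readFrom override:

```java
// Sketch: a dedicated FailedShardEntry so that started-shard messages no
// longer carry failure-only state such as markAsStale.
static class FailedShardEntry {
    final ShardId shardId;
    final String allocationId;
    final String message;
    final Exception failure;
    final boolean markAsStale;

    // Final fields are possible because deserialization happens in the
    // constructor rather than in a readFrom override on a mutable instance.
    FailedShardEntry(StreamInput in) throws IOException {
        shardId = ShardId.readShardId(in);
        allocationId = in.readString();
        message = in.readString();
        failure = in.readException();
        markAsStale = in.readBoolean();
    }

    void writeTo(StreamOutput out) throws IOException {
        shardId.writeTo(out);
        out.writeString(allocationId);
        out.writeString(message);
        out.writeException(failure);
        out.writeBoolean(markAsStale);
    }
}
```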
@ywelsch I've split the ShardEntry into started and failed entries; I will make a follow-up for the method.
# Conflicts:
#	server/src/test/java/org/elasticsearch/cluster/action/shard/ShardStateActionTests.java
@bleskes, I talked to Yannick; he would like you to have a look at this.
ywelsch left a comment:
I've left two more asks. Please also update the PR description to say that this allows shards to be failed without marking them as stale.
```diff
      waitForNewMasterAndRetry(actionName, observer, request, listener, changePredicate);
  } else {
-     logger.warn((Supplier<?>) () -> new ParameterizedMessage("{} unexpected failure while sending request [{}] to [{}] for shard entry [{}]", shardEntry.shardId, actionName, masterNode, shardEntry), exp);
+     logger.warn("unexpected failure while sending request [{}] to [{}] for shard entry [{}]", actionName, masterNode, request);
```
why not log the exception here?
Done
```java
    Updates updates = changes(shardRouting.shardId());
    if (updates.firstFailedPrimary == null) {
        // more than one primary can be failed (because of batching, primary can be failed, replica promoted and then failed...)
        updates.firstFailedPrimary = shardRouting;
```
I think we should leave the "firstFailedPrimary" logic in the shardFailed method for now. I think that this logic can go away later.
Yep, I put it back in the shardFailed method.
Thanks @ywelsch for your helpful review. I pushed another commit to address your last comments.
Currently, when failing a shard, we also mark it as stale (i.e. remove its allocation ID from the in-sync set). However, in some cases we need to be able to fail shards but keep them in the in-sync set. This commit adds that capability. This is a preparatory change to make the primary-replica resync less lenient. Relates #24841
Today, failures from the primary-replica resync are ignored, as a best effort to avoid marking shards as stale during a cluster restart. However, this can be problematic if replicas fail to execute the resync operations but do just fine in subsequent write operations. When this happens, a replica will miss some operations from the new primary. If the local checkpoint on the replica cannot advance because of the missing operations, there are several implications:

1. The global checkpoint won't advance, which causes both the primary and replicas to retain many index commits.
2. The engine on the replica won't flush periodically, because the uncommitted stats are calculated based on the local checkpoint.
3. The replica may use a large number of bitsets to keep track of operation seqnos.

We can prevent this issue while preserving the best-effort behavior by failing replicas that fail to execute resync operations without marking them as stale. We prepared the required infrastructure in #28049 and #28054 for this change. Relates #24841