
Conversation

@dnhatn
Member

@dnhatn dnhatn commented Apr 16, 2019

A peer recovery can get stuck after upgrading from 5.x to 6.x in the following scenario:

  1. Have at least three 5.x data nodes in the cluster
  2. Upgrade one node from 5.x to 6.x
  3. The primary is relocated from a 5.x node to the 6.x node (this can be triggered by the allocation balancer or by users)
  4. Index some documents - these documents have sequence numbers on the primary but not on the replica
  5. Wait for 12 hours (the default translog retention) for the translog to be trimmed
  6. Issue a synced-flush
  7. Upgrade the node with replica to 6.x
  8. The primary gets stuck waiting for the local checkpoint on the replica to advance:
       org.elasticsearch.index.seqno.ReplicationTracker.markAllocationIdAsInSync(ReplicationTracker.java:647)
       org.elasticsearch.index.shard.IndexShard.markAllocationIdAsInSync(IndexShard.java:1884)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$12(RecoverySourceHandler.java:499)
       org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$3416/1201078460.run(Unknown Source)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$5(RecoverySourceHandler.java:264)
       org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$3407/1470901957.run(Unknown Source)
       org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
       org.elasticsearch.common.util.CancellableThreads.execute(CancellableThreads.java:86)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.runUnderPrimaryPermit(RecoverySourceHandler.java:242)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.finalizeRecovery(RecoverySourceHandler.java:499)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:228)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104)

As both the primary and the replica have the same syncId, recovery skips copying files. The problem is that the index commit on the replica does not have sequence numbers, so we bootstrap its local checkpoint with -1, although the commit (with the same syncId) on the primary has a higher local checkpoint. The primary then waits for the local checkpoint on the replica to advance, which never happens.
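
To make the root cause concrete, here is a minimal, self-contained sketch (not the actual Elasticsearch code) of how a local checkpoint bootstrapped from Lucene commit user data behaves when the commit predates sequence numbers; the key names mirror SequenceNumbers.LOCAL_CHECKPOINT_KEY and SequenceNumbers.MAX_SEQ_NO, but the helper itself is illustrative:

import java.util.HashMap;
import java.util.Map;

public class SeqNoBootstrapSketch {
    // Commit user-data keys as written by Elasticsearch ("local_checkpoint", "max_seq_no").
    static final String LOCAL_CHECKPOINT_KEY = "local_checkpoint";
    static final String MAX_SEQ_NO = "max_seq_no";
    static final long NO_OPS_PERFORMED = -1L;

    // Illustrative: read the local checkpoint from commit user data, falling back to -1
    // when the commit was written by 5.x and carries no sequence-number metadata.
    static long localCheckpointOf(Map<String, String> commitUserData) {
        final String value = commitUserData.get(LOCAL_CHECKPOINT_KEY);
        return value != null ? Long.parseLong(value) : NO_OPS_PERFORMED;
    }

    public static void main(String[] args) {
        // Primary commit: written on the upgraded 6.x node after indexing, so it has seq_no stats.
        Map<String, String> primaryCommit = new HashMap<>();
        primaryCommit.put(LOCAL_CHECKPOINT_KEY, "42");
        primaryCommit.put(MAX_SEQ_NO, "42");

        // Replica commit: same syncId, but written by a 5.x node, so no seq_no stats at all.
        Map<String, String> replicaCommit = new HashMap<>();

        System.out.println("primary local checkpoint = " + localCheckpointOf(primaryCommit)); // 42
        System.out.println("replica local checkpoint = " + localCheckpointOf(replicaCommit)); // -1
        // With identical syncIds phase 1 is skipped, yet the primary then waits for the
        // replica's local checkpoint to reach 42, which never happens.
    }
}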

I think we should not allow primaries to relocate to a newer node if some replicas are still on old nodes.

@dnhatn dnhatn added >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.7.0 v8.0.0 v7.2.0 labels Apr 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@henningandersen henningandersen left a comment

Thanks @dnhatn, I left an inline comment that I need your input on.

}
}

boolean hasSameSyncId(Version indexCreatedVersion, Store.MetadataSnapshot source, Store.MetadataSnapshot target) {
Contributor

Is there a chance that this could mean more file-based recoveries during rolling upgrades? I think it could be reasonable to upgrade in this manner:

  1. Stop indexing
  2. Do synced flush
  3. Upgrade one node
  4. Wait for green and rebalancing to complete.
  5. Resume indexing until queue is empty (queue is some external queue of messages to index)
  6. Start from 1 again, upgrading the next node.

I may be mistaken, but I think this could lead to many file-based recoveries that would have been skipped due to an identical sync-id without this change? Let me know your thoughts on this.

Would it be an option to instead include localCheckpoint of last safe commit in the prepareForTranslogOperations message and store/validate this on the target node? Only if isSequenceNumberBasedRecovery==false though, since then we know that we must have an identical last safe-commit on target too.
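
For illustration only, a hypothetical shape for this alternative could attach the source's last-safe-commit local checkpoint to the prepare-translog step and have the target reject a mismatch; the request and handler types below are invented for the sketch and do not reflect the real prepareForTranslogOperations signature:

// Hypothetical sketch of the alternative described above: ship the localCheckpoint of the
// source's last safe commit alongside the "prepare for translog operations" step and let
// the target verify it. All class and method names here are invented for illustration.
final class PrepareForTranslogRequestSketch {
    final int totalTranslogOps;
    final long sourceSafeCommitLocalCheckpoint; // only meaningful when isSequenceNumberBasedRecovery == false

    PrepareForTranslogRequestSketch(int totalTranslogOps, long sourceSafeCommitLocalCheckpoint) {
        this.totalTranslogOps = totalTranslogOps;
        this.sourceSafeCommitLocalCheckpoint = sourceSafeCommitLocalCheckpoint;
    }
}

final class RecoveryTargetSketch {
    private final long localSafeCommitCheckpoint; // local checkpoint of the target's last safe commit

    RecoveryTargetSketch(long localSafeCommitCheckpoint) {
        this.localSafeCommitCheckpoint = localSafeCommitCheckpoint;
    }

    // The idea: when recovery is not sequence-number based, the target's safe commit must be
    // identical to the source's, so their local checkpoints must agree; fail fast otherwise.
    void prepareForTranslogOperations(PrepareForTranslogRequestSketch request) {
        if (localSafeCommitCheckpoint != request.sourceSafeCommitLocalCheckpoint) {
            throw new IllegalStateException("safe-commit local checkpoint mismatch: source ["
                + request.sourceSafeCommitLocalCheckpoint + "] vs target [" + localSafeCommitCheckpoint + "]");
        }
        // ... otherwise proceed with translog replay
    }
}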

Member Author

If the commit on the replica does not have sequence numbers yet, its local checkpoint and max_seq_no are -1; thus recovery still utilizes the syncId if we haven't processed any operations with sequence numbers.

Would it be an option to instead include localCheckpoint of last safe commit in the prepareForTranslogOperations message and store/validate this on the target node? Only if isSequenceNumberBasedRecovery==false though, since then we know that we must have an identical last safe-commit on target too.

Yes, but handling BWC between 6.7.x and 7.0.0 would not be easy.

Contributor

If the commit on the replica does not have sequence numbers yet, its local checkpoint and max_seq_no are -1; thus recovery still utilizes the syncId if we haven't processed any operations with sequence numbers.

As far as I can see, this is only true if no indexing occurred on the primary, such that it also has local checkpoint and max_seq_no = -1. In step 4, we assume some primary is relocated to a newer-version node (bound to happen at some point during the upgrade), and in step 5 we add more data to the shard(s), such that any primary on upgraded nodes has sequence numbers (but replicas on non-upgraded nodes do not). I may have missed a detail, but I think we cannot then do sync-id based recovery when a node holding a replica is upgraded in this scenario?

Member Author

@dnhatn dnhatn Apr 24, 2019

I think we cannot then do sync-id based recovery when a node holding a replica is upgraded in this scenario.

Yes, this is the intention of this PR.

Contributor

Good observation @henningandersen. I'm not sure how common that upgrade scenario is (I still think we should test for it), but I also can't think of any other solution than the one you proposed. That one has crazy back- and forward-compatibility implications though, as we already released 7.0, and I would rather avoid too much craziness in 6.x, given that we will probably have to maintain that version for quite a while. My current inclination is to rather live with this full file-sync when upgrading from 5.x and proceed with the solution proposed in this PR.

@dnhatn can you add a test for @henningandersen's upgrade scenario?

@jaymode jaymode added v7.0.2 and removed v7.0.1 labels Apr 24, 2019
@dnhatn
Member Author

dnhatn commented Apr 26, 2019

@ywelsch Henning and I discussed this but we felt we need your input here. Can you please take a look? Thank you!

Contributor

@ywelsch ywelsch left a comment

Great find @dnhatn. I've left some comments and my initial thoughts.

ensureGreen(index);
}

public void testRecoveryWithSyncIdVerifySeqNoStats() throws Exception {
Contributor

can you add javadocs describing what situation we want to test here and why, given that the test is very specialized?

logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", recoverySourceSyncId);
} else {
final Version indexVersionCreated = shard.indexSettings().getIndexVersionCreated();
if (hasSameSyncId(indexVersionCreated, recoverySourceMetadata, request.metadataSnapshot()) == false) {
Contributor

this checks more than just the sync IDs, perhaps call this method canSkipPhase1.

assert false : message;
throw new IllegalStateException(message);
}
logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", source.getSyncId());
Contributor

Nit: I would prefer to have this log message at the call site instead of in this hasSameSyncId method. Yes, it means having both a then and an else branch for the if statement, but it's more symmetric.

Member Author

++

}
final String message = "try to recover " + request.shardId() + " with sync id but " +
"seq_no stats are mismatched: [" + source.getCommitUserData() + "] vs [" + target.getCommitUserData() + "]";
assert false : message;
Contributor

Should we forward-port this to newer branches as well? I.e., check that synced flush guarantees that the max seq nos and local checkpoints match?

Member Author

Yeah, we will forward-port without the index version leniency.
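
As a rough sketch of what such a forward-port (without the pre-6.0 leniency) might look like, assuming the commit user-data keys "local_checkpoint" and "max_seq_no"; the method shape is illustrative, not the exact code that was merged:

import java.util.Map;
import java.util.Objects;

final class SyncIdRecoverySketch {
    static final String LOCAL_CHECKPOINT_KEY = "local_checkpoint";
    static final String MAX_SEQ_NO = "max_seq_no";

    // Illustrative check: phase 1 may only be skipped when the sync ids match AND the
    // seq_no stats recorded in the two commits agree; a mismatch with identical sync ids
    // indicates a bug, so recovery is aborted rather than left to hang.
    static boolean canSkipPhase1(String sourceSyncId, Map<String, String> sourceCommitUserData,
                                 String targetSyncId, Map<String, String> targetCommitUserData) {
        if (sourceSyncId == null || sourceSyncId.equals(targetSyncId) == false) {
            return false;
        }
        final boolean seqNoStatsMatch =
            Objects.equals(sourceCommitUserData.get(LOCAL_CHECKPOINT_KEY), targetCommitUserData.get(LOCAL_CHECKPOINT_KEY))
                && Objects.equals(sourceCommitUserData.get(MAX_SEQ_NO), targetCommitUserData.get(MAX_SEQ_NO));
        if (seqNoStatsMatch == false) {
            throw new IllegalStateException("recovering with sync id but seq_no stats are mismatched: ["
                + sourceCommitUserData + "] vs [" + targetCommitUserData + "]");
        }
        return true;
    }
}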

if (indexCreatedVersion.before(Version.V_6_0_0) &&
target.getCommitUserData().containsKey(SequenceNumbers.LOCAL_CHECKPOINT_KEY) == false &&
target.getCommitUserData().containsKey(SequenceNumbers.MAX_SEQ_NO) == false) {
return false;
Contributor

this will need a comment in the code explaining what situation is being addressed here that can no longer occur in 6.0+.


Contributor

@henningandersen henningandersen left a comment

Thanks @dnhatn. Looking good. I added a couple of comments to the tests.

syncedFlush(index);
} else {
ensureGreen(index);
assertNoFileBasedRecovery(index);
Contributor

I think this assertion could fail if a primary was relocated to one of the new nodes in one of the mixed phases? Could we maybe force such a relocation to happen in the first mixed round and then assert that we do file-based recoveries in the second mixed round and here too? We may need to fix replicas to 2 for that to hold.

Also, it would be nice to verify that we have all docs at the end.

Member Author

I disabled the allocation rebalancing in this test and modified testRecoveryWithSyncIdVerifySeqNoStats to cover that scenario.

}

private void syncedFlush(String index) throws Exception {
// We have to spin synced-flush requests here because we fire the global checkpoint sync for the last write operation.
Contributor

I think there is a very small chance of a race condition here. If both the synced flush and the node stop run before the global checkpoint sync, the node will come up with an older safe commit and will revert to file-based recovery. I could be wrong of course; the likelihood seems very small though. I did not find a good way to check whether the global checkpoint sync has started/completed; maybe you have an idea?

Member Author

Well spotted. I have an idea and will try it out.
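
For readers without the test code in front of them, "spinning synced-flush requests" amounts to retrying the synced flush until every shard reports success, because a global checkpoint sync fired after the last write can invalidate the first attempt. A minimal sketch of such a retry loop against the 6.x POST /<index>/_flush/synced REST endpoint follows; it is an assumption-laden stand-in, not the helper from this PR:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class SpinSyncedFlushSketch {
    // Retry the synced flush until no shard copy reports a failure, giving the trailing
    // global-checkpoint sync time to complete before the commit that carries the syncId.
    public static void spinSyncedFlush(HttpClient client, String baseUrl, String index) throws Exception {
        final HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/" + index + "/_flush/synced"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        for (int attempt = 0; attempt < 30; attempt++) {
            final HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Crude check that is good enough for a sketch: a fully successful synced flush
            // returns 200 and reports "failed":0 in its _shards summary.
            if (response.statusCode() == 200 && response.body().contains("\"failed\":0")) {
                return;
            }
            Thread.sleep(1000); // wait and retry
        }
        throw new AssertionError("synced flush did not succeed on all shards for index [" + index + "]");
    }
}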

@dnhatn
Member Author

dnhatn commented May 3, 2019

@henningandersen @ywelsch I have addressed your comments. Can you please take another look? Thank you!

@dnhatn dnhatn requested review from henningandersen and ywelsch May 3, 2019 17:05
@dnhatn dnhatn changed the base branch from 6.7 to 6.8 May 3, 2019 17:08
Contributor

@henningandersen henningandersen left a comment

Looking good, thanks for the additional work on this, dnhatn. I have a single question/comment.

} else {
// If we are upgrading from 5.x and there're some documents with sequence numbers, then we must ignore syncId
// and perform file-based recovery for upgraded-node-2; otherwise peer recovery should utilize syncId.
final boolean forcedFileBasedRecovery = UPGRADE_FROM_VERSION.before(Version.V_6_0_0) &&
Contributor

I think we only have forcedFileBasedRecovery=true in the case where the primary was on one of the upgraded nodes in one of the mixed steps. If it is on node-2 that would not happen.

I think we rely on the allocator randomly picking a node for the primary here then? That is probably OK if it is not deterministically choosing the same primary for every run of the test. You probably validated that and it is likely OK, but I wanted to ask to be sure.

Member Author

Yes, here we rely on the allocator to choose the primary randomly. Are you okay with this approach?

@jakelandis jakelandis added v6.8.1 and removed v6.8.0 labels May 19, 2019
@dnhatn dnhatn requested a review from henningandersen May 20, 2019 16:03
Contributor

@ywelsch ywelsch left a comment

LGTM

@dnhatn
Member Author

dnhatn commented May 21, 2019

@henningandersen @ywelsch Thanks so much for your useful input.

@dnhatn dnhatn merged commit c359d67 into elastic:6.8 May 21, 2019
@dnhatn dnhatn deleted the syncid-check-seqno branch May 21, 2019 21:20
dnhatn added a commit that referenced this pull request May 22, 2019
This change verifies and aborts recovery if source and target have the
same syncId but different sequenceId. This commit also adds an upgrade
test to ensure that we always utilize syncId.
dnhatn added a commit that referenced this pull request May 24, 2019
This change verifies and aborts recovery if source and target have the
same syncId but different sequenceId. This commit also adds an upgrade
test to ensure that we always utilize syncId.
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
This change verifies and aborts recovery if source and target have the
same syncId but different sequenceId. This commit also adds an upgrade
test to ensure that we always utilize syncId.
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Sep 17, 2019