
Conversation

@dnhatn
Member

@dnhatn dnhatn commented Apr 16, 2019

A peer recovery can get stuck after upgrading from 5.x to 6.x in the following scenario:

  1. Have at least three 5.x data nodes in the cluster
  2. Upgrade one node from 5.x to 6.x
  3. The primary is relocated from a 5.x node to the 6.x node (this can be triggered by the allocation balancer or by users)
  4. Index some documents - these documents have sequence numbers on the primary but not on the replica
  5. Wait for 12 hours (the default translog retention) for the translog to be trimmed
  6. Issue a synced-flush
  7. Upgrade the node with replica to 6.x
  8. The primary gets stuck waiting for the local checkpoint on the replica to advance:
       org.elasticsearch.index.seqno.ReplicationTracker.markAllocationIdAsInSync(ReplicationTracker.java:647)
       org.elasticsearch.index.shard.IndexShard.markAllocationIdAsInSync(IndexShard.java:1884)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$12(RecoverySourceHandler.java:499)
       org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$3416/1201078460.run(Unknown Source)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$5(RecoverySourceHandler.java:264)
       org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$3407/1470901957.run(Unknown Source)
       org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
       org.elasticsearch.common.util.CancellableThreads.execute(CancellableThreads.java:86)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.runUnderPrimaryPermit(RecoverySourceHandler.java:242)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.finalizeRecovery(RecoverySourceHandler.java:499)
       org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:228)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107)
       org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104)

As both the primary and the replica have the same syncId, recovery skips copying files. The problem is that the index commit on the replica does not have sequence numbers, so we bootstrap its local checkpoint with -1, although the commit (with the same syncId) on the primary has a higher local checkpoint. The primary then waits for the local checkpoint on the replica to advance, which never happens.
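
To make the root cause concrete, here is a minimal, self-contained sketch (not the actual Elasticsearch code) of how a local checkpoint bootstrapped from Lucene commit user data behaves when the commit predates sequence numbers; the key names mirror SequenceNumbers.LOCAL_CHECKPOINT_KEY and SequenceNumbers.MAX_SEQ_NO, but the helper itself is illustrative:

import java.util.HashMap;
import java.util.Map;

public class SeqNoBootstrapSketch {
    // Commit user-data keys as written by Elasticsearch ("local_checkpoint", "max_seq_no").
    static final String LOCAL_CHECKPOINT_KEY = "local_checkpoint";
    static final String MAX_SEQ_NO = "max_seq_no";
    static final long NO_OPS_PERFORMED = -1L;

    // Illustrative: read the local checkpoint from commit user data, falling back to -1
    // when the commit was written by 5.x and carries no sequence-number metadata.
    static long localCheckpointOf(Map<String, String> commitUserData) {
        final String value = commitUserData.get(LOCAL_CHECKPOINT_KEY);
        return value != null ? Long.parseLong(value) : NO_OPS_PERFORMED;
    }

    public static void main(String[] args) {
        // Primary commit: written on the upgraded 6.x node after indexing, so it has seq_no stats.
        Map<String, String> primaryCommit = new HashMap<>();
        primaryCommit.put(LOCAL_CHECKPOINT_KEY, "42");
        primaryCommit.put(MAX_SEQ_NO, "42");

        // Replica commit: same syncId, but written by a 5.x node, so no seq_no stats at all.
        Map<String, String> replicaCommit = new HashMap<>();

        System.out.println("primary local checkpoint = " + localCheckpointOf(primaryCommit)); // 42
        System.out.println("replica local checkpoint = " + localCheckpointOf(replicaCommit)); // -1
        // With identical syncIds phase 1 is skipped, yet the primary then waits for the
        // replica's local checkpoint to reach 42, which never happens.
    }
}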

I think we should not allow primaries to relocate to a newer node if some replicas are still on old nodes.

@dnhatn dnhatn added >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.7.0 v8.0.0 v7.2.0 labels Apr 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@henningandersen henningandersen left a comment

Thanks @dnhatn, I left an inline comment that I need your input on.

}
}

boolean hasSameSyncId(Version indexCreatedVersion, Store.MetadataSnapshot source, Store.MetadataSnapshot target) {
Contributor

Is there a chance that this could mean more file-based recoveries during rolling upgrades? I think it could be reasonable to upgrade in this manner:

  1. Stop indexing
  2. Do synced flush
  3. Upgrade one node
  4. Wait for green and rebalancing to complete.
  5. Resume indexing until queue is empty (queue is some external queue of messages to index)
  6. Start from 1 again, upgrading the next node.

I may be mistaken, but I think this could lead to many file-based recoveries that would have been skipped due to an identical sync-id without this change? Let me know your thoughts on this.

Would it be an option to instead include localCheckpoint of last safe commit in the prepareForTranslogOperations message and store/validate this on the target node? Only if isSequenceNumberBasedRecovery==false though, since then we know that we must have an identical last safe-commit on target too.
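
For illustration only, a hypothetical shape for this alternative could attach the source's last-safe-commit local checkpoint to the prepare-translog step and have the target reject a mismatch; the request and handler types below are invented for the sketch and do not reflect the real prepareForTranslogOperations signature:

// Hypothetical sketch of the alternative described above: ship the localCheckpoint of the
// source's last safe commit alongside the "prepare for translog operations" step and let
// the target verify it. All class and method names here are invented for illustration.
final class PrepareForTranslogRequestSketch {
    final int totalTranslogOps;
    final long sourceSafeCommitLocalCheckpoint; // only meaningful when isSequenceNumberBasedRecovery == false

    PrepareForTranslogRequestSketch(int totalTranslogOps, long sourceSafeCommitLocalCheckpoint) {
        this.totalTranslogOps = totalTranslogOps;
        this.sourceSafeCommitLocalCheckpoint = sourceSafeCommitLocalCheckpoint;
    }
}

final class RecoveryTargetSketch {
    private final long localSafeCommitCheckpoint; // local checkpoint of the target's last safe commit

    RecoveryTargetSketch(long localSafeCommitCheckpoint) {
        this.localSafeCommitCheckpoint = localSafeCommitCheckpoint;
    }

    // The idea: when recovery is not sequence-number based, the target's safe commit must be
    // identical to the source's, so their local checkpoints must agree; fail fast otherwise.
    void prepareForTranslogOperations(PrepareForTranslogRequestSketch request) {
        if (localSafeCommitCheckpoint != request.sourceSafeCommitLocalCheckpoint) {
            throw new IllegalStateException("safe-commit local checkpoint mismatch: source ["
                + request.sourceSafeCommitLocalCheckpoint + "] vs target [" + localSafeCommitCheckpoint + "]");
        }
        // ... otherwise proceed with translog replay
    }
}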

Member Author

If the commit on the replica does not have sequence numbers yet, its local checkpoint and max_seq_no are -1; thus recovery still utilizes the syncId if we haven't processed any operations with sequence numbers.

Would it be an option to instead include localCheckpoint of last safe commit in the prepareForTranslogOperations message and store/validate this on the target node? Only if isSequenceNumberBasedRecovery==false though, since then we know that we must have an identical last safe-commit on target too.

Yes, but handling BWC between 6.7.x and 7.0.0 would not be easy.

Contributor

If the commit on the replica does not have sequence numbers yet, its local checkpoint and max_seq_no are -1; thus recovery still utilizes the syncId if we haven't processed any operations with sequence numbers.

As far as I can see, this is only true if no indexing occurred on the primary, such that it also has local checkpoint and max_seq_no = -1. In step 4, we assume some primary is relocated to a newer-version node (bound to happen at some point during the upgrade), and in step 5 we add more data to the shard(s), such that any primary on upgraded nodes has sequence numbers (but replicas on non-upgraded nodes do not). I may have missed a detail, but I think we cannot then do sync-id based recovery when a node holding a replica is upgraded in this scenario?

Member Author

@dnhatn dnhatn Apr 24, 2019

I think we cannot then do sync-id based recovery when a node holding a replica is upgraded in this scenario.

Yes, this is the intention of this PR.

Contributor

Good observation @henningandersen. I'm not sure how common that upgrade scenario is (I still think we should test for it), but I also can't think of any other solution than the one you proposed. That one has crazy back- and forward-compatibility implications though, as we already released 7.0, and I would rather avoid too much craziness in 6.x, given that we will probably have to maintain that version for quite a while. My current inclination is to rather live with this full file-sync when upgrading from 5.x and proceed with the solution proposed in this PR.

@dnhatn can you add a test for @henningandersen's upgrade scenario?

@jaymode jaymode added v7.0.2 and removed v7.0.1 labels Apr 24, 2019
@dnhatn
Member Author

dnhatn commented Apr 26, 2019

@ywelsch Henning and I discussed this but we felt we need your input here. Can you please take a look? Thank you!

Contributor

@ywelsch ywelsch left a comment

Great find @dnhatn. I've left some comments and my initial thoughts.

ensureGreen(index);
}

public void testRecoveryWithSyncIdVerifySeqNoStats() throws Exception {
Contributor

can you add javadocs describing what situation we want to test here and why, given that the test is very specialized?

logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", recoverySourceSyncId);
} else {
final Version indexVersionCreated = shard.indexSettings().getIndexVersionCreated();
if (hasSameSyncId(indexVersionCreated, recoverySourceMetadata, request.metadataSnapshot()) == false) {
Contributor

this checks more than just the sync IDs, perhaps call this method canSkipPhase1.

assert false : message;
throw new IllegalStateException(message);
}
logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", source.getSyncId());
Contributor

Nit: I would prefer to have this log message at the call site instead of in this hasSameSyncId method. Yes, it means having both a then and an else branch for the if statement, but it's more symmetric.

Member Author

++

}
final String message = "try to recover " + request.shardId() + " with sync id but " +
"seq_no stats are mismatched: [" + source.getCommitUserData() + "] vs [" + target.getCommitUserData() + "]";
assert false : message;
Contributor

Should we forward-port this to newer branches as well? I.e., check that synced flush guarantees that the max seq nos and local checkpoints match?

Member Author

Yeah, we will forward-port without the index version leniency.
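
As a rough sketch of what such a forward-port (without the pre-6.0 leniency) might look like, assuming the commit user-data keys "local_checkpoint" and "max_seq_no"; the method shape is illustrative, not the exact code that was merged:

import java.util.Map;
import java.util.Objects;

final class SyncIdRecoverySketch {
    static final String LOCAL_CHECKPOINT_KEY = "local_checkpoint";
    static final String MAX_SEQ_NO = "max_seq_no";

    // Illustrative check: phase 1 may only be skipped when the sync ids match AND the
    // seq_no stats recorded in the two commits agree; a mismatch with identical sync ids
    // indicates a bug, so recovery is aborted rather than left to hang.
    static boolean canSkipPhase1(String sourceSyncId, Map<String, String> sourceCommitUserData,
                                 String targetSyncId, Map<String, String> targetCommitUserData) {
        if (sourceSyncId == null || sourceSyncId.equals(targetSyncId) == false) {
            return false;
        }
        final boolean seqNoStatsMatch =
            Objects.equals(sourceCommitUserData.get(LOCAL_CHECKPOINT_KEY), targetCommitUserData.get(LOCAL_CHECKPOINT_KEY))
                && Objects.equals(sourceCommitUserData.get(MAX_SEQ_NO), targetCommitUserData.get(MAX_SEQ_NO));
        if (seqNoStatsMatch == false) {
            throw new IllegalStateException("recovering with sync id but seq_no stats are mismatched: ["
                + sourceCommitUserData + "] vs [" + targetCommitUserData + "]");
        }
        return true;
    }
}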

if (indexCreatedVersion.before(Version.V_6_0_0) &&
target.getCommitUserData().containsKey(SequenceNumbers.LOCAL_CHECKPOINT_KEY) == false &&
target.getCommitUserData().containsKey(SequenceNumbers.MAX_SEQ_NO) == false) {
return false;
Contributor

this will need a comment in the code explaining what situation is being addressed here that can no longer occur in 6.0+.


Contributor

@henningandersen henningandersen left a comment

Thanks @dnhatn. Looking good. I added a couple of comments to the tests.

syncedFlush(index);
} else {
ensureGreen(index);
assertNoFileBasedRecovery(index);
Contributor

I think this assertion could fail if a primary was relocated to one of the new nodes in one of the mixed phases? Could we maybe force such a relocation to happen in the first mixed round and then assert that we do file-based recoveries in the second mixed round and here too? We may need to fix replicas to 2 for that to hold.

Also, it would be nice to verify that we have all docs at the end.

Member Author

I disabled the allocation rebalancing in this test and modified testRecoveryWithSyncIdVerifySeqNoStats to cover that scenario.

}

private void syncedFlush(String index) throws Exception {
// We have to spin synced-flush requests here because we fire the global checkpoint sync for the last write operation.
Contributor

I think there is a very small chance of a race condition here. If both the synced flush and the node stop run before the global checkpoint sync, the node will come up with an older safe commit and will revert to file-based recovery. I could be wrong of course; the likelihood seems very small though. I did not find a good way to check whether the global checkpoint sync has started/completed; maybe you have an idea?

Member Author

Well spotted. I have an idea and will try it out.
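
For readers without the test code in front of them, "spinning synced-flush requests" amounts to retrying the synced flush until every shard reports success, because a global checkpoint sync fired after the last write can invalidate the first attempt. A minimal sketch of such a retry loop against the 6.x POST /<index>/_flush/synced REST endpoint follows; it is an assumption-laden stand-in, not the helper from this PR:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class SpinSyncedFlushSketch {
    // Retry the synced flush until no shard copy reports a failure, giving the trailing
    // global-checkpoint sync time to complete before the commit that carries the syncId.
    public static void spinSyncedFlush(HttpClient client, String baseUrl, String index) throws Exception {
        final HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/" + index + "/_flush/synced"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        for (int attempt = 0; attempt < 30; attempt++) {
            final HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Crude check that is good enough for a sketch: a fully successful synced flush
            // returns 200 and reports "failed":0 in its _shards summary.
            if (response.statusCode() == 200 && response.body().contains("\"failed\":0")) {
                return;
            }
            Thread.sleep(1000); // wait and retry
        }
        throw new AssertionError("synced flush did not succeed on all shards for index [" + index + "]");
    }
}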

@dnhatn
Member Author

dnhatn commented May 3, 2019

@henningandersen @ywelsch I have addressed your comments. Can you please take another look? Thank you!

@dnhatn dnhatn requested review from henningandersen and ywelsch May 3, 2019 17:05
@dnhatn dnhatn changed the base branch from 6.7 to 6.8 May 3, 2019 17:08
Contributor

@henningandersen henningandersen left a comment

Looking good, thanks for the additional work on this, dnhatn. I have a single question/comment.

} else {
// If we are upgrading from 5.x and there're some documents with sequence numbers, then we must ignore syncId
// and perform file-based recovery for upgraded-node-2; otherwise peer recovery should utilize syncId.
final boolean forcedFileBasedRecovery = UPGRADE_FROM_VERSION.before(Version.V_6_0_0) &&
Contributor

I think we only have forcedFileBasedRecovery=true in the case where the primary was on one of the upgraded nodes in one of the mixed steps. If it is on node-2 that would not happen.

I think we rely on the allocator randomly picking a node for the primary here then? That is probably OK if it is not deterministically choosing the same primary for every run of the test. You probably validated that and it is likely OK, but I wanted to ask to be sure.

Member Author

Yes, here we rely on the allocator to choose the primary randomly. Are you okay with this approach?

@jakelandis jakelandis added v6.8.1 and removed v6.8.0 labels May 19, 2019
@dnhatn dnhatn requested a review from henningandersen May 20, 2019 16:03
Contributor

@ywelsch ywelsch left a comment

LGTM

@dnhatn
Member Author

dnhatn commented May 21, 2019

@henningandersen @ywelsch Thanks so much for your useful input.

@dnhatn dnhatn merged commit c359d67 into elastic:6.8 May 21, 2019
@dnhatn dnhatn deleted the syncid-check-seqno branch May 21, 2019 21:20
dnhatn added a commit that referenced this pull request May 22, 2019
This change verifies and aborts recovery if source and target have the
same syncId but different sequenceId. This commit also adds an upgrade
test to ensure that we always utilize syncId.
dnhatn added a commit that referenced this pull request May 24, 2019
This change verifies and aborts recovery if source and target have the
same syncId but different sequenceId. This commit also adds an upgrade
test to ensure that we always utilize syncId.
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
This change verifies and aborts recovery if source and target have the
same syncId but different sequenceId. This commit also adds an upgrade
test to ensure that we always utilize syncId.
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Sep 17, 2019