Recovery with syncId should verify seqno infos #41265
Conversation
Pinging @elastic/es-distributed
henningandersen left a comment
Thanks @dnhatn, I left an inline comment that I need your input on.
boolean hasSameSyncId(Version indexCreatedVersion, Store.MetadataSnapshot source, Store.MetadataSnapshot target) {
Is there a chance that this could mean more file-based recoveries during rolling upgrades? I think it could be reasonable to upgrade in this manner:
1. Stop indexing.
2. Do a synced flush.
3. Upgrade one node.
4. Wait for green and for rebalancing to complete.
5. Resume indexing until the queue is empty (the queue being some external queue of messages to index).
6. Start from 1 again, upgrading the next node.
I may be mistaken, but couldn't this lead to many file-based recoveries that, without this change, would have been skipped due to identical sync-ids? Let me know your thoughts on this.
Would it be an option to instead include the localCheckpoint of the last safe commit in the prepareForTranslogOperations message and store/validate this on the target node? Only if isSequenceNumberBasedRecovery == false though, since then we know that we must have an identical last safe commit on the target too.
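For illustration, a rough sketch of how such a target-side validation could look. The class and field names below are hypothetical and do not reflect the actual recovery request API; it only shows the idea of shipping the source's safe-commit local checkpoint and checking it on the target.

```java
// Hypothetical sketch only: not the real prepareForTranslogOperations request, just an
// illustration of validating the source's safe-commit local checkpoint on the target.
final class PrepareForTranslogOpsSketch {

    private final long sourceSafeCommitLocalCheckpoint; // hypothetical extra field on the request

    PrepareForTranslogOpsSketch(long sourceSafeCommitLocalCheckpoint) {
        this.sourceSafeCommitLocalCheckpoint = sourceSafeCommitLocalCheckpoint;
    }

    /** Target side: abort instead of silently recovering from a diverged safe commit. */
    void validate(long targetSafeCommitLocalCheckpoint, boolean isSequenceNumberBasedRecovery) {
        if (isSequenceNumberBasedRecovery == false
                && sourceSafeCommitLocalCheckpoint != targetSafeCommitLocalCheckpoint) {
            throw new IllegalStateException("local checkpoint of the last safe commit differs: source ["
                + sourceSafeCommitLocalCheckpoint + "] vs target [" + targetSafeCommitLocalCheckpoint + "]");
        }
    }
}
```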
If the commit on the replica does not have sequence numbers yet, its local checkpoint and max_seq_no are -1; thus recovery still utilizes the syncId if we haven't processed any operations with sequence numbers.
> Would it be an option to instead include the localCheckpoint of the last safe commit in the prepareForTranslogOperations message and store/validate this on the target node? Only if isSequenceNumberBasedRecovery == false though, since then we know that we must have an identical last safe commit on the target too.
Yes, but handling BWC between 6.7.x and 7.0.0 would not be easy.
> If the commit on the replica does not have sequence numbers yet, its local checkpoint and max_seq_no are -1; thus recovery still utilizes the syncId if we haven't processed any operations with sequence numbers.
As far as I can see, this is only true if no indexing occurred on the primary, so that it also has local checkpoint and max_seq_no = -1. In step 4, we assume some primary is relocated to a newer-version node (bound to happen at some point during the upgrade), and in step 5, we add more data to the shard(s), such that any primary on an upgraded node has sequence numbers (but replicas on non-upgraded nodes do not). I may have missed a detail, but I think we cannot then do sync-id based recovery when a node holding a replica is upgraded in this scenario?
> I think we cannot then do sync-id based recovery when a node holding a replica is upgraded in this scenario.
Yes, this is the intention of this PR.
good observation @henningandersen. I'm not sure how common that upgrade scenario is (I still think we should test for it), but I also can't think of any other solution than the one you proposed. That one has crazy back- and forward-compatibility implications though as we already released 7.0, and I would rather like to avoid too much craziness in 6.x, given that we will probably have to maintain that version for quite a while. My current inclination is to rather live with this full file-sync when upgrading from 5.x and proceed with the solution proposed in this PR.
@dnhatn can you add a test for @henningandersen's upgrade scenario?
@ywelsch Henning and I discussed this but we felt we needed your input here. Can you please take a look? Thank you!
ywelsch left a comment
Great find @dnhatn. I've left some comments and my initial thoughts.
public void testRecoveryWithSyncIdVerifySeqNoStats() throws Exception {
Can you add javadocs describing what situation we want to test here and why, given that the test is very specialized?
    logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", recoverySourceSyncId);
} else {
    final Version indexVersionCreated = shard.indexSettings().getIndexVersionCreated();
    if (hasSameSyncId(indexVersionCreated, recoverySourceMetadata, request.metadataSnapshot()) == false) {
This checks more than just the sync IDs; perhaps call this method canSkipPhase1.
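For illustration, the renamed method could keep the same checks along these lines. This is a simplified sketch, not the exact code in this PR, and it omits the pre-6.0 leniency discussed below.

```java
// Simplified sketch; assumes the usual imports (java.util.Objects,
// org.elasticsearch.index.seqno.SequenceNumbers, org.elasticsearch.index.store.Store).
boolean canSkipPhase1(Store.MetadataSnapshot source, Store.MetadataSnapshot target) {
    if (source.getSyncId() == null || source.getSyncId().equals(target.getSyncId()) == false) {
        return false; // no shared sync id, phase1 has to copy files
    }
    // A shared sync id is only trustworthy if the seq_no stats of both commits agree.
    final boolean sameSeqNoStats =
        Objects.equals(source.getCommitUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY),
                       target.getCommitUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY))
        && Objects.equals(source.getCommitUserData().get(SequenceNumbers.MAX_SEQ_NO),
                          target.getCommitUserData().get(SequenceNumbers.MAX_SEQ_NO));
    if (sameSeqNoStats == false) {
        final String message = "try to recover with sync id but seq_no stats are mismatched: ["
            + source.getCommitUserData() + "] vs [" + target.getCommitUserData() + "]";
        assert false : message;
        throw new IllegalStateException(message);
    }
    return true;
}
```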
    assert false : message;
    throw new IllegalStateException(message);
}
logger.trace("skipping [phase1]- identical sync id [{}] found on both source and target", source.getSyncId());
Nit: I would prefer to have this log message at the call site instead of in this hasSameSyncId method. Yes, it means having both a then and an else branch for the if statement, but it's more symmetric.
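For example, the call site could then read roughly like this. This is a sketch of the suggested shape only; the variable names are taken from the diff above, and the else-branch message is illustrative rather than from the PR.

```java
if (canSkipPhase1(recoverySourceMetadata, request.metadataSnapshot())) {
    logger.trace("skipping [phase1] - identical sync id [{}] found on both source and target", recoverySourceSyncId);
} else {
    logger.trace("recovering [phase1] - copying files to the target");
    // ... existing file-based phase1 logic ...
}
```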
++
}
final String message = "try to recover " + request.shardId() + " with sync id but " +
    "seq_no stats are mismatched: [" + source.getCommitUserData() + "] vs [" + target.getCommitUserData() + "]";
assert false : message;
Should we forward-port this to newer branches as well, i.e. check that synced flush guarantees that max seq nos and local checkpoints match?
Yeah, we will forward-port without the index version leniency.
if (indexCreatedVersion.before(Version.V_6_0_0) &&
    target.getCommitUserData().containsKey(SequenceNumbers.LOCAL_CHECKPOINT_KEY) == false &&
    target.getCommitUserData().containsKey(SequenceNumbers.MAX_SEQ_NO) == false) {
    return false;
This will need a comment in the code explaining what situation is being addressed here that can no longer occur in 6.0+.
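For example, the requested comment might read something like the following. This is my reading of the discussion above; the wording is illustrative, not taken from the PR.

```java
if (indexCreatedVersion.before(Version.V_6_0_0) &&
        target.getCommitUserData().containsKey(SequenceNumbers.LOCAL_CHECKPOINT_KEY) == false &&
        target.getCommitUserData().containsKey(SequenceNumbers.MAX_SEQ_NO) == false) {
    // The target's commit was written on a pre-6.0 node and carries no seq_no info, so we cannot
    // verify that its local checkpoint and max_seq_no match the source's. This can happen during
    // a rolling upgrade from 5.x when the primary has already moved to an upgraded node and
    // indexed operations with sequence numbers while this replica has not. Fall back to
    // file-based recovery instead of tripping the mismatch assertion below. Indices created on
    // 6.0+ always write these keys, so this branch cannot be taken for them.
    return false;
}
```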
henningandersen left a comment
Thanks @dnhatn. Looking good. I added a couple of comments to the tests.
    syncedFlush(index);
} else {
    ensureGreen(index);
    assertNoFileBasedRecovery(index);
I think this assertion could fail if a primary was relocated to one of the new nodes in one of the mixed phases. Could we maybe force such a relocation to happen in the first mixed round and then assert that we do file-based recoveries in the second mixed round and here too? We may need to fix replicas to 2 for that to hold.
Also, it would be nice to verify that we have all docs at the end.
I disabled the allocation rebalancing in this test and modified testRecoveryWithSyncIdVerifySeqNoStats to cover that scenario.
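For reference, in a REST-based upgrade test disabling rebalancing could look roughly like the sketch below, using the cluster settings API; the actual change in this PR may do it differently. The Request and client() helpers are the usual low-level REST client / ESRestTestCase utilities.

```java
// Sketch only: turn off shard rebalancing so primaries stay where the test put them.
private void disableRebalancing() throws IOException {
    Request request = new Request("PUT", "/_cluster/settings");
    request.setJsonEntity("{\"persistent\": {\"cluster.routing.rebalance.enable\": \"none\"}}");
    client().performRequest(request);
}
```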
private void syncedFlush(String index) throws Exception {
    // We have to spin synced-flush requests here because we fire the global checkpoint sync for the last write operation.
I think there is a very small chance of a race condition here. If both the synced flush and the node stop run before the global checkpoint sync, the node will come up with an older safe commit and will revert to file-based recovery. I could be wrong of course, and the likelihood seems very small. I did not find a good way to check whether the global checkpoint sync has started/completed; maybe you have an idea?
Well spotted. I have an idea and will try it out.
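For example, one option could be to keep retrying the synced flush until no shard copy reports a failure, roughly along the lines below. This is a sketch assuming the usual ESRestTestCase helpers (assertBusy, entityAsMap, client()) and the low-level REST client's Request/ResponseException; the fix that actually landed may differ.

```java
@SuppressWarnings("unchecked")
private void syncedFlush(String index) throws Exception {
    assertBusy(() -> {
        try {
            Request request = new Request("POST", "/" + index + "/_flush/synced");
            Map<String, Object> result = entityAsMap(client().performRequest(request));
            Map<String, Object> shards = (Map<String, Object>) result.get("_shards");
            // keep retrying until every copy accepted the sync id
            assertThat(shards.get("failed"), equalTo(0));
        } catch (ResponseException e) {
            // a copy that is still syncing its global checkpoint can fail the synced flush;
            // convert to an assertion error so assertBusy retries
            throw new AssertionError(e);
        }
    });
}
```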
@henningandersen @ywelsch I have addressed your comments. Can you please take another look? Thank you!
henningandersen left a comment
Looking good, thanks for the additional work on this, dnhatn. I have a single question/comment.
} else {
    // If we are upgrading from 5.x and there're some documents with sequence numbers, then we must ignore syncId
    // and perform file-based recovery for upgraded-node-2; otherwise peer recovery should utilize syncId.
    final boolean forcedFileBasedRecovery = UPGRADE_FROM_VERSION.before(Version.V_6_0_0) &&
I think we only have forcedFileBasedRecovery=true in the case where the primary was on one of the upgraded nodes in one of the mixed steps. If it is on node-2, that would not happen.
I think we rely on the allocator randomly picking a node for the primary here, then? That is probably OK if it is not deterministically choosing the same primary for every run of the test. You probably validated that and it is likely OK, but I wanted to ask to be sure.
Yes, here we rely on the allocator to choose the primary randomly. Are you okay with this approach?
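If we ever want to make the expectation deterministic instead, the test could look up which node currently holds the primary, for example via _cat/shards. The sketch below is illustrative only and not code from this PR; it assumes the usual ESRestTestCase client() helper and Apache EntityUtils.

```java
// Sketch only: find the node that currently hosts the primary copy of the index.
private String primaryNodeName(String index) throws IOException {
    Request request = new Request("GET", "/_cat/shards/" + index);
    request.addParameter("h", "prirep,node");
    String body = EntityUtils.toString(client().performRequest(request).getEntity());
    for (String line : body.split("\n")) {
        String[] columns = line.trim().split("\\s+");
        if (columns.length >= 2 && "p".equals(columns[0])) {
            return columns[1]; // node name holding the primary copy
        }
    }
    throw new AssertionError("no primary found for [" + index + "] in:\n" + body);
}
```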
ywelsch left a comment
LGTM
@henningandersen @ywelsch Thanks so much for the useful input.
This change verifies and aborts recovery if source and target have the same syncId but different sequenceId. This commit also adds an upgrade test to ensure that we always utilize syncId.
A peer recovery can get stuck after upgrading from 5.x to 6.x in the following scenario:
As both primary and replica have the same syncId, recovery skips copying files. The problem is that the index commit on the replica does not have sequence numbers, so we bootstrap its local checkpoint to -1, although the commit (with the same syncId) on the primary has a higher checkpoint. The primary then waits for the replica's checkpoint to advance, which never happens.
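To make this concrete, here is a small self-contained illustration of the two commits involved. It is plain Java with literal key names, not Elasticsearch code: both copies carry the same sync id, but only the primary's commit has seq_no stats, which is exactly the mismatch this change now detects instead of silently skipping phase1.

```java
import java.util.Map;
import java.util.Objects;

public class SyncIdSeqNoMismatch {
    public static void main(String[] args) {
        // Commit user data as it might look after the 5.x -> 6.x upgrade scenario above.
        Map<String, String> primaryCommit = Map.of(
            "sync_id", "abc123",
            "local_checkpoint", "5",
            "max_seq_no", "5");
        Map<String, String> replicaCommit = Map.of(
            "sync_id", "abc123"); // 5.x commit: no seq_no info, local checkpoint bootstraps to -1

        boolean sameSyncId = Objects.equals(primaryCommit.get("sync_id"), replicaCommit.get("sync_id"));
        boolean sameSeqNoStats =
            Objects.equals(primaryCommit.get("local_checkpoint"), replicaCommit.get("local_checkpoint"))
            && Objects.equals(primaryCommit.get("max_seq_no"), replicaCommit.get("max_seq_no"));

        if (sameSyncId && sameSeqNoStats == false) {
            System.out.println("same sync id but seq_no stats are mismatched: ["
                + primaryCommit + "] vs [" + replicaCommit + "]");
        }
    }
}
```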
I think we should not allow primaries to relocate to a newer node if some replicas are still on old nodes.