Fix global checkpoints test bug #18755
Conversation
This commit fixes a test bug in a global checkpoints integration test. Namely, if the replica shard is slow to start and is peer recovered from the primary, it will not have the expected global checkpoint, because global checkpoints are neither persisted nor transferred as part of recovery.
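For context, the assertion at issue is a strict Hamcrest check over each shard copy's sequence-number stats. A minimal sketch of that strict form, patterned on the test's assertions rather than quoted from it:

```java
// Inside CheckpointsIT#testCheckpointsAdvance, for each copy's ShardStats.
// Strict expectation: the copy's global checkpoint covers all indexed docs.
// This breaks for a replica that was peer recovered after indexing finished,
// because the global checkpoint is not persisted or transferred on recovery.
assertThat(shardStats.getShardRouting() + " global checkpoint mismatch",
        shardStats.getSeqNoStats().getGlobalCheckpoint(), equalTo(numDocs - 1L));
```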
I've added the logs from a failing test run to a gist. This failure does not reproduce on its own. However, if I apply the following patch, the failure reproduces:

```diff
diff --git a/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java b/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
index bf5d1cb..bdd116e 100644
--- a/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
+++ b/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
@@ -160,6 +160,12 @@ public class RecoveryTargetService extends AbstractComponent implements IndexEve
     private void doRecovery(final RecoveryTarget recoveryTarget) {
         assert recoveryTarget.sourceNode() != null : "can't do a recovery without a source node";
+        try {
+            Thread.sleep(175);
+        } catch (InterruptedException e) {
+            throw new RuntimeException(e);
+        }
+
         logger.trace("collecting local files for {}", recoveryTarget);
         Store.MetadataSnapshot metadataSnapshot = null;
         try {
```
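(The sleep is a blunt way to widen the race window: it delays the start of every peer recovery, making it likely that all documents are indexed before the replica recovers, which is exactly the slow-to-start scenario described above.)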
```diff
             }
             assertThat(shardStats.getShardRouting() + " local checkpoint mismatch",
-                    shardStats.getSeqNoStats().getLocalCheckpoint(), localCheckpointRule);
+                    shardStats.getSeqNoStats().getLocalCheckpoint(), equalTo(numDocs - 1L));
```
I think we still need the old leniency for local checkpoints as well? If recovery completes after all docs were indexed, we will not have a local checkpoint on the replica?
> if recovery completes after all docs were indexed, we will not have a local checkpoint on the replica?
Correct, if there was no translog replay component. In this case, no indexing operations will have been performed on the recovery target and the local checkpoint there will not have advanced.
Note that the logs here are for a case where the recovery completes after the docs were indexed, but there was a translog replay.
I pushed 2545a35.
This commit reverts the removal of the local checkpoint rule in CheckpointsIT#testCheckpointsAdvance. This rule is needed in case a peer recovery that does not result in indexing operations (i.e., there is no translog recovery) is performed. In this case, the local checkpoint will not have advanced.
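A minimal sketch of the restored rule, assuming a `NO_OPS_PERFORMED` sentinel for a checkpoint that never advanced (the constant name here is an assumption for illustration, not quoted from the commit):

```java
// Assumes the test's existing context plus:
//   import static org.hamcrest.Matchers.anyOf;
//   import static org.hamcrest.Matchers.equalTo;
// Lenient local checkpoint rule: the copy either indexed all docs (live
// indexing, or translog replay during recovery), or performed no indexing
// operations at all (file-based peer recovery after indexing completed),
// in which case the checkpoint still sits at its initial sentinel.
final Matcher<Long> localCheckpointRule = anyOf(
        equalTo(numDocs - 1L),
        equalTo(SequenceNumbersService.NO_OPS_PERFORMED)); // assumed sentinel constant
assertThat(shardStats.getShardRouting() + " local checkpoint mismatch",
        shardStats.getSeqNoStats().getLocalCheckpoint(), localCheckpointRule);
```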
```java
final Matcher<Long> globalCheckpointRule;
if (shardStats.getShardRouting().primary()) {
    localCheckpointRule = equalTo(numDocs - 1L);
    globalCheckpointRule = equalTo(numDocs - 1L);
```
What happens when indexing completes, relocation happens, and then (and only then) the global checkpoint updater kicks in? I think we still have an issue; if true, let's just relax the global checkpoint rule for now. I'm good with doing all of that as a separate fix in the interest of getting this in.
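A sketch of that relaxation, assuming an `UNASSIGNED_SEQ_NO` sentinel for a copy that has not yet received a global checkpoint update (illustrative, not the committed change):

```java
final Matcher<Long> globalCheckpointRule;
if (shardStats.getShardRouting().primary()) {
    // The primary drives global checkpoint updates, so hold it to the exact value.
    globalCheckpointRule = equalTo(numDocs - 1L);
} else {
    // A replica, or a copy that only just finished relocating, may not have
    // received a global checkpoint update from the primary yet; accept the
    // unassigned sentinel as well.
    globalCheckpointRule = anyOf(
            equalTo(numDocs - 1L),
            equalTo(SequenceNumbersService.UNASSIGNED_SEQ_NO)); // assumed sentinel constant
}
```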
LGTM