Skip to content

Conversation

@jasontedor
Copy link
Member

This commit fixes a test bug in a global checkpoints integration
test. Namely, if the replica shard is slow to start and is peer
recovered from the primary, it will not have the expected global
checkpoint due to these not being persisted and transferred on recovery.

This commit fixes a test bug in a global checkpoints integration
test. Namely, if the replica shard is slow to start and is peer
recovered from the primary, it will not have the expected global
checkpoint due to these not being persisted and transferred on recovery.
@jasontedor jasontedor added >test Issues or PRs that are addressing/adding tests review v5.0.0-alpha4 labels Jun 6, 2016
@jasontedor
Copy link
Member Author

jasontedor commented Jun 6, 2016

I've added the logs from a failing test run to gist. This failure does not reproduce. However, if I patch RecoveryTargetService with the following patch to artificially delay peer recovery, the failure reliably reproduces:

diff --git a/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java b/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
index bf5d1cb..bdd116e 100644
--- a/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
+++ b/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
@@ -160,6 +160,12 @@ public class RecoveryTargetService extends AbstractComponent implements IndexEve
     private void doRecovery(final RecoveryTarget recoveryTarget) {
         assert recoveryTarget.sourceNode() != null : "can't do a recovery without a source node";

+        try {
+            Thread.sleep(175);
+        } catch (InterruptedException e) {
+            throw new RuntimeException(e);
+        }
+
         logger.trace("collecting local files for {}", recoveryTarget);
         Store.MetadataSnapshot metadataSnapshot = null;
         try {

}
assertThat(shardStats.getShardRouting() + " local checkpoint mismatch",
shardStats.getSeqNoStats().getLocalCheckpoint(), localCheckpointRule);
shardStats.getSeqNoStats().getLocalCheckpoint(), equalTo(numDocs - 1L));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need the old leniency for localCheckpoints as well? if recovery completes after all docs were indexed, we will not have a local checkpoint on the replica?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if recovery completes after all docs were indexed, we will not have a local checkpoint on the replica?

Correct, if there was no translog replay component. In this case, no indexing operations will have been performed on the recovery target and the local checkpoint there will not have advanced.

Note that the logs here are for a case where the recovery completes after the docs were indexed, but there was a translog replay.

I pushed 2545a35.

This commit reverts the removal of the local checkpoint rule in
CheckpointsIT#testCheckpointsAdvance. This rule is needed in case a peer
recovery that does not result in indexing operations (i.e., there is no
translog recovery) is performed. In this case, the local sequence number
will not have advanced.
final Matcher<Long> globalCheckpointRule;
if (shardStats.getShardRouting().primary()) {
localCheckpointRule = equalTo(numDocs - 1L);
globalCheckpointRule = equalTo(numDocs - 1L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens when indexing completes, relocation happens and then (and only then) the global updater kicks in. I think we still have an issue, if true, let's just relax global checkpoint for now. I'm good with doing all of that as a separate fix in the interest of getting this in.

@bleskes
Copy link
Contributor

bleskes commented Jun 8, 2016

LGTM

@jasontedor jasontedor merged commit 601c056 into elastic:feature/seq_no Jun 8, 2016
@jasontedor jasontedor deleted the checkpoints-advance-test-bug branch June 8, 2016 12:22
@clintongormley clintongormley added :Engine :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >test Issues or PRs that are addressing/adding tests v5.0.0-alpha4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants