Fix global checkpoints test bug #18755
Conversation
This commit fixes a test bug in a global checkpoints integration test. Namely, if the replica shard is slow to start and is peer recovered from the primary, it will not have the expected global checkpoint, because global checkpoints are neither persisted nor transferred as part of recovery.
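For context, the assertion at issue is a strict Hamcrest check over each shard copy's sequence-number stats. A minimal sketch of that strict form, patterned on the test's assertions rather than quoted from it:

```java
// Inside CheckpointsIT#testCheckpointsAdvance, for each copy's ShardStats.
// Strict expectation: the copy's global checkpoint covers all indexed docs.
// This breaks for a replica that was peer recovered after indexing finished,
// because the global checkpoint is not persisted or transferred on recovery.
assertThat(shardStats.getShardRouting() + " global checkpoint mismatch",
        shardStats.getSeqNoStats().getGlobalCheckpoint(), equalTo(numDocs - 1L));
```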
I've added the logs from a failing test run to a gist. This failure does not reproduce on its own. However, if I apply the following patch, the failure reproduces:

```diff
diff --git a/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java b/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
index bf5d1cb..bdd116e 100644
--- a/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
+++ b/core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTargetService.java
@@ -160,6 +160,12 @@ public class RecoveryTargetService extends AbstractComponent implements IndexEve
     private void doRecovery(final RecoveryTarget recoveryTarget) {
         assert recoveryTarget.sourceNode() != null : "can't do a recovery without a source node";
+        try {
+            Thread.sleep(175);
+        } catch (InterruptedException e) {
+            throw new RuntimeException(e);
+        }
+
         logger.trace("collecting local files for {}", recoveryTarget);
         Store.MetadataSnapshot metadataSnapshot = null;
         try {
```
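(The sleep is a blunt way to widen the race window: it delays the start of every peer recovery, making it likely that all documents are indexed before the replica recovers, which is exactly the slow-to-start scenario described above.)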
```diff
             }
             assertThat(shardStats.getShardRouting() + " local checkpoint mismatch",
-                    shardStats.getSeqNoStats().getLocalCheckpoint(), localCheckpointRule);
+                    shardStats.getSeqNoStats().getLocalCheckpoint(), equalTo(numDocs - 1L));
```
I think we still need the old leniency for local checkpoints as well? If recovery completes after all docs were indexed, we will not have a local checkpoint on the replica?
> if recovery completes after all docs were indexed, we will not have a local checkpoint on the replica?
Correct, if there was no translog replay component. In this case, no indexing operations will have been performed on the recovery target and the local checkpoint there will not have advanced.
Note that the logs here are for a case where the recovery completes after the docs were indexed, but there was a translog replay.
I pushed 2545a35.
This commit reverts the removal of the local checkpoint rule in CheckpointsIT#testCheckpointsAdvance. This rule is needed in case a peer recovery that does not result in indexing operations (i.e., there is no translog recovery) is performed. In this case, the local checkpoint will not have advanced.
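A minimal sketch of the restored rule, assuming a `NO_OPS_PERFORMED` sentinel for a checkpoint that never advanced (the constant name here is an assumption for illustration, not quoted from the commit):

```java
// Assumes the test's existing context plus:
//   import static org.hamcrest.Matchers.anyOf;
//   import static org.hamcrest.Matchers.equalTo;
// Lenient local checkpoint rule: the copy either indexed all docs (live
// indexing, or translog replay during recovery), or performed no indexing
// operations at all (file-based peer recovery after indexing completed),
// in which case the checkpoint still sits at its initial sentinel.
final Matcher<Long> localCheckpointRule = anyOf(
        equalTo(numDocs - 1L),
        equalTo(SequenceNumbersService.NO_OPS_PERFORMED)); // assumed sentinel constant
assertThat(shardStats.getShardRouting() + " local checkpoint mismatch",
        shardStats.getSeqNoStats().getLocalCheckpoint(), localCheckpointRule);
```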
```java
final Matcher<Long> globalCheckpointRule;
if (shardStats.getShardRouting().primary()) {
    localCheckpointRule = equalTo(numDocs - 1L);
    globalCheckpointRule = equalTo(numDocs - 1L);
```
What happens when indexing completes, relocation happens, and then (and only then) the global checkpoint updater kicks in? I think we still have an issue; if true, let's just relax the global checkpoint rule for now. I'm good with doing all of that as a separate fix in the interest of getting this in.
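A sketch of that relaxation, assuming an `UNASSIGNED_SEQ_NO` sentinel for a copy that has not yet received a global checkpoint update (illustrative, not the committed change):

```java
final Matcher<Long> globalCheckpointRule;
if (shardStats.getShardRouting().primary()) {
    // The primary drives global checkpoint updates, so hold it to the exact value.
    globalCheckpointRule = equalTo(numDocs - 1L);
} else {
    // A replica, or a copy that only just finished relocating, may not have
    // received a global checkpoint update from the primary yet; accept the
    // unassigned sentinel as well.
    globalCheckpointRule = anyOf(
            equalTo(numDocs - 1L),
            equalTo(SequenceNumbersService.UNASSIGNED_SEQ_NO)); // assumed sentinel constant
}
```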
LGTM