CCR: Use single global checkpoint to normalize range #33545

dnhatn · 2018-09-09T03:24:31Z

We may use different global checkpoint values to validate and normalize the range of a change request if the global checkpoint advances between these calls. If this is the case, then we generate an invalid request range and cause the follow task to abort.

  1> Caused by: java.lang.IllegalArgumentException: Invalid range; from_seqno [17], to_seqno [16]
  1>     at org.elasticsearch.index.engine.LuceneChangesSnapshot.<init>(LuceneChangesSnapshot.java:86) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>     at org.elasticsearch.index.engine.InternalEngine.newChangesSnapshot(InternalEngine.java:2421) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>     at org.elasticsearch.index.shard.IndexShard.newChangesSnapshot(IndexShard.java:1673) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>     at org.elasticsearch.xpack.ccr.action.ShardChangesAction.getOperations(ShardChangesAction.java:307) ~[main/:?]

CI:

We may use different global checkpoints to validate/normalize the range of a change request if the global checkpoint is advanced between these calls. If this is the case, then we generate an invalid request range.

elasticmachine · 2018-09-09T03:24:33Z

Pinging @elastic/es-distributed

martijnvg · 2018-09-09T04:42:09Z

x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/ShardChangesAction.java

        if (indexShard.state() != IndexShardState.STARTED) {
            throw new IndexShardNotStartedException(indexShard.shardId(), indexShard.state());
        }
-        if (fromSeqNo > indexShard.getGlobalCheckpoint()) {


Question: so indexShard.getGlobalCheckpoint() may return a lower seqno than acquired from indexShard.seqNoStats().getGlobalCheckpoint()? I always assumed that the seqno acquired from IndexShard could not go backwards.

The global checkpoint in the stats object and directly from the index shard are sourced from the same place, the replication tracker. The problem here, as I understand it, is that the global checkpoint could have advanced after capturing the stats. Here is what can happen then:

1. suppose that fromSeqNo is 17

2. suppose that the global checkpoint in the stats instance is 16

3. suppose that the global checkpoint advances to 17 after the stats object is captured

4. the fromSeqNo > indexShard.getGlobalCheckpoint() check will evaluate to false (because of the advance), meaning that we skip returning an empty operations response

5. we then calculate toSeqNo = Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1) where globalCheckpoint is from the stats instance; this would give toSeqNo == 16

6. now we have [fromSeqNo, toSeqNo] == [17, 16] which produces the invalid range error message

This all happened because we allowed the global checkpoint advancing to become visible to this logic. Had we reused globalCheckpoint from the stats object then fromSeqNo > globalCheckpoint would have succeeded and we would have returned an empty operations response.
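
For readers who want to see the interleaving in isolation, here is a minimal sketch of the scenario described above. It is not the actual ShardChangesAction code: the class RangeSketch, the methods twoReads, singleRead and checkRange, and the LongSupplier standing in for indexShard.getGlobalCheckpoint() are all hypothetical names made up for illustration. Only the comparison, the Math.min expression, and the error message text are taken from this thread.

```java
import java.util.Optional;
import java.util.function.LongSupplier;

// Hypothetical illustration only; not the real ShardChangesAction.
final class RangeSketch {

    // Stand-in for the range check whose failure shows up in the stack trace.
    static void checkRange(long fromSeqNo, long toSeqNo) {
        if (fromSeqNo > toSeqNo) {
            throw new IllegalArgumentException(
                "Invalid range; from_seqno [" + fromSeqNo + "], to_seqno [" + toSeqNo + "]");
        }
    }

    // Racy variant: validation reads the live checkpoint, while normalization
    // uses a value captured earlier (the stats instance in the discussion).
    static Optional<long[]> twoReads(long fromSeqNo, int maxOperationCount,
                                     long capturedCheckpoint, LongSupplier liveCheckpoint) {
        if (fromSeqNo > liveCheckpoint.getAsLong()) {
            return Optional.empty(); // empty operations response
        }
        long toSeqNo = Math.min(capturedCheckpoint, (fromSeqNo + maxOperationCount) - 1);
        checkRange(fromSeqNo, toSeqNo);
        return Optional.of(new long[] {fromSeqNo, toSeqNo});
    }

    // Consistent variant: one captured checkpoint drives both steps.
    static Optional<long[]> singleRead(long fromSeqNo, int maxOperationCount,
                                       long capturedCheckpoint) {
        if (fromSeqNo > capturedCheckpoint) {
            return Optional.empty(); // empty operations response
        }
        long toSeqNo = Math.min(capturedCheckpoint, (fromSeqNo + maxOperationCount) - 1);
        checkRange(fromSeqNo, toSeqNo);
        return Optional.of(new long[] {fromSeqNo, toSeqNo});
    }

    public static void main(String[] args) {
        // Captured checkpoint is 16; the live one has advanced to 17.
        System.out.println("single read returns empty: " + !singleRead(17, 10, 16).isPresent());
        try {
            twoReads(17, 10, 16, () -> 17L);
        } catch (IllegalArgumentException e) {
            System.out.println("two reads: " + e.getMessage());
        }
    }
}
```

With fromSeqNo = 17, a captured checkpoint of 16 and a live checkpoint of 17, the two-read variant reaches the range check with [17, 16], while the single-read variant returns an empty response. The maxOperationCount of 10 is an arbitrary value chosen for the example.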

@jasontedor Thanks for the explanation!

Great explanation!

martijnvg

I think this looks good. I left a question for my understanding.

jasontedor

LGTM.

martijnvg

👍

jasontedor · 2018-09-09T17:18:23Z

I'm going to have merge conflicts with this PR @dnhatn so I am going to merge it now.

dnhatn · 2018-09-09T17:25:41Z

Thanks @jasontedor and @martijnvg.

CCR: Use single global checkpoint to normalize range

949fc68

dnhatn added >feature, v7.0.0, :Distributed Indexing/CCR, v6.5.0 labels Sep 9, 2018

dnhatn requested review from jasontedor and martijnvg September 9, 2018 03:24

martijnvg reviewed Sep 9, 2018

jasontedor approved these changes Sep 9, 2018

martijnvg approved these changes Sep 9, 2018

jasontedor merged commit 902d20c into elastic:master Sep 9, 2018

dnhatn deleted the ccr-consistent-checkpoint branch September 9, 2018 17:25

jasontedor added >bug and removed v6.5.0 v7.0.0 >feature labels Sep 9, 2018

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 9, 2018

Merge branch 'master' into upgrade-settings

ba38010

* master: Remove underscore from auto-follow API (elastic#33550) CCR: Use single global checkpoint to normalize range (elastic#33545)
