Skip to content

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented Sep 9, 2018

We may use different global checkpoint values to validate/normalize the range of a change request if the global checkpoint is advanced between these calls. If this is the case, then we generate an invalid request range and cause the follow task aborted.

  1> Caused by: java.lang.IllegalArgumentException: Invalid range; from_seqno [17], to_seqno [16]
  1>     at org.elasticsearch.index.engine.LuceneChangesSnapshot.<init>(LuceneChangesSnapshot.java:86) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>     at org.elasticsearch.index.engine.InternalEngine.newChangesSnapshot(InternalEngine.java:2421) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>     at org.elasticsearch.index.shard.IndexShard.newChangesSnapshot(IndexShard.java:1673) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1>     at org.elasticsearch.xpack.ccr.action.ShardChangesAction.getOperations(ShardChangesAction.java:307) ~[main/:?]

CI:

We may use different global checkpoints to validate/normalize the range
of a change request if the global checkpoint is advanced between these
calls. If this is the case, then we generate an invalid request range.
@dnhatn dnhatn added >feature v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.5.0 labels Sep 9, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

if (indexShard.state() != IndexShardState.STARTED) {
throw new IndexShardNotStartedException(indexShard.shardId(), indexShard.state());
}
if (fromSeqNo > indexShard.getGlobalCheckpoint()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: so indexShard.getGlobalCheckpoint() may return a lower seqno than acquired from indexShard.seqNoStats().getGlobalCheckpoint()? I always assumed that the seqno acquired from IndexShard could not go backwards.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global checkpoint in the stats object and directly from the index shard are sourced from the same place, the replication tracker. The problem here, as I understand it, is that the global checkpoint could have advanced after capturing the stats. Here is what can happen then:

  • suppose that fromSeqNo is 17
  • suppose that the global checkpoint in the stats instance is 16
  • suppose that the global checkpoint advances to 17 after the stats object is captured
  • the fromSeqNo > indexShard.getGlobalCheckpoint() check will fail (because of the advance), meaning that we skip returning an empty operations response
  • we then calculate toSeqNo = Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1) where globalCheckpoint is from the stats instance; this would give toSeqNo == 16
  • now we have [fromSeqNo, toSeqNo] == [17, 16] which produces the invalid range error message

This all happened because we allowed the global checkpoint advancing to become visible to this logic. Had we reused globalCheckpoint from the stats object then fromSeqNo > globalCheckpoint would have succeeded and we would have returned an empty operations response.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasontedor Thanks for the explanation!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great explanation!

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. I left a question for my understanding.

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jasontedor
Copy link
Member

I'm going to have merge conflicts with this PR @dnhatn so I am going to merge it now.

@jasontedor jasontedor merged commit 902d20c into elastic:master Sep 9, 2018
jasontedor pushed a commit that referenced this pull request Sep 9, 2018
We may use different global checkpoints to validate/normalize the range
of a change request if the global checkpoint is advanced between these
calls. If this is the case, then we generate an invalid request range.
@dnhatn
Copy link
Member Author

dnhatn commented Sep 9, 2018

Thanks @jasontedor and @martijnvg.

@dnhatn dnhatn deleted the ccr-consistent-checkpoint branch September 9, 2018 17:25
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 9, 2018
* master:
  Remove underscore from auto-follow API (elastic#33550)
  CCR: Use single global checkpoint to normalize range (elastic#33545)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants