CCR: Use single global checkpoint to normalize range #33545
We may use different global checkpoints to validate/normalize the range of a change request if the global checkpoint is advanced between these calls. If this is the case, then we generate an invalid request range.
Pinging @elastic/es-distributed
    if (indexShard.state() != IndexShardState.STARTED) {
        throw new IndexShardNotStartedException(indexShard.shardId(), indexShard.state());
    }
    if (fromSeqNo > indexShard.getGlobalCheckpoint()) {
Question: so indexShard.getGlobalCheckpoint() may return a lower seqno than acquired from indexShard.seqNoStats().getGlobalCheckpoint()? I always assumed that the seqno acquired from IndexShard could not go backwards.
The global checkpoint in the stats object and directly from the index shard are sourced from the same place, the replication tracker. The problem here, as I understand it, is that the global checkpoint could have advanced after capturing the stats. Here is what can happen then:
- suppose that `fromSeqNo` is 17
- suppose that the global checkpoint in the stats instance is 16
- suppose that the global checkpoint advances to 17 after the stats object is captured
- the `fromSeqNo > indexShard.getGlobalCheckpoint()` check will fail (because of the advance), meaning that we skip returning an empty operations response
- we then calculate `toSeqNo = Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1)` where `globalCheckpoint` is from the stats instance; this would give `toSeqNo == 16`
- now we have `[fromSeqNo, toSeqNo] == [17, 16]` which produces the invalid range error message
This all happened because we allowed the advance of the global checkpoint to become visible to this logic. Had we reused `globalCheckpoint` from the stats object, then `fromSeqNo > globalCheckpoint` would have succeeded and we would have returned an empty operations response.
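The scenario above can be reproduced in isolation. The following is only a minimal sketch, not the actual Elasticsearch code: the class `RangeNormalization`, its methods, and the `advancingCheckpoint` helper are hypothetical names, and a `LongSupplier` stands in for the shard's global checkpoint. An empty array is assumed to represent the empty operations response.

```java
import java.util.function.LongSupplier;

class RangeNormalization {

    // Problematic shape: validation and normalization use two different reads.
    static long[] normalizeWithTwoReads(long fromSeqNo, int maxOperationCount, LongSupplier checkpoint) {
        long statsCheckpoint = checkpoint.getAsLong();   // captured first, like the stats object
        if (fromSeqNo > checkpoint.getAsLong()) {        // second read may already be newer
            return new long[0];                          // empty operations response
        }
        long toSeqNo = Math.min(statsCheckpoint, (fromSeqNo + maxOperationCount) - 1);
        return new long[] {fromSeqNo, toSeqNo};
    }

    // Fixed shape: read the checkpoint once and use it for both steps.
    static long[] normalizeWithSingleRead(long fromSeqNo, int maxOperationCount, LongSupplier checkpoint) {
        long globalCheckpoint = checkpoint.getAsLong();
        if (fromSeqNo > globalCheckpoint) {
            return new long[0];                          // empty operations response
        }
        long toSeqNo = Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1);
        return new long[] {fromSeqNo, toSeqNo};
    }

    // Hypothetical helper: returns `first` on the first call and `later` on every call after it,
    // simulating a checkpoint that advances between two reads.
    static LongSupplier advancingCheckpoint(long first, long later) {
        long[] calls = {0};
        return () -> calls[0]++ == 0 ? first : later;
    }

    public static void main(String[] args) {
        long[] twoReads = normalizeWithTwoReads(17, 10, advancingCheckpoint(16, 17));
        System.out.println("two reads: [" + twoReads[0] + ", " + twoReads[1] + "]");

        long[] singleRead = normalizeWithSingleRead(17, 10, advancingCheckpoint(16, 17));
        System.out.println("single read: " + (singleRead.length == 0 ? "empty response" : "non-empty range"));
    }
}
```

With `fromSeqNo` 17 and a checkpoint that moves from 16 to 17 between the reads, the two-read variant yields the invalid range `[17, 16]`, while the single-read variant returns the empty response.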
@jasontedor Thanks for the explanation!
Great explanation!
martijnvg left a comment
I think this looks good. I left a question for my understanding.
jasontedor left a comment
LGTM.
martijnvg left a comment
👍
I'm going to have merge conflicts with this PR @dnhatn so I am going to merge it now.
Thanks @jasontedor and @martijnvg.
* master:
  Remove underscore from auto-follow API (elastic#33550)
  CCR: Use single global checkpoint to normalize range (elastic#33545)
We may use different global checkpoint values to validate/normalize the range of a change request if the global checkpoint is advanced between these calls. If this is the case, then we generate an invalid request range and cause the follow task to abort.
CI: