Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -297,12 +297,13 @@ static Translog.Operation[] getOperations(IndexShard indexShard, long globalChec
if (indexShard.state() != IndexShardState.STARTED) {
throw new IndexShardNotStartedException(indexShard.shardId(), indexShard.state());
}
if (fromSeqNo > indexShard.getGlobalCheckpoint()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: so indexShard.getGlobalCheckpoint() may return a lower seqno than acquired from indexShard.seqNoStats().getGlobalCheckpoint()? I always assumed that the seqno acquired from IndexShard could not go backwards.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global checkpoint in the stats object and directly from the index shard are sourced from the same place, the replication tracker. The problem here, as I understand it, is that the global checkpoint could have advanced after capturing the stats. Here is what can happen then:

  • suppose that fromSeqNo is 17
  • suppose that the global checkpoint in the stats instance is 16
  • suppose that the global checkpoint advances to 17 after the stats object is captured
  • the fromSeqNo > indexShard.getGlobalCheckpoint() check will fail (because of the advance), meaning that we skip returning an empty operations response
  • we then calculate toSeqNo = Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1) where globalCheckpoint is from the stats instance; this would give toSeqNo == 16
  • now we have [fromSeqNo, toSeqNo] == [17, 16] which produces the invalid range error message

This all happened because we allowed the global checkpoint advancing to become visible to this logic. Had we reused globalCheckpoint from the stats object then fromSeqNo > globalCheckpoint would have succeeded and we would have returned an empty operations response.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasontedor Thanks for the explanation!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great explanation!

if (fromSeqNo > globalCheckpoint) {
return EMPTY_OPERATIONS_ARRAY;
}
int seenBytes = 0;
// - 1 is needed, because toSeqNo is inclusive
long toSeqNo = Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1);
assert fromSeqNo <= toSeqNo : "invalid range from_seqno[" + fromSeqNo + "] > to_seqno[" + toSeqNo + "]";
final List<Translog.Operation> operations = new ArrayList<>();
try (Translog.Snapshot snapshot = indexShard.newChangesSnapshot("ccr", fromSeqNo, toSeqNo, true)) {
Translog.Operation op;
Expand Down