
Safe commit is not safe #28405

@ywelsch

Description


Since #28181, a safe commit (instead of the latest commit) is used during peer recovery, which allows sequence-number based recoveries in more cases. A recent test failure of testAckedIndexing shows that under certain failure scenarios, an unsafe commit can subsequently (mistakenly) be considered a safe commit, resulting in the loss of acknowledged writes:

FAILURE 59.4s J0 | ClusterDisruptionIT.testAckedIndexing <<< FAILURES!
03:26:44    > Throwable #1: java.lang.AssertionError: doc [30] indexed via node [node_t2] not found (checked via node [node_t1]
...

The scenario in the test is as follows:

  • Three nodes (node_t0, node_t1, node_t2). The problematic replication group is [test][1]. The primary is on node_t1, and the replicas on node_t0 and node_t2.
  • Three documents are indexed into the primary, getting sequence numbers 0, 1, and 2, respectively:
[2018-01-14T16:24:47,113][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][8] (seq# [-2])
[2018-01-14T16:24:47,157][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][12] (seq# [-2])
[2018-01-14T16:24:48,155][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][15] (seq# [-2])
  • There's a network disruption between node_t1 and the other nodes, so it cannot replicate the last two writes:
[2018-01-14T16:24:47,151][INFO ][o.e.t.d.NetworkDisruption] start disrupting (disruption type: network unresponsive, disrupted links: two partitions (partition 1: [node_t2, node_t0] and partition 2: [node_t1]))
  • The master (node_t0) fails node_t1, and the replica on node_t2 gets promoted to primary. Cluster state:
...
nodes: 
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}, local
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, master
routing_table (version 10):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], 
...
  • A new operation comes in (that we will later lose), gets indexed into the newly promoted primary, and is successfully replicated to the single remaining replica:
[2018-01-14T16:24:53,749][TRACE][o.e.i.s.IndexShard       ] [node_t2] [test][1] index [type][30] (seq# [-2])
[2018-01-14T16:24:53,754][TRACE][o.e.i.s.IndexShard       ] [node_t0] [test][1] index [type][30] (seq# [1])

Note that the newly promoted primary assigns the sequence number 1 to this operation, as it only got operation 0 from the previous primary.
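This is where the histories diverge: the old primary still holds unacknowledged operations at seq# 1 and 2, while the new primary reuses seq# 1 for a different document. A sketch of the divergence (illustrative only; operations are modeled as seq_no → doc_id maps):

```python
# Illustrative model of the divergence between the two primaries.

old_primary_ops = {0: "8", 1: "12", 2: "15"}  # node_t1 before the partition
new_primary_ops = {0: "8"}                    # node_t2 only received seq# 0

# After promotion, node_t2 continues from its local checkpoint (0)
# and assigns seq# 1 to the next acknowledged write, doc 30.
next_seq_no = max(new_primary_ops) + 1
new_primary_ops[next_seq_no] = "30"

# The same sequence number now refers to two different operations.
assert old_primary_ops[1] != new_primary_ops[1]
```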

  • The previously failed node node_t1 rejoins the cluster, and a replica is assigned to this node. It does a sequence-number based recovery from the new primary on node_t2 (which in the meantime has received more operations):
[2018-01-14T16:25:00,816][DEBUG][o.e.i.t.Translog         ] [node_t1] [test][1] open uncommitted translog checkpoint Checkpoint{offset=43, numOps=0, generation=2, minSeqNo=-1, maxSeqNo=-1, globalCheckpoint=0, minTranslogGeneration=1}
[2018-01-14T16:25:00,816][DEBUG][o.e.i.t.Translog         ] [node_t1] [test][1] recovered local translog from checkpoint Checkpoint{offset=43, numOps=0, generation=2, minSeqNo=-1, maxSeqNo=-1, globalCheckpoint=0, minTranslogGeneration=1}
[2018-01-14T16:25:00,821][TRACE][o.e.i.r.RecoverySourceHandler] [node_t2] [test][1][recover to node_t1] recovery [phase2]: sending transaction log operations (seq# from [0], required [0:3]
[2018-01-14T16:25:00,822][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][30] (seq# [1])
[2018-01-14T16:25:00,824][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][14] (seq# [2])
[2018-01-14T16:25:00,824][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][35] (seq# [3])
[2018-01-14T16:25:00,824][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][8] (seq# [0])
  • From the above logs, we can see that node_t1 had a global checkpoint of 0, and from the previous logs, we also know that it might have had an unsafe IndexCommit C_problematic with local checkpoint = max sequence number = 2. The safe index commit that it used for recovery here was the initial empty commit C_init, as it requested all operations from 0 onwards.
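The "requested all operations from 0 onwards" part follows from how a sequence-number based recovery picks its starting point: everything up to and including the safe commit's local checkpoint is already in the commit, so replay starts just above it. A hedged sketch (the helper name is made up for illustration):

```python
# Sketch (illustrative): the first operation a sequence-number based
# recovery asks for, given the recovering replica's chosen safe commit.

def starting_seq_no(safe_commit_local_checkpoint):
    # Ops up to and including the local checkpoint are contained in
    # the commit; replay starts at the next sequence number.
    return safe_commit_local_checkpoint + 1

# The empty initial commit C_init has local checkpoint -1, so
# node_t1 asks for everything from seq# 0 onwards (as in the logs):
assert starting_seq_no(-1) == 0

# Had C_problematic (local checkpoint 2) been chosen instead,
# replay would have started at seq# 3:
assert starting_seq_no(2) == 3
```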
  • The recovery completes, all is good:
nodes: 
    {node_t1}{ArcKsewYTwm7IlIhKZVWtQ}{diIIW2TmROeYCHx484oJkg}{127.0.0.1}{127.0.0.1:30110}
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, local, master
routing_table (version 13):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[ArcKsewYTwm7IlIhKZVWtQ], [R], s[STARTED], a[id=9eD9fk2gQ6qdMwq5twqJ8w]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
  • Another failure of node_t1 happens:
nodes: 
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, local, master
routing_table (version 16):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2018-01-15T03:25:02.784Z], delayed=true, details[node_left[ArcKsewYTwm7IlIhKZVWtQ]], allocation_status[no_attempt]]
  • And again, node_t1 rejoins and is assigned a replica:
nodes: 
    {node_t1}{ArcKsewYTwm7IlIhKZVWtQ}{diIIW2TmROeYCHx484oJkg}{127.0.0.1}{127.0.0.1:30110}
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}, local
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, master
routing_table (version 20):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[ArcKsewYTwm7IlIhKZVWtQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=iacEaYncREiChRxXc7HFEA], unassigned_info[[reason=NODE_LEFT], at[2018-01-15T03:25:02.784Z], delayed=true, details[node_left[ArcKsewYTwm7IlIhKZVWtQ]], allocation_status[no_attempt]], expected_shard_size[7272]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
  • This is where it gets interesting. The global checkpoint that's available in the translog on the replica (before recovering) is 3 at this point:
[2018-01-14T16:25:08,936][DEBUG][o.e.i.t.Translog         ] [node_t1] [test][1] open uncommitted translog checkpoint Checkpoint{offset=266, numOps=4, generation=3, minSeqNo=0, maxSeqNo=3, globalCheckpoint=3, minTranslogGeneration=1}
  • The replica selects a safe commit for peer recovery, which has a local checkpoint that's below the current global checkpoint. In particular, C_problematic now qualifies as such a commit, even though the operations it contains are invalid. There is no log line that would directly show that C_problematic was chosen (we don't log that decision, unfortunately), but the sequence-number based recovery hints at the fact that a "safe" commit with local checkpoint <= max sequence number <= 2 was chosen:
[2018-01-14T16:25:08,944][TRACE][o.e.i.r.RecoverySourceHandler] [node_t2] [test][1][recover to node_t1] recovery [phase2]: sending transaction log operations (seq# from [3], required [3:4]
[2018-01-14T16:25:08,945][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][35] (seq# [3])
[2018-01-14T16:25:08,946][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][56] (seq# [4])
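The selection step above can be sketched as picking the newest index commit whose max sequence number is at or below the persisted global checkpoint (a simplified model of the real deletion-policy logic; names are illustrative). With the global checkpoint now at 3, C_problematic wrongly passes the check, so recovery replays only from seq# 3 and never corrects the stale ops at seq# 1 and 2:

```python
# Simplified model of safe-commit selection: the newest commit whose
# max seq# is <= the global checkpoint (illustrative, not ES code).

def select_safe_commit(commits, global_checkpoint):
    # commits: list of (name, max_seq_no) pairs, oldest first
    candidates = [c for c in commits if c[1] <= global_checkpoint]
    return candidates[-1] if candidates else None

commits = [("C_init", -1), ("C_problematic", 2)]

# First recovery: global checkpoint 0, only the empty commit is safe.
assert select_safe_commit(commits, 0) == ("C_init", -1)

# Second recovery: global checkpoint 3 makes C_problematic look safe,
# even though it contains the divergent ops at seq# 1 and 2.
assert select_safe_commit(commits, 3) == ("C_problematic", 2)
```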
  • The recovery successfully completes, but the replica on node_t1 is now out-of-sync with the primary on node_t2, which results in the test failure:
FAILURE 59.4s J0 | ClusterDisruptionIT.testAckedIndexing <<< FAILURES!
03:26:44    > Throwable #1: java.lang.AssertionError: doc [30] indexed via node [node_t2] not found (checked via node [node_t1]
...
