
Safe commit is not safe #28405

@ywelsch

Description


Since #28181, a safe commit (instead of the latest commit) is used during peer recovery, which allows sequence-number based recoveries in more cases. A recent test failure of testAckedIndexing shows that under certain failure scenarios, an unsafe commit can subsequently (mistakenly) be considered a safe commit, resulting in the loss of acknowledged writes:

FAILURE 59.4s J0 | ClusterDisruptionIT.testAckedIndexing <<< FAILURES!
03:26:44    > Throwable #1: java.lang.AssertionError: doc [30] indexed via node [node_t2] not found (checked via node [node_t1]
...

The scenario in the test is as follows:

  • Three nodes (node_t0, node_t1, node_t2). The problematic replication group is [test][1]. The primary is on node_t1, and the replicas on node_t0 and node_t2.
  • Three documents are indexed into the primary, getting sequence numbers 0, 1, and 2, respectively:
[2018-01-14T16:24:47,113][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][8] (seq# [-2])
[2018-01-14T16:24:47,157][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][12] (seq# [-2])
[2018-01-14T16:24:48,155][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][15] (seq# [-2])
  • There's a network disruption between node_t1 and the other nodes, so it cannot replicate the last two writes:
[2018-01-14T16:24:47,151][INFO ][o.e.t.d.NetworkDisruption] start disrupting (disruption type: network unresponsive, disrupted links: two partitions (partition 1: [node_t2, node_t0] and partition 2: [node_t1]))
  • The master (node_t0) fails node_t1, and the replica on node_t2 gets promoted to primary. Cluster state:
...
nodes: 
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}, local
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, master
routing_table (version 10):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], 
...
  • A new operation comes in (that we will later lose), gets indexed into the newly promoted primary, and is successfully replicated to the single remaining replica:
[2018-01-14T16:24:53,749][TRACE][o.e.i.s.IndexShard       ] [node_t2] [test][1] index [type][30] (seq# [-2])
[2018-01-14T16:24:53,754][TRACE][o.e.i.s.IndexShard       ] [node_t0] [test][1] index [type][30] (seq# [1])

Note that the newly promoted primary assigns the sequence number 1 to this operation, as it only got operation 0 from the previous primary.
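This is where the histories diverge: the old primary still holds unacknowledged operations at seq# 1 and 2, while the new primary reuses seq# 1 for a different document. A sketch of the divergence (illustrative only; operations are modeled as seq_no → doc_id maps):

```python
# Illustrative model of the divergence between the two primaries.

old_primary_ops = {0: "8", 1: "12", 2: "15"}  # node_t1 before the partition
new_primary_ops = {0: "8"}                    # node_t2 only received seq# 0

# After promotion, node_t2 continues from its local checkpoint (0)
# and assigns seq# 1 to the next acknowledged write, doc 30.
next_seq_no = max(new_primary_ops) + 1
new_primary_ops[next_seq_no] = "30"

# The same sequence number now refers to two different operations.
assert old_primary_ops[1] != new_primary_ops[1]
```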

  • The previously failed node node_t1 rejoins the cluster, and a replica is assigned to this node. It does a sequence-number based recovery from the new primary on node_t2 (which in the meantime has received more operations):
[2018-01-14T16:25:00,816][DEBUG][o.e.i.t.Translog         ] [node_t1] [test][1] open uncommitted translog checkpoint Checkpoint{offset=43, numOps=0, generation=2, minSeqNo=-1, maxSeqNo=-1, globalCheckpoint=0, minTranslogGeneration=1}
[2018-01-14T16:25:00,816][DEBUG][o.e.i.t.Translog         ] [node_t1] [test][1] recovered local translog from checkpoint Checkpoint{offset=43, numOps=0, generation=2, minSeqNo=-1, maxSeqNo=-1, globalCheckpoint=0, minTranslogGeneration=1}
[2018-01-14T16:25:00,821][TRACE][o.e.i.r.RecoverySourceHandler] [node_t2] [test][1][recover to node_t1] recovery [phase2]: sending transaction log operations (seq# from [0], required [0:3]
[2018-01-14T16:25:00,822][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][30] (seq# [1])
[2018-01-14T16:25:00,824][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][14] (seq# [2])
[2018-01-14T16:25:00,824][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][35] (seq# [3])
[2018-01-14T16:25:00,824][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][8] (seq# [0])
  • From the above logs, we can see that node_t1 had a global checkpoint of 0, and from the previous logs, we also know that it might have had an unsafe IndexCommit C_problematic with local checkpoint = max sequence number = 2. The safe index commit that it used for recovery here was the initial empty commit C_init, as it requested all operations from 0 onwards.
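The "requested all operations from 0 onwards" part follows from how a sequence-number based recovery picks its starting point: everything up to and including the safe commit's local checkpoint is already in the commit, so replay starts just above it. A hedged sketch (the helper name is made up for illustration):

```python
# Sketch (illustrative): the first operation a sequence-number based
# recovery asks for, given the recovering replica's chosen safe commit.

def starting_seq_no(safe_commit_local_checkpoint):
    # Ops up to and including the local checkpoint are contained in
    # the commit; replay starts at the next sequence number.
    return safe_commit_local_checkpoint + 1

# The empty initial commit C_init has local checkpoint -1, so
# node_t1 asks for everything from seq# 0 onwards (as in the logs):
assert starting_seq_no(-1) == 0

# Had C_problematic (local checkpoint 2) been chosen instead,
# replay would have started at seq# 3:
assert starting_seq_no(2) == 3
```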
  • The recovery completes, all is good:
nodes: 
    {node_t1}{ArcKsewYTwm7IlIhKZVWtQ}{diIIW2TmROeYCHx484oJkg}{127.0.0.1}{127.0.0.1:30110}
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, local, master
routing_table (version 13):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[ArcKsewYTwm7IlIhKZVWtQ], [R], s[STARTED], a[id=9eD9fk2gQ6qdMwq5twqJ8w]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
  • Another failure of node_t1 happens:
nodes: 
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, local, master
routing_table (version 16):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2018-01-15T03:25:02.784Z], delayed=true, details[node_left[ArcKsewYTwm7IlIhKZVWtQ]], allocation_status[no_attempt]]
  • And again, node_t1 rejoins and is assigned a replica:
nodes: 
    {node_t1}{ArcKsewYTwm7IlIhKZVWtQ}{diIIW2TmROeYCHx484oJkg}{127.0.0.1}{127.0.0.1:30110}
    {node_t2}{SXrylrbnSOqpFvbIzKORgg}{F3Xy1h1NRSObstz-8vueVg}{127.0.0.1}{127.0.0.1:30111}, local
    {node_t0}{MTKu7d8xS6CnqxssfDgQnQ}{9whlsxtSRVSd4Mg1kN4YaA}{127.0.0.1}{127.0.0.1:30109}, master
routing_table (version 20):
--------[test][1], node[SXrylrbnSOqpFvbIzKORgg], [P], s[STARTED], a[id=zWo3jzSXS7agaUBIroIMbQ]
--------[test][1], node[ArcKsewYTwm7IlIhKZVWtQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=iacEaYncREiChRxXc7HFEA], unassigned_info[[reason=NODE_LEFT], at[2018-01-15T03:25:02.784Z], delayed=true, details[node_left[ArcKsewYTwm7IlIhKZVWtQ]], allocation_status[no_attempt]], expected_shard_size[7272]
--------[test][1], node[MTKu7d8xS6CnqxssfDgQnQ], [R], s[STARTED], a[id=5-xVWGFuTXmWgg9nzGoBdw]
  • This is where it gets interesting. The global checkpoint that's available in the translog on the replica (before recovering) is 3 at this point:
[2018-01-14T16:25:08,936][DEBUG][o.e.i.t.Translog         ] [node_t1] [test][1] open uncommitted translog checkpoint Checkpoint{offset=266, numOps=4, generation=3, minSeqNo=0, maxSeqNo=3, globalCheckpoint=3, minTranslogGeneration=1}
  • The replica selects a safe commit for peer recovery, which has a local checkpoint that's below the current global checkpoint. In particular, C_problematic now qualifies as such a commit, even though the operations it contains are invalid. There is no log line that would directly show that C_problematic was chosen (we don't log that decision, unfortunately), but the sequence-number based recovery hints at the fact that a "safe" commit with local checkpoint <= max sequence number <= 2 was chosen:
[2018-01-14T16:25:08,944][TRACE][o.e.i.r.RecoverySourceHandler] [node_t2] [test][1][recover to node_t1] recovery [phase2]: sending transaction log operations (seq# from [3], required [3:4]
[2018-01-14T16:25:08,945][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][35] (seq# [3])
[2018-01-14T16:25:08,946][TRACE][o.e.i.s.IndexShard       ] [node_t1] [test][1] index [type][56] (seq# [4])
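The selection step above can be sketched as picking the newest index commit whose max sequence number is at or below the persisted global checkpoint (a simplified model of the real deletion-policy logic; names are illustrative). With the global checkpoint now at 3, C_problematic wrongly passes the check, so recovery replays only from seq# 3 and never corrects the stale ops at seq# 1 and 2:

```python
# Simplified model of safe-commit selection: the newest commit whose
# max seq# is <= the global checkpoint (illustrative, not ES code).

def select_safe_commit(commits, global_checkpoint):
    # commits: list of (name, max_seq_no) pairs, oldest first
    candidates = [c for c in commits if c[1] <= global_checkpoint]
    return candidates[-1] if candidates else None

commits = [("C_init", -1), ("C_problematic", 2)]

# First recovery: global checkpoint 0, only the empty commit is safe.
assert select_safe_commit(commits, 0) == ("C_init", -1)

# Second recovery: global checkpoint 3 makes C_problematic look safe,
# even though it contains the divergent ops at seq# 1 and 2.
assert select_safe_commit(commits, 3) == ("C_problematic", 2)
```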
  • The recovery successfully completes, but the replica on node_t1 is now out-of-sync with the primary on node_t2, which results in the test failure:
FAILURE 59.4s J0 | ClusterDisruptionIT.testAckedIndexing <<< FAILURES!
03:26:44    > Throwable #1: java.lang.AssertionError: doc [30] indexed via node [node_t2] not found (checked via node [node_t1]
...
