Conversation

@bleskes bleskes (Contributor) commented Jul 30, 2017

During peer recoveries, we need to copy over lucene files and replay the operations they miss from the source translog. Guaranteeing that translog files are not cleaned up has seen many iterations over time. Back in the old 1.0 days, recoveries went through the Engine and actively prevented both translog cleaning and lucene commits. We then moved to a notion called Translog Views, which allowed the recovery code to "acquire" a view into the translog which was then guaranteed to be kept around until the view was closed. The Engine code was free to commit lucene and do whatever it wanted without coordinating with recoveries. Translog file deletion logic was based on reference counting at the file level. Those counters were incremented when a view was acquired but also when the view was used to create a Snapshot that allowed you to read operations from the files. At some point we removed the file-based counting complexity in favor of constructs on the Translog level that just keep track of "open" views and the minimum translog generation they refer to. To do so, Views had to be kept around until the last snapshot that was made from them was consumed. This was fine in recovery code but led to a subtle bug in the Primary Replica Resyncer (see #25862).

Concurrently, we have developed the notion of a TranslogDeletionPolicy which is responsible for the liveness aspect of translog files. This class makes it very simple to take translog Snapshots into account when deciding which translog files to keep around, allowing people that just need a snapshot to take a snapshot and not worry about views and such. Recovery code, which actually does need a view, can now prevent trimming by acquiring a simple retention lock (a Closeable). This removes the need for the notion of a View.
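
As a rough sketch of the two resulting usage patterns (newSnapshotFromMinSeqNo comes from the actual change; acquireRetentionLock is an illustrative name for the retention-lock accessor, not necessarily the final API):

```java
// 1. Code that only needs to read operations takes a snapshot; the deletion policy
//    keeps the underlying generations alive until the snapshot is closed.
try (Translog.Snapshot snapshot = translog.newSnapshotFromMinSeqNo(startingSeqNo)) {
    Translog.Operation op;
    while ((op = snapshot.next()) != null) {
        // replay op
    }
}

// 2. Recovery, which needs translog files retained across several steps, holds a
//    retention lock (a plain Closeable) instead of acquiring a View.
try (Closeable ignored = translog.acquireRetentionLock()) {
    // files from the minimum required generation onwards are not trimmed
    // until the lock is released
}
```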

@bleskes bleskes added :Distributed Indexing/Recovery, :Translog, >enhancement, v6.0.0, v6.1.0, v7.0.0 labels Jul 30, 2017
@bleskes bleskes requested a review from jasontedor July 30, 2017 11:46
@jasontedor jasontedor (Member) left a comment

In general it looks good. I think the main point of contention is the abstraction for the chained listeners in the primary/replica resync code. I think it's unnecessary and makes the code hard to follow; how it was (and what I propose) is clearer. Also, I left a comment about the InternalEngineTests, although I'm not sure if anything can be done (I leave that to you to figure out). The rest are, I think, minor comments.

};
}

static <Response> ActionListener<Response> chain(ActionListener<Response> first, ActionListener<Response> second) {
jasontedor (Member):

Javadocs please (although I'm unsure if this is really needed, see also my comment at the call site).
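
A rough sketch of what such a chain helper could look like (illustrative only, not necessarily the exact code under review):

```java
/**
 * Returns a listener that forwards the same response or failure to {@code first}
 * and then to {@code second}. Sketch for discussion purposes.
 */
static <Response> ActionListener<Response> chain(ActionListener<Response> first, ActionListener<Response> second) {
    return new ActionListener<Response>() {
        @Override
        public void onResponse(Response response) {
            first.onResponse(response);
            second.onResponse(response);
        }

        @Override
        public void onFailure(Exception e) {
            first.onFailure(e);
            second.onFailure(e);
        }
    };
}
```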

final long startingSeqNo = indexShard.getGlobalCheckpoint() + 1;
-Translog.Snapshot snapshot = view.snapshot(startingSeqNo);
+Translog.Snapshot snapshot = indexShard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo);
+listener = chain(wrap(r -> snapshot.close(), e -> {
jasontedor (Member):

I'm not sure; I recoil when I see this wrapping and wrapping, and I find it harder than necessary to follow. I'd like to avoid adding an abstraction used in exactly one place; we can add it later if we see the same pattern arise a few more times. Right now, I think what is already there is more straightforward and it makes it immediately clear what is going on:

diff --git a/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java b/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java
index 9313176d9c..ce0d36c7cc 100644
--- a/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java
+++ b/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java
@@ -80,17 +80,33 @@ public class PrimaryReplicaSyncer extends AbstractComponent {
         this.chunkSize = chunkSize;
     }
 
-    public void resync(IndexShard indexShard, ActionListener<ResyncTask> listener) {
+    public void resync(IndexShard indexShard, final ActionListener<ResyncTask> listener) {
+        ActionListener<ResyncTask> resyncListener = null;
         try {
             final long startingSeqNo = indexShard.getGlobalCheckpoint() + 1;
             Translog.Snapshot snapshot = indexShard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo);
-            listener = chain(wrap(r -> snapshot.close(), e -> {
-                try {
-                    snapshot.close();
-                } catch (IOException e1) {
-                    e.addSuppressed(e1);
+            resyncListener = new ActionListener<ResyncTask>() {
+                @Override
+                public void onResponse(final ResyncTask resyncTask) {
+                    try {
+                        snapshot.close();
+                        listener.onResponse(resyncTask);
+                    } catch (final Exception e) {
+                        onFailure(e);
+                    }
+                }
+
+                @Override
+                public void onFailure(final Exception e) {
+                    try {
+                        snapshot.close();
+                    } catch (final IOException inner) {
+                        e.addSuppressed(inner);
+                    } finally {
+                        listener.onFailure(e);
+                    }
                 }
-            }), listener);
+            };
             ShardId shardId = indexShard.shardId();
 
             // Wrap translog snapshot to make it synchronized as it is accessed by different threads through SnapshotSender.
@@ -120,9 +136,13 @@ public class PrimaryReplicaSyncer extends AbstractComponent {
                 }
             };
             resync(shardId, indexShard.routingEntry().allocationId().getId(), indexShard.getPrimaryTerm(), wrappedSnapshot,
-                startingSeqNo, listener);
+                startingSeqNo, resyncListener);
         } catch (Exception e) {
-            listener.onFailure(e);
+            if (resyncListener != null) {
+                resyncListener.onFailure(e);
+            } else {
+                listener.onFailure(e);
+            }
         }
     }

This is what I would prefer to see, basically undoing the change that you're proposing here. It makes it straightforward to see what is happening here.

Note also that I do not like the reassignment to listener; that also makes the code hard to follow.

bleskes (Contributor, Author):

These things are subjective. I personally prefer the wrapping because it allows us to ignore the crud of wrapping and focus on the functionality. Same goes for not re-assigning the listener (I prefer my version as the complexities are dealt with in the same place, when we wrap). I don't feel strongly about it and will happily go along with your version.

jasontedor (Member):

I disagree that it allows focusing on the functionality: to chase down what is really being executed here you have to run off and grok two methods; it's really not straightforward at all. With the listener defined front and center you can immediately see what is happening, with no need to chase anything down. Sorry, I feel very strongly about this one.

// Also fail the resync early if the shard is shutting down
Translog.Snapshot wrappedSnapshot = new Translog.Snapshot() {

@Override
jasontedor (Member):

Should this be synchronized too?

bleskes (Contributor, Author):

sure
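
For context, the wrapper under discussion delegates every Snapshot method under the same monitor. A minimal sketch, assuming Translog.Snapshot exposes totalOperations(), next() and close() (the shard-state check from the real code is elided):

```java
final Translog.Snapshot wrappedSnapshot = new Translog.Snapshot() {
    @Override
    public synchronized int totalOperations() {
        return snapshot.totalOperations();
    }

    @Override
    public synchronized Translog.Operation next() throws IOException {
        // the real code also fails the resync early here if the shard is shutting down
        return snapshot.next();
    }

    @Override
    public synchronized void close() throws IOException {
        snapshot.close();
    }
};
```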

if (snapshots.length == 0) {
onClose = () -> {};
} else {
assert Arrays.stream(snapshots).map(BaseTranslogReader::getGeneration).min(Long::compareTo).get()
jasontedor (Member):

Do we want a stronger condition here? That the snapshot generations are in sorted order?

bleskes (Contributor, Author):

I don't think so? Acquiring the min gen from the translog deletion policy will keep all the other ones around, which is what we care about here.

onClose = () -> {};
} else {
assert Arrays.stream(snapshots).map(BaseTranslogReader::getGeneration).min(Long::compareTo).get()
== snapshots[0].generation : "first reader generation of " + snapshots[0].generation + " is not the smallest";
jasontedor (Member):

Instead output the full array in the assertion message?

bleskes (Contributor, Author):

++
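
That is, roughly (a sketch, assuming java.util.Arrays and java.util.stream.Collectors are available):

```java
assert Arrays.stream(snapshots).map(BaseTranslogReader::getGeneration).min(Long::compareTo).get()
        == snapshots[0].generation
    : "first reader generation of " + snapshots[0].generation + " is not the smallest in: "
        + Arrays.stream(snapshots).map(BaseTranslogReader::getGeneration).collect(Collectors.toList());
```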

*/
synchronized long minTranslogGenRequired(List<TranslogReader> readers, TranslogWriter writer) throws IOException {
-long minByView = getMinTranslogGenRequiredByViews();
+long minByView = getMinTranslogGenRequiredByLocks();
jasontedor (Member):

Remove mention of views?

bleskes (Contributor, Author):

++
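
The agreed change is essentially renaming the local variable to match; roughly (a sketch: how the value is combined with the generation required by the last commit is an assumption about the surrounding class, not taken from the diff):

```java
synchronized long minTranslogGenRequired(List<TranslogReader> readers, TranslogWriter writer) throws IOException {
    long minByLocks = getMinTranslogGenRequiredByLocks();
    // minTranslogGenerationForRecovery is an assumed field name for the generation the
    // last lucene commit still needs; the real method may combine further constraints.
    return Math.min(minByLocks, minTranslogGenerationForRecovery);
}
```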

try (ReleasableLock ignored = writeLock.acquire()) {
-if (closed.get() && deletionPolicy.pendingViewsCount() == 0) {
+if (closed.get() && deletionPolicy.pendingTranslogRefCount() == 0) {
logger.trace("closing files. translog is closed and there are no pending views");
jasontedor (Member):

Remove mention of views from this trace message: closing files; translog is closed and there are no pending retention locks

bleskes (Contributor, Author):

good catch.

SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
if (snapshots == null || snapshots.entries().isEmpty()) {
-// Store newSnapshot here to be processed in clusterStateProcessed
+// Store newSnapshotFromGen here to be processed in clusterStateProcessed
jasontedor (Member):

I don't think so. 😄

bleskes (Contributor, Author):

:(


try {
-prepareTargetForTranslog(translogView.estimateTotalOperations(startingSeqNo));
+prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
jasontedor (Member):

Let's avoid invoking Translog#estimateTotalOperationsFromMinSeq twice here?

bleskes (Contributor, Author):

This could have changed? prepareTargetForTranslog has a network call in it, so the estimate may be different by the time we take it again.
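
To illustrate the point (method names follow the diff; the second call site is illustrative):

```java
// The first estimate is passed to the target before the network round trip ...
prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
// ... and a fresh estimate is taken afterwards, since more operations may have
// arrived while the request was in flight.
final int totalTranslogOps = translog.estimateTotalOperationsFromMinSeq(startingSeqNo);
```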

while ((operation = snapshot.next()) != null) {
if (operation.seqNo() != SequenceNumbersService.UNASSIGNED_SEQ_NO) {
tracker.markSeqNoAsCompleted(operation.seqNo());
try(Translog.Snapshot snapshot = shard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo)) {
jasontedor (Member):

Nit: space between try and (.

@bleskes bleskes (Contributor, Author) commented Jul 31, 2017

@jasontedor thx. I addressed all your feedback. Can you take another look?

@jasontedor jasontedor (Member) left a comment

LGTM.

@bleskes bleskes merged commit 9d10ffd into elastic:master Jul 31, 2017
@bleskes bleskes deleted the translog_closable_snaps branch July 31, 2017 15:30
@bleskes bleskes (Contributor, Author) commented Jul 31, 2017

Thanks @jasontedor

bleskes added a commit that referenced this pull request Aug 1, 2017
bleskes added a commit that referenced this pull request Aug 1, 2017
@lcawl lcawl removed the v6.1.0 label Dec 12, 2017
@clintongormley clintongormley added :Distributed Indexing/Distributed, :Distributed Indexing/Engine and removed :Translog, :Distributed Indexing/Distributed labels Feb 13, 2018
@jpountz jpountz removed :Distributed Indexing/Engine, v7.0.0 labels Jan 29, 2019