Goodbye, Translog Views #25962
Conversation
jasontedor left a comment
In general it looks good. I think the main point of contention is the abstraction for the chained listeners in the primary/replica resync code. I think it's unnecessary and makes the code hard to follow; how it was (and what I propose) is clearer. Also, I left a comment about the InternalEngineTests, although I'm not sure if anything can be done (I leave that to you to figure out). The rest, I think, are minor comments.
    };
}

static <Response> ActionListener<Response> chain(ActionListener<Response> first, ActionListener<Response> second) {
Javadocs please (although I'm unsure if this is really needed, see also my comment at the call site).
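For context, a minimal sketch of what such a chaining helper could look like; the body below is an assumption for illustration (it simply invokes both listeners in order), not the code under review, and it assumes `org.elasticsearch.action.ActionListener`:

```java
// Hypothetical sketch of a listener-chaining helper; semantics assumed, not taken from the PR.
static <Response> ActionListener<Response> chain(ActionListener<Response> first, ActionListener<Response> second) {
    return new ActionListener<Response>() {
        @Override
        public void onResponse(Response response) {
            first.onResponse(response);  // e.g. close the translog snapshot
            second.onResponse(response); // then notify the original caller
        }

        @Override
        public void onFailure(Exception e) {
            first.onFailure(e);          // e.g. close the snapshot
            second.onFailure(e);         // then propagate the failure to the original caller
        }
    };
}
```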
  final long startingSeqNo = indexShard.getGlobalCheckpoint() + 1;
- Translog.Snapshot snapshot = view.snapshot(startingSeqNo);
+ Translog.Snapshot snapshot = indexShard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo);
+ listener = chain(wrap(r -> snapshot.close(), e -> {
I'm not sure; I recoil when I see this wrapping and wrapping, and I find it harder than necessary to follow. I'd like to avoid adding an abstraction used in exactly one place; we can add it later if we see the same pattern arise a few more times. Right now, I think what is already there is more straightforward and makes it immediately clear what is going on:
diff --git a/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java b/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java
index 9313176d9c..ce0d36c7cc 100644
--- a/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java
+++ b/core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java
@@ -80,17 +80,33 @@ public class PrimaryReplicaSyncer extends AbstractComponent {
this.chunkSize = chunkSize;
}
- public void resync(IndexShard indexShard, ActionListener<ResyncTask> listener) {
+ public void resync(IndexShard indexShard, final ActionListener<ResyncTask> listener) {
+ ActionListener<ResyncTask> resyncListener = null;
try {
final long startingSeqNo = indexShard.getGlobalCheckpoint() + 1;
Translog.Snapshot snapshot = indexShard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo);
- listener = chain(wrap(r -> snapshot.close(), e -> {
- try {
- snapshot.close();
- } catch (IOException e1) {
- e.addSuppressed(e1);
+ resyncListener = new ActionListener<ResyncTask>() {
+ @Override
+ public void onResponse(final ResyncTask resyncTask) {
+ try {
+ snapshot.close();
+ listener.onResponse(resyncTask);
+ } catch (final Exception e) {
+ onFailure(e);
+ }
+ }
+
+ @Override
+ public void onFailure(final Exception e) {
+ try {
+ snapshot.close();
+ } catch (final IOException inner) {
+ e.addSuppressed(inner);
+ } finally {
+ listener.onFailure(e);
+ }
}
- }), listener);
+ };
ShardId shardId = indexShard.shardId();
// Wrap translog snapshot to make it synchronized as it is accessed by different threads through SnapshotSender.
@@ -120,9 +136,13 @@ public class PrimaryReplicaSyncer extends AbstractComponent {
}
};
resync(shardId, indexShard.routingEntry().allocationId().getId(), indexShard.getPrimaryTerm(), wrappedSnapshot,
- startingSeqNo, listener);
+ startingSeqNo, resyncListener);
} catch (Exception e) {
- listener.onFailure(e);
+ if (resyncListener != null) {
+ resyncListener.onFailure(e);
+ } else {
+ listener.onFailure(e);
+ }
}
}

This is what I would prefer to see, basically undoing the change that you're proposing here. It makes it straightforward to see what is happening here.
Note also that I do not like the reassignment to listener; that also makes the code hard to follow.
These things are subjective. I personally prefer the wrapping because it allows you to ignore the crud of wrapping and focus on the functionality. The same goes for not re-assigning the listener (I prefer my version as the complexities are dealt with in the same place, when we wrap). I don't feel strongly about it and will happily go along with your version.
I disagree that it allows focusing on the functionality; to chase down what is really being executed here you have to run off and grok two methods, so it's really not straightforward at all. With the listener defined front and center you can immediately see what is happening, with no need to chase anything down. Sorry, I feel very strongly about this one.
// Also fail the resync early if the shard is shutting down
Translog.Snapshot wrappedSnapshot = new Translog.Snapshot() {

    @Override
Should this be synchronized too?
sure
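For illustration, a minimal sketch of the synchronized-wrapper pattern being discussed. The method set (totalOperations(), next(), close()) and the delegation body are assumptions based on the surrounding context, not code copied from the PR:

```java
// Sketch only: delegate every method under the same monitor so SnapshotSender threads
// never interleave accesses to the underlying snapshot.
Translog.Snapshot wrappedSnapshot = new Translog.Snapshot() {
    @Override
    public synchronized int totalOperations() {
        return snapshot.totalOperations();
    }

    @Override
    public synchronized Translog.Operation next() throws IOException {
        // "Also fail the resync early if the shard is shutting down" would be checked here
        return snapshot.next();
    }

    @Override
    public synchronized void close() throws IOException {
        snapshot.close();
    }
};
```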
if (snapshots.length == 0) {
    onClose = () -> {};
} else {
    assert Arrays.stream(snapshots).map(BaseTranslogReader::getGeneration).min(Long::compareTo).get()
Do we want a stronger condition here? That the snapshot generations are in sorted order?
I don't think so? Acquiring the min gen from the translog deletion policy will keep all the other ones around, which is what we care about here.
    onClose = () -> {};
} else {
    assert Arrays.stream(snapshots).map(BaseTranslogReader::getGeneration).min(Long::compareTo).get()
            == snapshots[0].generation : "first reader generation of " + snapshots[0].generation + " is not the smallest";
Instead output the full array in the assertion message?
++
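As a self-contained illustration of that suggestion (hypothetical names, not the PR's code), the assertion message can report the whole generations array rather than only the first element:

```java
import java.util.Arrays;

public class AssertMessageExample {
    // Illustrative only: fail with a message that shows every generation, not just the first.
    static void checkSmallestFirst(long[] generations) {
        assert generations.length == 0
                || Arrays.stream(generations).min().getAsLong() == generations[0]
                : "first generation " + generations[0] + " is not the smallest in " + Arrays.toString(generations);
    }

    public static void main(String[] args) {
        checkSmallestFirst(new long[] {3, 4, 7}); // passes (run with -ea to enable assertions)
    }
}
```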
 */
synchronized long minTranslogGenRequired(List<TranslogReader> readers, TranslogWriter writer) throws IOException {
-     long minByView = getMinTranslogGenRequiredByViews();
+     long minByView = getMinTranslogGenRequiredByLocks();
Remove mention of views?
++
try (ReleasableLock ignored = writeLock.acquire()) {
-     if (closed.get() && deletionPolicy.pendingViewsCount() == 0) {
+     if (closed.get() && deletionPolicy.pendingTranslogRefCount() == 0) {
          logger.trace("closing files. translog is closed and there are no pending views");
Remove mention of views from this trace message: "closing files; translog is closed and there are no pending retention locks".
good catch.
SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
if (snapshots == null || snapshots.entries().isEmpty()) {
-     // Store newSnapshot here to be processed in clusterStateProcessed
+     // Store newSnapshotFromGen here to be processed in clusterStateProcessed
I don't think so. 😄
:(
try {
-     prepareTargetForTranslog(translogView.estimateTotalOperations(startingSeqNo));
+     prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
Let's avoid invoking Translog#estimateTotalOperationsFromMinSeq twice here?
This could have changed? prepareTargetForTranslog has a network call in it.
while ((operation = snapshot.next()) != null) {
    if (operation.seqNo() != SequenceNumbersService.UNASSIGNED_SEQ_NO) {
        tracker.markSeqNoAsCompleted(operation.seqNo());
try(Translog.Snapshot snapshot = shard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo)) {
Nit: space between try and (.
@jasontedor thx. I addressed all your feedback. Can you take another look?
jasontedor left a comment
LGTM.
Thanks @jasontedor
During peer recoveries, we need to copy over Lucene files and replay the operations they miss from the source translog. Guaranteeing that translog files are not cleaned up has seen many iterations over time. Back in the old 1.0 days, recoveries went through the Engine and actively prevented both translog cleaning and Lucene commits. We then moved to a notion called Translog Views, which allowed the recovery code to "acquire" a view into the translog that is then guaranteed to be kept around until the view is closed. The Engine code was free to commit Lucene and do whatever it wanted without coordinating with recoveries. Translog file deletion logic was based on reference counting at the file level. Those counters were incremented when a view was acquired but also when the view was used to create a `Snapshot` that allowed you to read operations from the files. At some point we removed the file-based counting complexity in favor of constructs at the Translog level that just keep track of "open" views and the minimum translog generation they refer to. To do so, views had to be kept around until the last snapshot made from them was consumed. This was fine in recovery code but led to [a subtle bug](#25862) in the [Primary Replica Resyncer](#25862).

Concurrently, we have developed the notion of a `TranslogDeletionPolicy`, which is responsible for the liveness aspect of translog files. This class makes it very simple to take translog snapshots into account when keeping translog files around, allowing callers that just need a snapshot to simply take one and not worry about views and such. Recovery code that actually does need a view can now prevent trimming by acquiring a simple retention lock (a `Closeable`). This removes the need for the notion of a View.
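To make the retention-lock idea concrete, here is a minimal sketch of the concept; the class and method names are hypothetical and this is not the Elasticsearch implementation. A policy hands out a Closeable per acquirer, and translog generations may only be trimmed once no open lock still refers to them:

```java
import java.io.Closeable;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of the retention-lock concept described above.
class SimpleRetentionPolicy {
    // translog generation -> number of open retention locks anchored at that generation
    private final SortedMap<Long, Integer> retainedGens = new TreeMap<>();

    // Acquire a lock that keeps the given generation and everything after it around until closed.
    synchronized Closeable acquireRetentionLock(long currentGen) {
        retainedGens.merge(currentGen, 1, Integer::sum);
        return () -> release(currentGen);
    }

    private synchronized void release(long gen) {
        retainedGens.merge(gen, -1, (count, delta) -> count + delta == 0 ? null : count + delta);
    }

    // A generation may be trimmed only if it is older than every generation still under a lock.
    synchronized boolean canTrim(long generation) {
        return retainedGens.isEmpty() || generation < retainedGens.firstKey();
    }
}
```

This mirrors the behaviour described above: callers that only need operations take a snapshot, while recovery code that needs files kept around acquires a lock and simply closes it when done.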