-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Enable a long translog retention policy by default #25294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…n_part2 # Conflicts: # core/src/main/java/org/elasticsearch/index/shard/IndexShard.java
s1monw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a bunch of comments
| if (refreshTask.getInterval().equals(indexSettings.getRefreshInterval()) == false) { | ||
| rescheduleRefreshTasks(); | ||
| } | ||
| if (trimTranslogTask.getInterval().equals(indexSettings.getTranslogRetentionCheckInterval()) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate why you go down this route instead of using IndexShard#afterWriteOperation() and maybe also do it in IndexShard#flush? I'd love to rather remove tasks than adding them?! it's just way easier to reason about things if they are happening due to a user interaction form the outside
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate why you go down this route instead of using IndexShard#afterWriteOperation()
This is not instead but on top of it. The problem is that if people have a 12h retention policy and we need to start cleaning stuff 12hs later (assuming they index full speed and roll over to a new index).
I'd love to rather remove tasks than adding them?!
I'm totally with you. I spent some time thinking about ways to avoid the regular check but everything I came up with ended up more complex. I considered stuff scheduling things in in activity etc.. always come back to this being the clearest/easiest. I'll be more than happy to hear about alternatives.
| engine.getTranslog().rollGeneration(); | ||
| final Translog translog = engine.getTranslog(); | ||
| translog.rollGeneration(); | ||
| translog.trimUnreferencedReaders(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm you run translog.trimUnreferencedReaders(); under the engine lock here but you run it without a lock here so I wonder when we have to run under lock and if we can assert on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great observation. Strictly speaking we don't need to coordinate through the write lock here. In the place you linked to I was mostly interested in the error handling and that we fail the engine if anything goes wrong (as we close the translog on error as well). I was worried for poisonous situations where we keep trying to get stuff from a closed translog. I will work on removing the read lock in the engine and fold this functionality (trim and roll) into the engine so we can be consistent. agreed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the readlock is healthy here, I wonder if we would rather fold this part into the engine and don't call it from the outside and add a parameter boolean rollGeneration to the trim method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I too find it weird that we have IndexShard#trimTranslog that goes through the engine (with error handling) but ultimately amounts to trimming unreferenced readers, and we have this method that does the same without going through the engine instead poking all the way through to the translog.
| } | ||
|
|
||
| private Stream<? extends BaseTranslogReader> readersAboveMinSeqNo(long minSeqNo) { | ||
| assert readLock.isHeldByCurrentThread() || writeLock.isHeldByCurrentThread(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please have one assert per check it's easier to see what failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused - this is an or check. I will add a message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind message is good
| // we're shutdown potentially on some tragic event, don't delete anything | ||
| return; | ||
| } | ||
| if (readers.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe instead of adding this intermediate return we can invert and only do the work if it's not empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I can do that and thought about it. The reason I went with this is that to do something like shouldTrimTranslog where I can add this shortcut, I will also have to go and ask the retention policy for the current trimming situation and add tests etc. Since this method is not called on every write, but I rather on every roll / flush/ background check every 10m.. I felt this was simpler. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
methods should have a single return statement IMO, everything else is more complex in 99% of the cases.
| } | ||
|
|
||
| synchronized long getViewCount(long viewGen) { | ||
| return translogRefCounts.getOrDefault(viewGen, Counter.newCounter(false)).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are creating a new counter anyway even if it's there I think we can safe that object and use computeIfAbsent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++. will change
|
|
||
| if (isSequenceNumberBasedRecoveryPossible) { | ||
| logger.trace("performing sequence numbers based recovery. starting at [{}]", request.startingSeqNo()); | ||
| startingSeqNo = request.startingSeqNo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the changes in here are unrelated to the trimming code, I wonder if we can do it in 2 steps and add the integration in a followup, that way reviews are much simpler and more effective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this became less important than what it used to be after some iterations. That said the change is mostly related to reporting and that acquiring a snapshot now has a seq# in it's constructor (to avoid scanning a large translog). I can try to roll that part back but I was afraid it will change the system too much (we will stream more ops from the translog). I'm not sure it's worth it tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+727 −336 I think we should try harder to make them more contained. really!
|
@s1monw thanks. I responded to some of you comment. Will apply code changes tomorrow. |
This reverts commit 286d4a3.
jasontedor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments.
|
|
||
| @Override | ||
| public void trimTranslog() throws EngineException { | ||
| try (ReleasableLock lock = readLock.acquire()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind renaming this to ignored? It keeps the IDE from complaining, and it makes it clear that the resource is intentionally unused through the scope of the try-with-resources block.
| engine.getTranslog().rollGeneration(); | ||
| final Translog translog = engine.getTranslog(); | ||
| translog.rollGeneration(); | ||
| translog.trimUnreferencedReaders(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I too find it weird that we have IndexShard#trimTranslog that goes through the engine (with error handling) but ultimately amounts to trimming unreferenced readers, and we have this method that does the same without going through the engine instead poking all the way through to the translog.
| .filter(reader -> { | ||
| final long maxSeqNo = reader.getCheckpoint().maxSeqNo; | ||
| return maxSeqNo == SequenceNumbersService.UNASSIGNED_SEQ_NO || | ||
| maxSeqNo >= minSeqNo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this fits on the preceding line without wrapping.
| private long translogSizeInBytes; | ||
| private int numberOfOperations; | ||
| private long uncommittedSizeInBytes; | ||
| private int uncommittedOperations; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if since these are summed in node stats if this should be a long (to reduce the chance of overflow)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I think we need to do this consistently and I didn't want to touch the other numberOfOperations field (and the bwc code that entails) in the same PR. That's why I did it like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, you're saying that you'll address this in a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can will do.
|
@s1monw I took out all the scheduled trimming (we can decide what to do about it later) and did the engine changes. @jasontedor I addressed your feedback. Can you please look again? |
jasontedor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
s1monw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 2
|
Thx @jasontedor @s1monw for the iterations. |
* master: (56 commits) Initialize max unsafe auto ID timestamp on shrink Enable a long translog retention policy by default (elastic#25294) Remove `index.mapping.single_type=false` from core/tests (elastic#25331) test: single type defaults to true since alpha1 and not alpha3 Get short path name for native controllers Live primary-replica resync (no rollback) (elastic#24841) Upgrade to lucene-7.0.0-snapshot-ad2cb77. (elastic#25349) percolator: Deprecate `document_type` parameter. [DOCS] Fixed typo. [rest-api-spec/indices.refresh] Remove old params Remove redundant and broken MD5 checksum from repository-s3 (elastic#25270) Initialize sequence numbers on a shrunken index Port most snapshot/restore static bwc tests to qa:full-cluster-restart (elastic#25296) Javadoc: ThreadPool doesn't reject while shutdown (elastic#23678) test: verify `size_to_upgrade_in_bytes` in assertBusy(...) Docs: Removed duplicated line in mapping docs Add backward compatibility indices for 5.4.2 Update MockTransportService to the age of Transport.Connection (elastic#25320) Add version v5.4.2 after release IndexMetaData: Add internal format index setting (elastic#25292) ...
We currently check whether translog files can be trimmed whenever we create a new translog generation or close a view. However #25294 added a long translog retention period (12h, max 512MB by default), which means translog files should potentially be cleaned up long after there isn't any indexing activity to trigger flushes/the creation of new translog files. We therefore need a scheduled background check to clean up those files once they are no longer needed. Relates to #10708
#25147 added the translog deletion policy but didn't enable it by default. This PR enables a default retention of 512MB (same maximum size of the current translog) and an age of 12 hours (i.e., after 12 hours all translog files will be deleted). This increases to chance to have an ops based recovery, even if the primary flushed or the replica was offline for a few hours.
In order to see which parts of the translog are committed into lucene the translog stats are extended to include information about uncommitted operations.
Views now include all translog ops and guarantee, as before, that those will not go away. Snapshotting a view allows to filter out generations that are not relevant based on a specific sequence number.
I still have to write some docs and add a migration note, but I think we can start reviewing.
Relates to #10708