Conversation

Contributor

@bleskes bleskes commented Jun 19, 2017

#25147 added the translog deletion policy but didn't enable it by default. This PR enables a default retention of 512MB (the same maximum size as the current translog) and an age of 12 hours (i.e., after 12 hours all translog files will be deleted). This increases the chance of an ops-based recovery, even if the primary flushed or the replica was offline for a few hours.
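
Roughly, the defaults above correspond to two index-scoped, dynamically updatable settings. A minimal sketch of how they might be declared (the setting keys follow the existing index.translog.* naming and are an assumption, not a quote from the diff):

// Sketch only, not the exact PR diff: retention settings carrying the defaults
// described above (12 hours / 512MB).
public static final Setting<TimeValue> INDEX_TRANSLOG_RETENTION_AGE_SETTING =
    Setting.timeSetting("index.translog.retention.age", TimeValue.timeValueHours(12),
        Setting.Property.Dynamic, Setting.Property.IndexScope);

public static final Setting<ByteSizeValue> INDEX_TRANSLOG_RETENTION_SIZE_SETTING =
    Setting.byteSizeSetting("index.translog.retention.size", new ByteSizeValue(512, ByteSizeUnit.MB),
        Setting.Property.Dynamic, Setting.Property.IndexScope);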

In order to see which parts of the translog are committed into Lucene, the translog stats are extended to include information about uncommitted operations.
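
Concretely, "uncommitted" here means operations living in translog generations at or above the generation referenced by the last Lucene commit, i.e., what would have to be replayed if the shard restarted right now. A rough sketch (field and method names illustrative, not the PR's exact code):

// Sketch only: derive the uncommitted stats from the readers a restart would replay.
int uncommittedOperations = 0;
long uncommittedSizeInBytes = 0;
for (BaseTranslogReader reader : readers) {
    if (reader.getGeneration() >= generationOfLastCommit) { // generationOfLastCommit: illustrative name
        uncommittedOperations += reader.totalOperations();
        uncommittedSizeInBytes += reader.sizeInBytes();
    }
}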

Views now include all translog ops and guarantee, as before, that those will not go away. Snapshotting a view allows filtering out generations that are not relevant, based on a specific sequence number.

I still have to write some docs and add a migration note, but I think we can start reviewing.

Relates to #10708

Contributor

@s1monw s1monw left a comment

left a bunch of comments

if (refreshTask.getInterval().equals(indexSettings.getRefreshInterval()) == false) {
    rescheduleRefreshTasks();
}
if (trimTranslogTask.getInterval().equals(indexSettings.getTranslogRetentionCheckInterval()) == false) {
Contributor

Can you elaborate on why you go down this route instead of using IndexShard#afterWriteOperation(), and maybe also do it in IndexShard#flush? I'd love to rather remove tasks than add them?! It's just way easier to reason about things if they happen due to a user interaction from the outside.

Contributor Author

Can you elaborate why you go down this route instead of using IndexShard#afterWriteOperation()

This is not instead of it but on top of it. The problem is that if people have a 12h retention policy, we still need to start cleaning stuff up 12 hours later even if nothing on the write path triggers it (say they indexed at full speed and then rolled over to a new index).

I'd love to rather remove tasks than add them?!

I'm totally with you. I spent some time thinking about ways to avoid the regular check, but everything I came up with ended up more complex. I considered things like scheduling the check on inactivity etc., but I always came back to this being the clearest/easiest. I'll be more than happy to hear about alternatives.
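
For context, a minimal sketch of the kind of periodic check being discussed (the scheduler, getEngine() and failShard() are stand-ins for the shard's own infrastructure, not the actual IndexService task; this scheduled variant was later pulled out of the PR and revisited in a follow-up):

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: periodically roll the generation and trim unreferenced translog
// files so retention kicks in even when no writes arrive to trigger it.
private void scheduleBackgroundTranslogTrim(ScheduledExecutorService scheduler, TimeValue interval) {
    scheduler.scheduleWithFixedDelay(() -> {
        try {
            final Translog translog = getEngine().getTranslog();
            translog.rollGeneration();          // make the current generation eligible for retention checks
            translog.trimUnreferencedReaders(); // delete generations outside the retention age/size
        } catch (Exception e) {
            failShard("background translog trim failed", e); // mirror the "fail the shard on error" intent
        }
    }, interval.millis(), interval.millis(), TimeUnit.MILLISECONDS);
}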

- engine.getTranslog().rollGeneration();
+ final Translog translog = engine.getTranslog();
+ translog.rollGeneration();
+ translog.trimUnreferencedReaders();
Contributor

Hmm, you run translog.trimUnreferencedReaders() under the engine lock here, but you run it without a lock there, so I wonder when we have to run under the lock and whether we can assert on that?

Contributor Author

That's a great observation. Strictly speaking we don't need to coordinate through the write lock here. In the place you linked to I was mostly interested in the error handling and in failing the engine if anything goes wrong (as we close the translog on error as well). I was worried about poisonous situations where we keep trying to get stuff from a closed translog. I will work on removing the read lock in the engine and fold this functionality (trim and roll) into the engine so we can be consistent. Agreed?

Contributor

I think the read lock is healthy here. I wonder if we should rather fold this part into the engine, not call it from the outside, and add a boolean rollGeneration parameter to the trim method?

Member

I too find it weird that we have IndexShard#trimTranslog, which goes through the engine (with error handling) but ultimately amounts to trimming unreferenced readers, and then we have this method that does the same without going through the engine, instead poking all the way through to the translog.
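
For reference, a sketch of the folded-in variant being discussed, assuming it lives in the engine next to the existing translog handling (the translog/readLock/shardId fields and the ensureOpen/failEngine helpers are the engine's usual ones; illustrative, not the exact merged code):

@Override
public void trimTranslog() throws EngineException {
    try (ReleasableLock ignored = readLock.acquire()) {
        ensureOpen();
        translog.rollGeneration();          // could alternatively be gated by a boolean rollGeneration parameter
        translog.trimUnreferencedReaders();
    } catch (Exception e) {
        try {
            failEngine("translog trim failed", e); // keep failing the engine on error, as discussed above
        } catch (Exception inner) {
            e.addSuppressed(inner);
        }
        throw new EngineException(shardId, "failed to trim translog", e);
    }
}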

}

private Stream<? extends BaseTranslogReader> readersAboveMinSeqNo(long minSeqNo) {
    assert readLock.isHeldByCurrentThread() || writeLock.isHeldByCurrentThread();
Contributor

Please have one assert per check; it's easier to see what failed.

Contributor Author

I'm confused - this is an or check. I will add a message.

Contributor

Never mind, a message is good.
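
i.e., something along the lines of (message text illustrative):

assert readLock.isHeldByCurrentThread() || writeLock.isHeldByCurrentThread() :
    "readersAboveMinSeqNo requires the translog read or write lock to be held";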

    // we're shutdown potentially on some tragic event, don't delete anything
    return;
}
if (readers.isEmpty()) {
Contributor

maybe instead of adding this intermediate return we can invert and only do the work if it's not empty?

Contributor Author

Yeah, I can do that and I thought about it. The reason I went with this is that to do something like a shouldTrimTranslog check where I can add this shortcut, I would also have to go and ask the retention policy for the current trimming situation, add tests, etc. Since this method is not called on every write, but rather on every roll / flush / background check (every 10m), I felt this was simpler. Let me know what you think.

Contributor

methods should have a single return statement IMO, everything else is more complex in 99% of the cases.
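
For the record, the two shapes being debated look roughly like this (sketches only):

// Option A (as in the PR): short-circuit with an early return when there is nothing to trim.
if (readers.isEmpty()) {
    return;
}
// ... compute the minimum referenced generation and delete older files ...

// Option B (the suggested inversion): a single exit point, with the work guarded.
if (readers.isEmpty() == false) {
    // ... compute the minimum referenced generation and delete older files ...
}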

}

synchronized long getViewCount(long viewGen) {
    return translogRefCounts.getOrDefault(viewGen, Counter.newCounter(false)).get();
Contributor

You are creating a new counter anyway even if one is already there. I think we can save that object and use computeIfAbsent?

Contributor Author

++. will change
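
The suggested change would look roughly like this (sketch only; note the small semantic difference that computeIfAbsent also inserts the fresh counter into the map, unlike getOrDefault):

// Before: allocates a throwaway Counter on every call, even when one already exists.
synchronized long getViewCount(long viewGen) {
    return translogRefCounts.getOrDefault(viewGen, Counter.newCounter(false)).get();
}

// After (as suggested): only allocates when the key is missing, at the cost of
// inserting the zero counter into the map as a side effect.
synchronized long getViewCount(long viewGen) {
    return translogRefCounts.computeIfAbsent(viewGen, gen -> Counter.newCounter(false)).get();
}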


if (isSequenceNumberBasedRecoveryPossible) {
    logger.trace("performing sequence numbers based recovery. starting at [{}]", request.startingSeqNo());
    startingSeqNo = request.startingSeqNo();
Contributor

The changes in here are unrelated to the trimming code. I wonder if we can do it in two steps and add the integration in a follow-up; that way reviews are much simpler and more effective?

Contributor Author

Yes, this became less important than it used to be after some iterations. That said, the change is mostly related to reporting and to the fact that acquiring a snapshot now takes a seq# in its constructor (to avoid scanning a large translog). I can try to roll that part back, but I was afraid it would change the system too much (we would stream more ops from the translog). I'm not sure it's worth it, tbh.
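
To illustrate the snapshot change being described (method and helper names are hypothetical, not necessarily the PR's exact API): creating the snapshot with the starting seq# lets the translog skip generations whose checkpoint holds no operations at or above it, instead of scanning everything.

// Hypothetical sketch of seq#-bounded replay during peer recovery.
final long startingSeqNo = request.startingSeqNo();
final Translog.Snapshot snapshot = translog.newSnapshotFromMinSeqNo(startingSeqNo); // hypothetical factory
Translog.Operation operation;
while ((operation = snapshot.next()) != null) {
    if (operation.seqNo() >= startingSeqNo) {
        replayOnTarget(operation); // hypothetical replay hook on the recovery target
    }
}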

Contributor

+727 −336: I think we should try harder to make these PRs more contained. Really!

Contributor Author

bleskes commented Jun 20, 2017

@s1monw thanks. I responded to some of your comments. Will apply code changes tomorrow.

Member

@jasontedor jasontedor left a comment

I left a few comments.


@Override
public void trimTranslog() throws EngineException {
    try (ReleasableLock lock = readLock.acquire()) {
Member

Would you mind renaming this to ignored? It keeps the IDE from complaining, and it makes it clear that the resource is intentionally unused through the scope of the try-with-resources block.


.filter(reader -> {
    final long maxSeqNo = reader.getCheckpoint().maxSeqNo;
    return maxSeqNo == SequenceNumbersService.UNASSIGNED_SEQ_NO ||
        maxSeqNo >= minSeqNo;
Member

Nit: I think this fits on the preceding line without wrapping.

private long translogSizeInBytes;
private int numberOfOperations;
private long uncommittedSizeInBytes;
private int uncommittedOperations;
Member

@jasontedor jasontedor Jun 21, 2017

I wonder, since these are summed in node stats, if this should be a long (to reduce the chance of overflow)?

Contributor Author

I agree, but I think we need to do this consistently and I didn't want to touch the other numberOfOperations field (and the bwc code that entails) in the same PR. That's why I did it like this.

Member

To be clear, you're saying that you'll address this in a follow-up?

Contributor Author

Will do.
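
To make the overflow concern concrete (sketch only; the per-shard getter name is hypothetical):

// When per-shard translog stats are summed into node-level stats, accumulating
// into an int can overflow once the totals get large enough; a long accumulator
// (or a long field) avoids that.
long totalUncommittedOperations = 0L;
for (TranslogStats shardStats : perShardStats) {                          // perShardStats: hypothetical collection
    totalUncommittedOperations += shardStats.getUncommittedOperations();  // each int value widened to long
}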

Contributor Author

bleskes commented Jun 21, 2017

@s1monw I took out all the scheduled trimming (we can decide what to do about it later) and did the engine changes.

@jasontedor I addressed your feedback.

Can you please look again?

Member

@jasontedor jasontedor left a comment

LGTM.

Contributor

@s1monw s1monw left a comment

LGTM 2

@bleskes bleskes merged commit d963882 into elastic:master Jun 22, 2017
Contributor Author

bleskes commented Jun 22, 2017

Thx @jasontedor @s1monw for the iterations.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 22, 2017
* master: (56 commits)
  Initialize max unsafe auto ID timestamp on shrink
  Enable a long translog retention policy by default (elastic#25294)
  Remove `index.mapping.single_type=false` from core/tests (elastic#25331)
  test: single type defaults to true since alpha1 and not alpha3
  Get short path name for native controllers
  Live primary-replica resync (no rollback) (elastic#24841)
  Upgrade to lucene-7.0.0-snapshot-ad2cb77. (elastic#25349)
  percolator: Deprecate `document_type` parameter.
  [DOCS] Fixed typo.
  [rest-api-spec/indices.refresh] Remove old params
  Remove redundant and broken MD5 checksum from repository-s3 (elastic#25270)
  Initialize sequence numbers on a shrunken index
  Port most snapshot/restore static bwc tests to qa:full-cluster-restart (elastic#25296)
  Javadoc: ThreadPool doesn't reject while shutdown (elastic#23678)
  test: verify `size_to_upgrade_in_bytes` in assertBusy(...)
  Docs: Removed duplicated line in mapping docs
  Add backward compatibility indices for 5.4.2
  Update MockTransportService to the age of Transport.Connection (elastic#25320)
  Add version v5.4.2 after release
  IndexMetaData: Add internal format index setting (elastic#25292)
  ...
bleskes added a commit that referenced this pull request Jul 10, 2017
We currently check whether translog files can be trimmed whenever we create a new translog generation or close a view. However, #25294 added a long translog retention period (12h, max 512MB by default), which means translog files may need to be cleaned up long after any indexing activity that would trigger flushes or the creation of new translog files has stopped. We therefore need a scheduled background check to clean up those files once they are no longer needed.

Relates to #10708
@clintongormley clintongormley added :Distributed Indexing/Distributed, :Distributed Indexing/Engine and removed :Translog, :Distributed Indexing/Distributed labels on Feb 13, 2018

Labels

:Distributed Indexing/Engine, >enhancement, v6.0.0-beta1
