
Member

nik9000 commented Mar 14, 2022

This speeds up the term query on _id in time series indices by
skipping segments that don't contain any matching @timestamps.

Member Author

nik9000 commented Mar 14, 2022

@henningandersen told me that he and @jpountz had talked about using @timestamp to skip whole segments for the _id query that we use for duplicate checking. I realized tonight that it'd be reasonably easy to do in time series mode because we know that there is an @timestamp and we can extract it from the _id. My totally non-scientific tests show this is faster. On my data not earth shakingly faster, but faster.
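
To make the idea concrete, here is a minimal sketch of the skip check, not the code in this PR. It assumes a hypothetical `_id` layout in which the last 8 bytes hold the `@timestamp` as a little-endian long, and it assumes each segment's minimum and maximum `@timestamp` are already known. The names `TimestampSegmentSkip`, `timestampFromId`, `segmentCanMatch`, and `buildId` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class TimestampSegmentSkip {
    // Assumption: the final 8 bytes of the decoded _id are the @timestamp, little endian.
    static long timestampFromId(byte[] id) {
        if (id.length < Long.BYTES) {
            throw new IllegalArgumentException("id too short to contain a timestamp");
        }
        return ByteBuffer.wrap(id, id.length - Long.BYTES, Long.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getLong();
    }

    // A segment can only contain the document if the timestamp falls inside
    // the segment's [min, max] @timestamp range; otherwise it can be skipped.
    static boolean segmentCanMatch(long timestamp, long segmentMin, long segmentMax) {
        return timestamp >= segmentMin && timestamp <= segmentMax;
    }

    // Demo helper: builds an id with the assumed layout (prefix bytes, then timestamp).
    static byte[] buildId(byte[] prefix, long timestamp) {
        ByteBuffer buf = ByteBuffer.allocate(prefix.length + Long.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(prefix).putLong(timestamp);
        return buf.array();
    }
}
```

With this shape, a lookup would read the timestamp out of the id once and only run the term lookup on segments for which `segmentCanMatch` returns true.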

Contributor

jpountz commented Mar 14, 2022

LUCENE-8980 is the Lucene JIRA that added the early exit when the term to look up isn't in the range managed by a segment. And you can see the associated speedup with annotation CS on nightly benchmarks.
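
As a rough illustration of that kind of early exit (a sketch only, not Lucene's actual implementation): terms in a segment are ordered by unsigned byte comparison, so a term that sorts below the segment's smallest term or above its largest term cannot be present. The class and method names below are hypothetical.

```java
import java.util.Arrays;

class TermRangeCheck {
    // Returns false when the term sorts outside [segmentMinTerm, segmentMaxTerm],
    // in which case the segment cannot contain it and the lookup can stop early.
    static boolean termMayExist(byte[] term, byte[] segmentMinTerm, byte[] segmentMaxTerm) {
        return Arrays.compareUnsigned(term, segmentMinTerm) >= 0
            && Arrays.compareUnsigned(term, segmentMaxTerm) <= 0;
    }
}
```

A true result only means the term might be there; the real lookup still has to run for that segment.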

Contributor

jpountz commented Mar 14, 2022

@nik9000 I'm not familiar with the index-time deduplication logic for TSDB. Would this change only speed up term queries at search time, or would it also speed up ingestion by skipping irrelevant segments more efficiently?

Member Author

nik9000 commented Mar 14, 2022 via email

Member Author

nik9000 commented Mar 14, 2022 via email

return new MatchNoDocsQuery();
}
long timestamp = ByteUtils.readLongLE(suffix, 8);
return new TermQuery(new Term(NAME, new BytesRef(id))) {
Contributor

jpountz Mar 14, 2022

Would it be an option to encode the timestamp as a prefix of the id in big endian order so that the optimization would work out-of-the-box without needing a custom query?
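
A small sketch of why byte order matters here, assuming non-negative timestamps and terms compared as unsigned bytes: big-endian encoding keeps numeric order under that comparison, while little-endian does not. The class `IdTimestampOrder` and its methods are hypothetical and only illustrate the ordering property.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class IdTimestampOrder {
    static byte[] bigEndian(long value) {
        return ByteBuffer.allocate(Long.BYTES).order(ByteOrder.BIG_ENDIAN).putLong(value).array();
    }

    static byte[] littleEndian(long value) {
        return ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
    }

    // Unsigned lexicographic comparison, the order in which terms are sorted.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```

For example, 255 and 256 compare in numeric order when encoded big endian, but in reverse order when encoded little endian, because the least significant byte comes first. With a big-endian timestamp prefix, ids from the same time range would sit next to each other in term order, which is what lets a plain term-range check rule out segments.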

Member Author

Yeah. I'm certainly thinking about it.

Member Author

I put the timestamp last to get the most shared prefixes. Maybe a bad choice, but it's what I was doing then. We can change it.

I encoded the timestamp in little endian because we had a little endian method. I still have an open follow up to evaluate flipping it.

Member Author

A (very late) update - we care a lot about the size of the _id now. And we think that an encoding scheme like @jpountz mentioned would make it much bigger. So we're likely going to need this query.

Contributor

jpountz commented Mar 21, 2022

It looks correct to me. The thing I'm unclear about is whether the complexity is worth the benefits as I can't think of many use-cases for doing ID queries on timeseries data. It feels to me like the important thing to do would be to have this skipping logic for index-time deduplication?

Member Author

nik9000 commented Mar 21, 2022

> It looks correct to me. The thing I'm unclear about is whether the complexity is worth the benefits as I can't think of many use-cases for doing ID queries on timeseries data. It feels to me like the important thing to do would be to have this skipping logic for index-time deduplication?

Same. I'll try to pick this up at some point and rig up indexing deduplication. I'll also see what it'd cost us to get the deduplication for free by putting the timestamp at the front of the id.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:08
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@quux00 quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023