TSDB: Speed up _id query #84928
Conversation
@henningandersen told me that he and @jpountz had talked about using …
This speeds up the `term` query on `_id` in time series indices by skipping segments that don't contain any matching `@timestamp`s.
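The core idea can be sketched in isolation: if the timestamp embedded in the queried `_id` falls outside a segment's `@timestamp` min/max range, that segment cannot contain the document, so the term lookup can be skipped entirely. This is a hedged, self-contained sketch of the decision, not the actual Elasticsearch code; the names `SegmentStats` and `shouldSearch` are illustrative.

```java
// Illustrative sketch of the segment-skipping check described above.
// A real implementation would read the per-segment min/max of the
// @timestamp point values from Lucene; here we fake them with a record.
public class IdQuerySkipSketch {
    // Hypothetical holder for a segment's @timestamp bounds.
    record SegmentStats(long minTimestamp, long maxTimestamp) {}

    // Only search the segment if the id's timestamp is inside its range.
    static boolean shouldSearch(SegmentStats segment, long idTimestamp) {
        return idTimestamp >= segment.minTimestamp
            && idTimestamp <= segment.maxTimestamp;
    }

    public static void main(String[] args) {
        SegmentStats seg = new SegmentStats(1_000L, 2_000L);
        System.out.println(shouldSearch(seg, 1_500L)); // prints true
        System.out.println(shouldSearch(seg, 5_000L)); // prints false: skip
    }
}
```

The check is O(1) per segment, which is why it pays off on time series indices where most segments cover disjoint time ranges.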
LUCENE-8980 is the Lucene JIRA that added the early exit when the term being looked up isn't in the range covered by a segment. You can see the associated speedup in the benchmark annotation.
@nik9000 I'm not familiar with the index-time logic of deduplication for TSDB. Would this change only result in a search-time speedup for `term` queries, or would it also speed up ingestion by more efficiently skipping irrelevant segments?
We'd have to plumb it into the deduplication logic.
The speedup I saw in my tests was fairly in line with the benchmark. I'd have to use Rally to have a ton of confidence in the numbers, but it's in that ballpark.
```java
    return new MatchNoDocsQuery();
}
long timestamp = ByteUtils.readLongLE(suffix, 8);
return new TermQuery(new Term(NAME, new BytesRef(id))) {
```
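For readers unfamiliar with the `ByteUtils.readLongLE(suffix, 8)` call in the snippet, this is a sketch of what a little-endian long read does: assemble eight bytes starting at the given offset, least significant byte first. The helper below is illustrative, not the actual `ByteUtils` implementation.

```java
// Self-contained sketch of a little-endian long read, mirroring the
// shape of ByteUtils.readLongLE(bytes, offset) used in the snippet above.
public class ReadLongLESketch {
    static long readLongLE(byte[] bytes, int offset) {
        long value = 0;
        // Walk from the most significant byte down, shifting as we go.
        for (int i = 7; i >= 0; i--) {
            value = (value << 8) | (bytes[offset + i] & 0xFFL);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        long ts = 0x0102030405060708L;
        // Write the timestamp little-endian at offset 8, then read it back.
        for (int i = 0; i < 8; i++) {
            buf[8 + i] = (byte) (ts >>> (8 * i));
        }
        System.out.println(Long.toHexString(readLongLE(buf, 8))); // prints 102030405060708
    }
}
```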
Would it be an option to encode the timestamp as a prefix of the id in big-endian order, so that the optimization would work out of the box without needing a custom query?
Yeah. I'm certainly thinking about it.
I put the timestamp last to get the most shared prefixes. Maybe a bad choice, but it's what I was doing then. We can change it.
I encoded the timestamp in little-endian because we had a little-endian method. I still have an open follow-up to evaluate flipping it.
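The byte-order question above matters because the terms dictionary compares `_id` terms as unsigned bytes: a big-endian timestamp prefix sorts in timestamp order, while a little-endian one can flip the ordering. A small sketch (illustrative, not Elasticsearch code) that demonstrates this with `java.util.Arrays.compareUnsigned`:

```java
import java.util.Arrays;

// Demonstrates why big-endian encoding is needed for lexicographic
// byte order to agree with numeric timestamp order.
public class TimestampOrderSketch {
    static byte[] bigEndian(long v) {
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) b[i] = (byte) (v >>> (8 * (7 - i)));
        return b;
    }

    static byte[] littleEndian(long v) {
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) b[i] = (byte) (v >>> (8 * i));
        return b;
    }

    public static void main(String[] args) {
        long smaller = 0x0102L, larger = 0x0201L;
        // Big-endian: unsigned byte comparison matches numeric order.
        System.out.println(Arrays.compareUnsigned(
            bigEndian(smaller), bigEndian(larger)) < 0);   // prints true
        // Little-endian: the low byte comes first, so the order flips.
        System.out.println(Arrays.compareUnsigned(
            littleEndian(smaller), littleEndian(larger)) < 0); // prints false
    }
}
```

This is why the prefix idea only gives segment skipping "for free" if the timestamp bytes are laid out big-endian.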
A (very late) update: we care a lot about the size of the `_id` now, and we think that an encoding scheme like the one @jpountz mentioned would make it much bigger. So we're likely going to need this query.
It looks correct to me. The thing I'm unclear about is whether the complexity is worth the benefits, as I can't think of many use cases for doing ID queries on time series data. It feels to me like the important thing would be to have this skipping logic for index-time deduplication.
Same. I'll try to pick this up at some point and rig up indexing deduplication. And I'll also see what it'd cost us to get the deduplication for free by putting the timestamp at the front of the id.