-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Move early termination based on index sort to TopDocs collector #27666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move early termination based on index sort to TopDocs collector #27666
Conversation
jpountz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this change. It took me some time to understand how it works (not your fault!) but it looks good to me in general. I left a suggestion about a potential improvement to the early-terminating logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be wrong, but I think we could simplify it a bit by doing something like that:
// implicit total hit counts are valid only when there is no filter collector in the chain
int count = hasFilterCollector ? -1 : shortcutTotalHitCount(reader, query);
boolean doTrackTotalHits = trackTotalHits && count == -1; // we can also skip total hit counts if the query gives it for free
final TopDocsCollector<?> topDocsCollector = TopFieldCollector.create(sortAndFormats.sort, numHits,
(FieldDoc) searchAfter, true, trackMaxScore, trackMaxScore, doTrackTotalHits);
This way, we don't need to check whether the collector can actually early-terminate or not, and we never have to add a TotalHitCount collector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still want the top docs collector to early terminate if track_total_hits is true.
This is why we have the complex logic that wraps a counting collector in the next block.
We could also add a filtered collector that intercepts CollectionTerminatedException coming from the top docs collector and continue the collection with a simple counting collector to get the total hit count. This way we can always assume that the TopDocsCollector can early terminate ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 09a033a to simplify the logic. Can you take another look ?
53eb99f to
09a033a
Compare
Lucene TopDocs collector are now able to early terminate the collection based on the index sort. This change plugs this new functionality directly in the query phase instead of relying on a dedicated early terminating collector.
09a033a to
1be6c29
Compare
Lucene TopDocs collector are now able to early terminate the collection based on the index sort. This change plugs this new functionality directly in the query phase instead of relying on a dedicated early terminating collector.
* es/master: (45 commits) Adapt scroll rest test after backport. relates #27842 Move early termination based on index sort to TopDocs collector (#27666) Upgrade beats templates that we use for bwc testing. (#27929) ingest: upgraded ingest geoip's geoip2's dependencies. [TEST] logging for update by query test #27820 Add elasticsearch-nio jar for base nio classes (#27801) Use full profile on JDK 10 builds Require Gradle 4.3 Enable grok processor to support long, double and boolean (#27896) Add unreleased v6.1.2 version TEST: reduce blob size #testExecuteMultipartUpload Check index under the store metadata lock (#27768) Fixes DocStats to not report index size < -1 (#27863) Fixed test to be up to date with the new database files. Upgrade to Lucene 7.2.0. (#27910) Disable TestZenDiscovery in cloud providers integrations test Use `_refresh` to shrink the version map on inactivity (#27918) Make KeyedLock reentrant (#27920) ingest: Upgraded the geolite2 databases. [Test] Fix IndicesClientDocumentationIT (#27899) ...
The QueryCollectorContext abstraction was introduced by #24864 based on the requirement that the top docs collector creation needed to be delayed until after all the other collectors had been created. At the same time, collectors get wrapped depending on the search features enabled by the request, but the top score / total hit count collector is the root collector where the wrapping starts, which is why its corresponding context gets added at position 0 in the list of collector contexts. Requirements have changed since #27666 , which means that we can go back to a simpler way of creating collectors and wrapping them. We no longer need a QueryCollectorContext abstraction, and we can instead create collectors straight-away, and wrap them as needed. This is much easier to follow compared to the very generic create(Collector) method that the context exposes. TopDocsCollectorContext adds some value in that it incorporates all the logic around creating the top docs collector, yet it can be further simplified as well by making the postProcess method more specific.
Lucene TopDocs collector are now able to early terminate the collection based on index sort (https://issues.apache.org/jira/browse/LUCENE-8059). This change plugs this new functionality directly in the query phase instead of relying on a dedicated early terminating sorting collector.