-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Aggregations: bucket_sort pipeline aggregation #27152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregations: bucket_sort pipeline aggregation #27152
Conversation
|
Note I'll come back and add the docs for this before merging the PR. Just wanted enough feedback to ensure this is going the right direction. |
colings86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of comments about added JavaDocs but otherwise I think this is good so far, just needs documentation added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a JavaDoc to explain what this is for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a JavaDoc explaining what this aggregation does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
@colings86 I have now also added the doc page for |
9fffe25 to
d31fd0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need some validation (after parsing) that one of size or sort is defined? E.g. in case a user does:
{
"bucket_sort": {
"sort": []
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, not sure. If the user specifies neither, it will just be a no-op. What is the consistent thing to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, not sure either. I think I'd prefer it to throw an exception, but I could see the alternative argument too. E.g. your application assembles the queries and it omits both parameters in some situations, so you'd prefer it to not throw exceptions. Don't have a strong opinion on this one.
@colings86 any thoughts? Dimitris and I chatted about this on zoom and thought we'd see what you had to say :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should throw an exception. It seems like a clear mistake if the user does this and I'm not sure it would actually be a no-op because it would output a complete copy of the target aggregation in the response as well as the original?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be no-op because this is sorting inline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes, sorry. I'd still like to throw an exception though as this is almost certainly a mistake by the user so they should be notified so they can correct it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pipelines are only called on the final reduce, so this check isn't needed. I was a bit confused about it too (the javadocs for isFinalReduce() is sorta misleading) and had to bug Colin :)
You can see that the incremental reduction passes null for pipelines in SearchPhaseController: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/search/SearchPhaseController.java#L518
So I don't think this check is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair enough! Will remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, and totally up to personal preference, I think truncate's functionality here can be done with a sublist?
new ArrayList<>(buckets.subList(offset, offset + currentSize)) or similar?
But that's just preference, doesn't bother me either way :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely agree with your preference :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be reading this wrong, but if a bucket is skipped I think the final list will be incorrectly sized? E.g.
resultSize == size == 5bucketCount == 10- Bucket 2 is skipped
The conditional will prevent Bucket 2 from being added to the list, but the loop will decrement regardless and it'll only add four values to newBuckets, rather than grabbing Bucket 6 for a total of five results.
Would it be easier to just resolve the bucket value up front and if it's skipped, avoid adding it to the priority queue entirely? Also see the note below in ComparableBucket regarding resolving the values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just preemptively cache the bucket's value (and set skip) in the ctor for the ComparableBucket? resolveBucketValue() isn't terribly expensive or anything, but since this happens during comparison, every push/pop onto the priority queue's heap will invoke a bunch of comparisons and re-resolving these values.
The downside is that the cached values may not be needed if we're only sorting on keys. Dunno, thoughts?
Might be affected by how/if the above skipping thing is actually an issue or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: size + skipping issue from above, might be good to have a test for that scenario where an explicit size is set (which is less than the available buckets but spans across a skip).
Skips make everything more complicated :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"after all other sibling aggs" <-- potentially confusing since there are sibling pipeline aggs.
Maybe something like "... after all other bucket and metric aggregations"? Or "... after all non-pipeline aggregations"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a section of the docs (or just a note, somewhere) which mentions you can use bucket_sort without any sorts, as a way to limit the number of buckets? It's kinda implied since the docs say sort is optional, but maybe someone will overlook that? It's such a useful part of the feature I'd hate people to miss it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an example
|
Left some comments, mostly little things. Only potentially thorny issue is how skips are handled / Otherwise I think this looks really good, and all the tests are lovely! ❤️ :) |
This commit adds a parent pipeline aggregation that allows sorting the buckets of a parent multi-bucket aggregation. The aggregation also offers [from] and [size] parameters in order to truncate the result as desired. Closes elastic#14928
d31fd0e to
65ba372
Compare
|
@polyfractal I have pushed a commit that addresses all feedback. Let me know what you think when you have time. |
|
LGTM! :) |
* master: (22 commits) Update Tika version to 1.15 Aggregations: bucket_sort pipeline aggregation (#27152) Introduce templating support to timezone/locale in DateProcessor (#27089) Increase logging on qa:mixed-cluster tests Update to AWS SDK 1.11.223 (#27278) Improve error message for parse failures of completion fields (#27297) Ensure external refreshes will also refresh internal searcher to minimize segment creation (#27253) Remove optimisations to reuse objects when applying a new `ClusterState` (#27317) Decouple `ChannelFactory` from Tcp classes (#27286) Fix find remote when building BWC Remove colons from task and configuration names Add unreleased 5.6.5 version number testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards` Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (#26844) Add limits for ngram and shingle settings (#27211) (#27318) Correct comment in index shard test Roll translog generation on primary promotion ObjectParser: Replace IllegalStateException with ParsingException (#27302) scripted_metric _agg parameter disappears if params are provided (#27159) Update discovery-ec2.asciidoc ...
This commit adds a parent pipeline aggregation that allows sorting the buckets of a parent multi-bucket aggregation. The aggregation also offers [from] and [size] parameters in order to truncate the result as desired. Closes #14928
| return new BucketSelectorPipelineAggregationBuilder(name, script, bucketsPaths); | ||
| } | ||
|
|
||
| public static BucketSortPipelineAggregationBuilder bucketSort(String name, List<FieldSortBuilder> sorts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is field sort builder specified, instead of a more flexible sort builder ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it can only refer to a bucket path in the parent aggregation.
This commit adds a parent pipeline aggregation that allows
sorting the buckets of a parent multi-bucket aggregation.
The aggregation also offers [from] and [size] parameters
in order to truncate the result as desired.
Closes #14928