Skip to content

Conversation

@dimitris-athanasiou
Copy link
Contributor

This commit adds a parent pipeline aggregation that allows
sorting the buckets of a parent multi-bucket aggregation.

The aggregation also offers [from] and [size] parameters
in order to truncate the result as desired.

Closes #14928

@dimitris-athanasiou
Copy link
Contributor Author

Note I'll come back and add the docs for this before merging the PR. Just wanted enough feedback to ensure this is going the right direction.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of comments about added JavaDocs but otherwise I think this is good so far, just needs documentation added

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a JavaDoc to explain what this is for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a JavaDoc explaining what this aggregation does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@dimitris-athanasiou
Copy link
Contributor Author

@colings86 I have now also added the doc page for bucket_sort. Could you take a look at that when you can?

@dimitris-athanasiou dimitris-athanasiou force-pushed the aggregations-bucket-sort-pipeline branch from 9fffe25 to d31fd0e Compare October 30, 2017 15:26
@dimitris-athanasiou dimitris-athanasiou added the >docs General docs changes label Oct 30, 2017
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need some validation (after parsing) that one of size or sort is defined? E.g. in case a user does:

{
    "bucket_sort": {
        "sort": []
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, not sure. If the user specifies neither, it will just be a no-op. What is the consistent thing to do here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, not sure either. I think I'd prefer it to throw an exception, but I could see the alternative argument too. E.g. your application assembles the queries and it omits both parameters in some situations, so you'd prefer it to not throw exceptions. Don't have a strong opinion on this one.

@colings86 any thoughts? Dimitris and I chatted about this on zoom and thought we'd see what you had to say :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should throw an exception. It seems like a clear mistake if the user does this and I'm not sure it would actually be a no-op because it would output a complete copy of the target aggregation in the response as well as the original?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be no-op because this is sorting inline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, sorry. I'd still like to throw an exception though as this is almost certainly a mistake by the user so they should be notified so they can correct it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pipelines are only called on the final reduce, so this check isn't needed. I was a bit confused about it too (the javadocs for isFinalReduce() is sorta misleading) and had to bug Colin :)

You can see that the incremental reduction passes null for pipelines in SearchPhaseController: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/search/SearchPhaseController.java#L518

So I don't think this check is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair enough! Will remove.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, and totally up to personal preference, I think truncate's functionality here can be done with a sublist?

new ArrayList<>(buckets.subList(offset, offset + currentSize)) or similar?

But that's just preference, doesn't bother me either way :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree with your preference :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be reading this wrong, but if a bucket is skipped I think the final list will be incorrectly sized? E.g.

  • resultSize == size == 5
  • bucketCount == 10
  • Bucket 2 is skipped

The conditional will prevent Bucket 2 from being added to the list, but the loop will decrement regardless and it'll only add four values to newBuckets, rather than grabbing Bucket 6 for a total of five results.

Would it be easier to just resolve the bucket value up front and if it's skipped, avoid adding it to the priority queue entirely? Also see the note below in ComparableBucket regarding resolving the values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just preemptively cache the bucket's value (and set skip) in the ctor for the ComparableBucket? resolveBucketValue() isn't terribly expensive or anything, but since this happens during comparison, every push/pop onto the priority queue's heap will invoke a bunch of comparisons and re-resolving these values.

The downside is that the cached values may not be needed if we're only sorting on keys. Dunno, thoughts?

Might be affected by how/if the above skipping thing is actually an issue or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: size + skipping issue from above, might be good to have a test for that scenario where an explicit size is set (which is less than the available buckets but spans across a skip).

Skips make everything more complicated :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"after all other sibling aggs" <-- potentially confusing since there are sibling pipeline aggs.

Maybe something like "... after all other bucket and metric aggregations"? Or "... after all non-pipeline aggregations"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a section of the docs (or just a note, somewhere) which mentions you can use bucket_sort without any sorts, as a way to limit the number of buckets? It's kinda implied since the docs say sort is optional, but maybe someone will overlook that? It's such a useful part of the feature I'd hate people to miss it :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example

@polyfractal
Copy link
Contributor

Left some comments, mostly little things. Only potentially thorny issue is how skips are handled / ComparableBucket resolving. But I may totally have misread that section of the code, so feel free to correct me :)

Otherwise I think this looks really good, and all the tests are lovely! ❤️ :)

This commit adds a parent pipeline aggregation that allows
sorting the buckets of a parent multi-bucket aggregation.

The aggregation also offers [from] and [size] parameters
in order to truncate the result as desired.

Closes elastic#14928
@dimitris-athanasiou dimitris-athanasiou force-pushed the aggregations-bucket-sort-pipeline branch from d31fd0e to 65ba372 Compare November 7, 2017 20:00
@dimitris-athanasiou
Copy link
Contributor Author

@polyfractal I have pushed a commit that addresses all feedback. Let me know what you think when you have time.

@polyfractal
Copy link
Contributor

LGTM! :)

@clintongormley clintongormley removed the >docs General docs changes label Nov 9, 2017
@dimitris-athanasiou dimitris-athanasiou merged commit 66bef26 into elastic:master Nov 9, 2017
@dimitris-athanasiou dimitris-athanasiou deleted the aggregations-bucket-sort-pipeline branch November 9, 2017 18:00
jasontedor added a commit that referenced this pull request Nov 9, 2017
* master: (22 commits)
  Update Tika version to 1.15
  Aggregations: bucket_sort pipeline aggregation (#27152)
  Introduce templating support to timezone/locale in DateProcessor (#27089)
  Increase logging on qa:mixed-cluster tests
  Update to AWS SDK 1.11.223 (#27278)
  Improve error message for parse failures of completion fields (#27297)
  Ensure external refreshes will also refresh internal searcher to minimize segment creation (#27253)
  Remove optimisations to reuse objects when applying a new `ClusterState` (#27317)
  Decouple `ChannelFactory` from Tcp classes (#27286)
  Fix find remote when building BWC
  Remove colons from task and configuration names
  Add unreleased 5.6.5 version number
  testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards`
  Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (#26844)
  Add limits for ngram and shingle settings (#27211) (#27318)
  Correct comment in index shard test
  Roll translog generation on primary promotion
  ObjectParser: Replace IllegalStateException with ParsingException (#27302)
  scripted_metric _agg parameter disappears if params are provided (#27159)
  Update discovery-ec2.asciidoc
  ...
dimitris-athanasiou added a commit that referenced this pull request Nov 10, 2017
This commit adds a parent pipeline aggregation that allows
sorting the buckets of a parent multi-bucket aggregation.

The aggregation also offers [from] and [size] parameters
in order to truncate the result as desired.

Closes #14928
return new BucketSelectorPipelineAggregationBuilder(name, script, bucketsPaths);
}

public static BucketSortPipelineAggregationBuilder bucketSort(String name, List<FieldSortBuilder> sorts) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is field sort builder specified, instead of a more flexible sort builder ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it can only refer to a bucket path in the parent aggregation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants