-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Use the breadth first collection mode for significant terms aggs. #29042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use the breadth first collection mode for significant terms aggs. #29042
Conversation
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
1 similar comment
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @matarrese for the PR! The approach looks good to me overall. Here are a few higher-level thoughts, in addition to the comments on the code:
- It would be great to give an example of
collect_modein the significant terms documentation (https://github.com/elastic/elasticsearch/blob/master/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc), with a link to the section of thetermsaggregation documentation that describes the different options. - We should aim to add a few more tests to cover these changes.
- First, it’d be good to add a random collection mode in
SignificantTermsTeststo test parsing, equals and hashcode, etc. - It’d also be good to randomize the collection mode in
SignificantTermsAggregatorTests, as we do inTermsAggregatorTests. The same goes for the integration testSignificantTermsSignificanceScoreIT. - Finally, we should sanity check that the collection mode is actually being propagated. I think this could be done in
SignificantTermsAggregatorTestsby creating an aggregator, and then checkingshouldDefer.
- First, it’d be good to add a random collection mode in
| super(in, ValuesSourceType.ANY); | ||
| bucketCountThresholds = new BucketCountThresholds(in); | ||
| executionHint = in.readOptionalString(); | ||
| collectMode = in.readOptionalWriteable(SubAggCollectionMode::readFromStream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to add a version check here and ininnerWriteTo, since we may be communicating with older nodes that don't support collectMode. Here's an example of such a version check: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/bucket/range/InternalBinaryRange.java#L77. For now, we'll want to check against the current master version (Version.V_7_0_0_alpha1).
|
|
||
| // return the SubAggCollectionMode that this aggregation should use based on the expected size | ||
| // and the cardinality of the field | ||
| static SubAggCollectionMode subAggCollectionMode(int expectedSize, long maxOrd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to avoid duplicating the code in these two methods, as they're identical to TermsAggregatorFactory#subAggCollectionMode and getMaxOrd. Perhaps we could move these to ValuesSourceAggregatorFactory to make a common implementation available.
|
@matarrese is this something you're interested in picking up again? I wanted to check in case you'd prefer for us to take it over. I'm sorry it took so long to review after you first opened it. |
|
Happy to work on this! I will update you soon. Thank you for bringing it up. |
…-agg * master: (528 commits) Register Azure max_retries setting (elastic#35286) add version 6.4.4 [Docs] Add painless context details for bucket_script (elastic#35142) Upgrade jline to 3.8.2 (elastic#35288) SQL: new SQL CLI logo (elastic#35261) Logger: Merge ESLoggerFactory into Loggers (elastic#35146) Docs: Add section about range query for range type (elastic#35222) [ILM] change remove-policy-from-index http method from DELETE to POST (elastic#35268) [CCR] Forgot missing return statement, SQL: Fix null handling for AND and OR in SELECT (elastic#35277) [TEST] Mute ChangePolicyForIndexIT#testChangePolicyForIndex Serialize ignore_throttled also to 6.6 after backport Check for java 11 in buildSrc (elastic#35260) [TEST] increase await timeout in RemoteClusterConnectionTests Add missing up-to-date configuration (elastic#35255) Adapt Lucene BWC version SQL: Introduce Coalesce function (elastic#35253) Upgrade to lucene-8.0.0-snapshot-31d7dfe6b1 (elastic#35224) Fix failing ICU tests (elastic#35207) Prevent throttled indices to be searched through wildcards by default (elastic#34354) ...
|
Hi @jtibshirani, |
|
Thanks @matarrese for the changes, I will take another look shortly. With respect to
This test around remapping global ordinals performs a similar set of steps: https://github.com/elastic/elasticsearch/blob/master/server/src/test/java/org/elasticsearch/search/aggregations/bucket/terms/TermsAggregatorTests.java#L113 |
|
Hi @jtibshirani please have a look at the new test introduces to test |
jtibshirani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @matarrese, I'm really sorry for completely dropping this review on the floor. I really appreciate your contribution, and am looking forward to integrating this change!
I left a few small comments -- the PR looks close to being done. After you merge in the latest changes and resolve conflicts, I'll start a full CI build to make sure tests pass.
| factory.backgroundFilter(QueryBuilders.termsQuery("foo", "bar")); | ||
| } | ||
| if (randomBoolean()) { | ||
| int collectMode = randomInt(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to use randomFrom(Aggregator.SubAggCollectionMode.values()) here (as you did in SignificantTermsSignificanceScoreIT.java).
|
|
||
| Please note that Elasticsearch will ignore this execution hint if it is not applicable. | ||
|
|
||
| ==== Collection mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of giving a detailed explanation of the breadth vs. depth-first modes here, I think it would be good to give a one-sentence summary and then simply link to the terms aggregation documentation. That way we don't have two separate explanations to try to keep in sync. The way to do this is to add a tag like [[search-aggregations-bucket-terms-aggregation-collect]] to the == Collect Mode == section in terms-aggregation.asciidoc. Then you could link to it here by doing <<search-aggregations-bucket-terms-aggregation-collect, terms aggregation collection mode>>.
Also, a super small comment, but maybe we could move this above the 'execution hint' section and name it 'collect mode' for consistency with the terms aggregation reference?
|
|
||
| import static org.hamcrest.Matchers.equalTo; | ||
|
|
||
| public class SignificantTermsAggregatorFactoryTests extends ESTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have a single implementation of subAggCollectionMode in ValuesSourceAggregatorFactory, I think these tests are redundant.
| * Get the maximum global ordinal value for the provided {@link ValuesSource} or -1 | ||
| * if the values source is not an instance of {@link ValuesSource.Bytes.WithOrdinals}. | ||
| */ | ||
| static long getMaxOrd(ValuesSource source, IndexSearcher searcher) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we did for subAggregationMode, this method can also be moved to ValuesSourceAggregatorFactory to avoid duplication.
| out.writeOptionalString(executionHint); | ||
| if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) { | ||
| out.writeOptionalWriteable(collectMode); | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments:
- Because I've taken so long to review, we'll need to update this version constant to be the latest (
V_8_0_0). Once it's backported to previous versions, I'll make sure to adjust it. - If the node we're writing to is older than the current version, then writing a collection mode will result in an error, since that node is not expecting a field here. I'm going to check with the rest of the team what would be best to do in this case, and will get back to you. We could either skip the collection mode, or throw an error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have this option in the builder but if we do I don't think we should throw an error either. This option should be treated as best effort so IMO it is ok to just ignore it for older nodes.
|
Thank you @jtibshirani for your update. Please have a look at the new commit. |
|
@elasticmachine test this please |
|
@matarrese thanks for the update, the latest round is looking good to me. I just kicked off a build and it looks like there are some checkstyle issues. To debug these locally, you can run |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments regarding how we want to expose the execution mode and how to test it.
Thanks for tackling this @matarrese and sorry for the delay reviewing it.
| `exclude` parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features | ||
| described in the <<search-aggregations-bucket-terms-aggregation,terms aggregation>> documentation. | ||
|
|
||
| ==== Collect Mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to expose this option to users. IMO this is different than the terms aggregation where the choice depends on the cardinality of the field, the significant_terms aggregation runs on high cardinality field so it should always pick the breadth_first mode.
| out.writeOptionalString(executionHint); | ||
| if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) { | ||
| out.writeOptionalWriteable(collectMode); | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have this option in the builder but if we do I don't think we should throw an error either. This option should be treated as best effort so IMO it is ok to just ignore it for older nodes.
|
|
||
| private IncludeExclude includeExclude = null; | ||
| private String executionHint = null; | ||
| private SubAggCollectionMode collectMode = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have this option in the builder
| subAggCollectMode, false, pipelineAggregators, metaData); | ||
| this.significanceHeuristic = significanceHeuristic; | ||
| this.termsAggFactory = termsAggFactory; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also need to change buildAggregation to call runDeferredCollections in order to run the deferred aggregations on the top buckets. You can check here how it is done for the terms aggregation on strings.
| subAggCollectMode, false, includeExclude, pipelineAggregators, metaData); | ||
| this.significanceHeuristic = significanceHeuristic; | ||
| this.termsAggFactory = termsAggFactory; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also need to change buildAggregation to call runDeferredCollections in order to run the deferred aggregations on the top buckets. You can check here how it is done for the terms aggregation on strings.
| Map<String, Object> metaData) throws IOException { | ||
| super(name, factories, valuesSource, null, format, bucketCountThresholds, includeExclude, context, parent, | ||
| forceRemapGlobalOrds, SubAggCollectionMode.DEPTH_FIRST, false, pipelineAggregators, metaData); | ||
| forceRemapGlobalOrds, subAggCollectMode, false, pipelineAggregators, metaData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd go with BREADTH_FIRST all the time ;)
| .addAggregation(terms("class").field("class").subAggregation(significantTerms("mySignificantTerms") | ||
| .field("text") | ||
| .executionHint(randomExecutionHint()) | ||
| .collectMode(randomFrom(Aggregator.SubAggCollectionMode.values())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need a test with sub-aggregations (a terms aggregation under a significant_terms for instance) to test the breadth_first mode. Without sub-aggregations depth_first and breadth_first are no different.
|
We caught up offline and confirmed @jimczi's recommendations for this PR:
Unfortunately this requires undoing some of the changes in this PR to select the mode dynamically and expose the |
|
@matarrese I am happy to pick this up and make the requested changes. If it's okay with you, I could push to your branch so the commit will be 'co-authored' by you in git history. |
|
Hey @jtibshirani, I'm fine with that. |
|
@elasticmachine run elasticsearch-ci/1 |
|
@jimczi this is ready for another look whenever you have the chance. |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for following on this @jtibshirani , I left one comment regarding the visibility change of shouldDefer, feel free to ignore if you think that we need to keep those tests.
|
|
||
| @Override | ||
| protected boolean shouldDefer(Aggregator aggregator) { | ||
| public boolean shouldDefer(Aggregator aggregator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another way to test this function without changing the visibility ? I understand it's for unit tests but the first condition is always true for significant terms and we don't verify the other condition in the added uts so is there a value to do this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll just remove the test and avoid making this public -- the test was more valuable in previous versions of the PR when we were dynamically selecting the collection mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks
|
@elasticmachine test this please |
Closes #28652