Use the breadth first collection mode for significant terms aggs. #29042

matarrese · 2018-03-13T23:35:36Z

elasticmachine · 2018-03-13T23:35:38Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

elasticmachine · 2018-03-13T23:35:38Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

jtibshirani

Thank you @matarrese for the PR! The approach looks good to me overall. Here are a few higher-level thoughts, in addition to the comments on the code:

It would be great to give an example of collect_mode in the significant terms documentation (https://github.com/elastic/elasticsearch/blob/master/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc), with a link to the section of the terms aggregation documentation that describes the different options.
We should aim to add a few more tests to cover these changes.
- First, it’d be good to add a random collection mode in SignificantTermsTests to test parsing, equals and hashcode, etc.
- It’d also be good to randomize the collection mode in SignificantTermsAggregatorTests, as we do in TermsAggregatorTests. The same goes for the integration test SignificantTermsSignificanceScoreIT.
- Finally, we should sanity check that the collection mode is actually being propagated. I think this could be done in SignificantTermsAggregatorTests by creating an aggregator, and then checking shouldDefer.

jtibshirani · 2018-08-06T22:12:09Z

...elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregationBuilder.java

        super(in, ValuesSourceType.ANY);
        bucketCountThresholds = new BucketCountThresholds(in);
        executionHint = in.readOptionalString();
+        collectMode = in.readOptionalWriteable(SubAggCollectionMode::readFromStream);


We'll need to add a version check here and ininnerWriteTo, since we may be communicating with older nodes that don't support collectMode. Here's an example of such a version check: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/bucket/range/InternalBinaryRange.java#L77. For now, we'll want to check against the current master version (Version.V_7_0_0_alpha1).

jtibshirani · 2018-08-06T22:32:16Z

.../elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregatorFactory.java

+
+    // return the SubAggCollectionMode that this aggregation should use based on the expected size
+    // and the cardinality of the field
+    static SubAggCollectionMode subAggCollectionMode(int expectedSize, long maxOrd) {


It would be good to avoid duplicating the code in these two methods, as they're identical to TermsAggregatorFactory#subAggCollectionMode and getMaxOrd. Perhaps we could move these to ValuesSourceAggregatorFactory to make a common implementation available.

jtibshirani · 2018-10-18T00:31:02Z

@matarrese is this something you're interested in picking up again? I wanted to check in case you'd prefer for us to take it over. I'm sorry it took so long to review after you first opened it.

matarrese · 2018-10-18T08:52:02Z

Happy to work on this! I will update you soon. Thank you for bringing it up.

…-agg * master: (528 commits) Register Azure max_retries setting (elastic#35286) add version 6.4.4 [Docs] Add painless context details for bucket_script (elastic#35142) Upgrade jline to 3.8.2 (elastic#35288) SQL: new SQL CLI logo (elastic#35261) Logger: Merge ESLoggerFactory into Loggers (elastic#35146) Docs: Add section about range query for range type (elastic#35222) [ILM] change remove-policy-from-index http method from DELETE to POST (elastic#35268) [CCR] Forgot missing return statement, SQL: Fix null handling for AND and OR in SELECT (elastic#35277) [TEST] Mute ChangePolicyForIndexIT#testChangePolicyForIndex Serialize ignore_throttled also to 6.6 after backport Check for java 11 in buildSrc (elastic#35260) [TEST] increase await timeout in RemoteClusterConnectionTests Add missing up-to-date configuration (elastic#35255) Adapt Lucene BWC version SQL: Introduce Coalesce function (elastic#35253) Upgrade to lucene-8.0.0-snapshot-31d7dfe6b1 (elastic#35224) Fix failing ICU tests (elastic#35207) Prevent throttled indices to be searched through wildcards by default (elastic#34354) ...

matarrese · 2018-11-06T20:52:34Z

Hi @jtibshirani,
sorry for this late update. Please have a look at the latest commit.
I've missed adding a test to check shouldefer value, it's not completely clear how I can test this. If you can gently provide more details I'm happy to update this.
Thanks

jtibshirani · 2018-11-10T00:30:57Z

Thanks @matarrese for the changes, I will take another look shortly.

With respect to shouldDefer, I was thinking we could add a test in SignificantTermsAggregatorTests that does the following:

Creates a SignificantTermsAggregationBuilder and sets collectMode
Creates an aggregator from the builder through createAggregator
Checks that the aggregator produces the right value for TermsAggregator#shouldDefer

This test around remapping global ordinals performs a similar set of steps: https://github.com/elastic/elasticsearch/blob/master/server/src/test/java/org/elasticsearch/search/aggregations/bucket/terms/TermsAggregatorTests.java#L113

matarrese · 2018-11-18T10:20:14Z

Hi @jtibshirani please have a look at the new test introduces to test shouldDefer method.
I had to change its visibility to be accessible from SignificantTermsAggregatorTests, otherwise I could move it inside TermsAggregatorTest

jtibshirani

Hi @matarrese, I'm really sorry for completely dropping this review on the floor. I really appreciate your contribution, and am looking forward to integrating this change!

I left a few small comments -- the PR looks close to being done. After you merge in the latest changes and resolve conflicts, I'll start a full CI build to make sure tests pass.

jtibshirani · 2018-11-09T23:58:20Z

server/src/test/java/org/elasticsearch/search/aggregations/bucket/SignificantTermsTests.java

            factory.backgroundFilter(QueryBuilders.termsQuery("foo", "bar"));
        }
+        if (randomBoolean()) {
+            int collectMode = randomInt(1);


It would be nice to use randomFrom(Aggregator.SubAggCollectionMode.values()) here (as you did in SignificantTermsSignificanceScoreIT.java).

jtibshirani · 2018-11-22T03:11:25Z

docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc


 Please note that Elasticsearch will ignore this execution hint if it is not applicable.
+
+==== Collection mode


Instead of giving a detailed explanation of the breadth vs. depth-first modes here, I think it would be good to give a one-sentence summary and then simply link to the terms aggregation documentation. That way we don't have two separate explanations to try to keep in sync. The way to do this is to add a tag like [[search-aggregations-bucket-terms-aggregation-collect]] to the == Collect Mode == section in terms-aggregation.asciidoc. Then you could link to it here by doing <<search-aggregations-bucket-terms-aggregation-collect, terms aggregation collection mode>>.

Also, a super small comment, but maybe we could move this above the 'execution hint' section and name it 'collect mode' for consistency with the terms aggregation reference?

jtibshirani · 2018-11-22T03:21:24Z

...ticsearch/search/aggregations/bucket/significant/SignificantTermsAggregatorFactoryTests.java

+
+import static org.hamcrest.Matchers.equalTo;
+
+public class SignificantTermsAggregatorFactoryTests extends ESTestCase {


Now that we have a single implementation of subAggCollectionMode in ValuesSourceAggregatorFactory, I think these tests are redundant.

jtibshirani · 2019-02-07T02:08:14Z

.../elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregatorFactory.java

+     * Get the maximum global ordinal value for the provided {@link ValuesSource} or -1
+     * if the values source is not an instance of {@link ValuesSource.Bytes.WithOrdinals}.
+     */
+    static long getMaxOrd(ValuesSource source, IndexSearcher searcher) throws IOException {


As we did for subAggregationMode, this method can also be moved to ValuesSourceAggregatorFactory to avoid duplication.

jtibshirani · 2019-02-12T02:14:35Z

...elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregationBuilder.java

        out.writeOptionalString(executionHint);
+        if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
+            out.writeOptionalWriteable(collectMode);
+        } else {


A couple comments:

Because I've taken so long to review, we'll need to update this version constant to be the latest (V_8_0_0). Once it's backported to previous versions, I'll make sure to adjust it.

If the node we're writing to is older than the current version, then writing a collection mode will result in an error, since that node is not expecting a field here. I'm going to check with the rest of the team what would be best to do in this case, and will get back to you. We could either skip the collection mode, or throw an error here.

I don't think we should have this option in the builder but if we do I don't think we should throw an error either. This option should be treated as best effort so IMO it is ok to just ignore it for older nodes.

…-agg

matarrese · 2019-02-13T15:50:17Z

Thank you @jtibshirani for your update. Please have a look at the new commit.

jtibshirani · 2019-02-15T05:10:26Z

@elasticmachine test this please

jtibshirani · 2019-02-15T05:23:11Z

@matarrese thanks for the update, the latest round is looking good to me. I just kicked off a build and it looks like there are some checkstyle issues. To debug these locally, you can run ./gradlew checkstyle from the root folder of the repository. You can also try ./gradlew precommit --parallel to run a more extensive set of sanity checks.

jimczi

I left some comments regarding how we want to expose the execution mode and how to test it.
Thanks for tackling this @matarrese and sorry for the delay reviewing it.

jimczi · 2019-02-15T20:53:20Z

docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc

 `exclude` parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features
 described in the <<search-aggregations-bucket-terms-aggregation,terms aggregation>> documentation.

+==== Collect Mode


I don't think we need to expose this option to users. IMO this is different than the terms aggregation where the choice depends on the cardinality of the field, the significant_terms aggregation runs on high cardinality field so it should always pick the breadth_first mode.

jimczi · 2019-02-15T21:02:03Z

...elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregationBuilder.java

        out.writeOptionalString(executionHint);
+        if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
+            out.writeOptionalWriteable(collectMode);
+        } else {


I don't think we should have this option in the builder but if we do I don't think we should throw an error either. This option should be treated as best effort so IMO it is ok to just ignore it for older nodes.

jimczi · 2019-02-15T21:02:29Z

...elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregationBuilder.java


    private IncludeExclude includeExclude = null;
    private String executionHint = null;
+    private SubAggCollectionMode collectMode = null;


I don't think we should have this option in the builder

jimczi · 2019-02-15T21:02:57Z

...g/elasticsearch/search/aggregations/bucket/significant/SignificantStringTermsAggregator.java

+                subAggCollectMode, false, pipelineAggregators, metaData);
        this.significanceHeuristic = significanceHeuristic;
        this.termsAggFactory = termsAggFactory;
    }


You'll also need to change buildAggregation to call runDeferredCollections in order to run the deferred aggregations on the top buckets. You can check here how it is done for the terms aggregation on strings.

jimczi · 2019-02-15T21:03:38Z

...org/elasticsearch/search/aggregations/bucket/significant/SignificantLongTermsAggregator.java

+                subAggCollectMode, false, includeExclude, pipelineAggregators, metaData);
        this.significanceHeuristic = significanceHeuristic;
        this.termsAggFactory = termsAggFactory;
    }


You'll also need to change buildAggregation to call runDeferredCollections in order to run the deferred aggregations on the top buckets. You can check here how it is done for the terms aggregation on strings.

jimczi · 2019-02-15T21:05:55Z

...csearch/search/aggregations/bucket/significant/GlobalOrdinalsSignificantTermsAggregator.java

                                                    Map<String, Object> metaData) throws IOException {
        super(name, factories, valuesSource, null, format, bucketCountThresholds, includeExclude, context, parent,
-            forceRemapGlobalOrds, SubAggCollectionMode.DEPTH_FIRST, false, pipelineAggregators, metaData);
+            forceRemapGlobalOrds, subAggCollectMode, false, pipelineAggregators, metaData);


I'd go with BREADTH_FIRST all the time ;)

jimczi · 2019-02-15T21:10:00Z

...t/java/org/elasticsearch/search/aggregations/bucket/SignificantTermsSignificanceScoreIT.java

                    .addAggregation(terms("class").field("class").subAggregation(significantTerms("mySignificantTerms")
                            .field("text")
                            .executionHint(randomExecutionHint())
+                            .collectMode(randomFrom(Aggregator.SubAggCollectionMode.values()))


We also need a test with sub-aggregations (a terms aggregation under a significant_terms for instance) to test the breadth_first mode. Without sub-aggregations depth_first and breadth_first are no different.

jtibshirani · 2019-02-19T07:14:09Z

@jimczi’s review raised a few questions about this option that weren’t quite settled. I added them to the original issue (#28652) and asked some questions to start a discussion.

jtibshirani · 2019-03-07T01:00:34Z

We caught up offline and confirmed @jimczi's recommendations for this PR:

For simplicity we should always choose breadth_first instead of dynamically selecting a mode based on the field's cardinality.
Unlike terms aggregations, the user should not be able to specify collect_mode. We'll avoid exposing this 'expert' option, and always go with the breadth_first mode.

Unfortunately this requires undoing some of the changes in this PR to select the mode dynamically and expose the collect_mode option. @matarrese would you be up for updating the PR with the simpler approach? My apologies that the approach wasn't more settled before we jumped in.

jtibshirani · 2019-04-08T20:58:10Z

@matarrese I am happy to pick this up and make the requested changes. If it's okay with you, I could push to your branch so the commit will be 'co-authored' by you in git history.

matarrese · 2019-04-09T00:28:32Z

Hey @jtibshirani, I'm fine with that.
Sorry for not replying sooner.

…-agg

jtibshirani · 2019-04-11T18:05:47Z

@elasticmachine run elasticsearch-ci/1

jtibshirani · 2019-04-11T18:06:00Z

@jimczi this is ready for another look whenever you have the chance.

jimczi

LGTM, thanks for following on this @jtibshirani , I left one comment regarding the visibility change of shouldDefer, feel free to ignore if you think that we need to keep those tests.

jimczi · 2019-04-11T18:33:58Z

server/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/TermsAggregator.java


    @Override
-    protected boolean shouldDefer(Aggregator aggregator) {
+    public boolean shouldDefer(Aggregator aggregator) {


Is there another way to test this function without changing the visibility ? I understand it's for unit tests but the first condition is always true for significant terms and we don't verify the other condition in the added uts so is there a value to do this ?

I think I'll just remove the test and avoid making this public -- the test was more valuable in previous versions of the PR when we were dynamically selecting the collection mode.

jtibshirani · 2019-04-11T19:36:53Z

@elasticmachine test this please

…9042) This helps avoid memory issues when computing deep sub-aggregations. Because it should be rare to use sub-aggregations with significant terms, we opted to always choose breadth first as opposed to exposing a `collect_mode` option. Closes #28652.

SubAggCollectionMode field for SignificantTermAggregation

93f1a70

SignificantTermsAggregatorFactoryTests added

2302e59

cbuescher added review :Analytics/Aggregations Aggregations labels Mar 15, 2018

colings86 added the >feature label Apr 24, 2018

jtibshirani reviewed Aug 6, 2018

View reviewed changes

rjernst removed the review label Oct 10, 2018

matarrese added 3 commits November 5, 2018 23:09

Merge remote-tracking branch 'upstream/master'

3f60fb6

documentation and refactoring

a9f5e3e

testing should defer

0b8f805

jtibshirani reviewed Feb 12, 2019

View reviewed changes

matarrese added 2 commits February 13, 2019 07:53

Merge branch 'master' into breadth_first-support-for-significant-term…

d11841b

…-agg

refactor significant terms aggregator with documentation

311c5ec

jimczi requested changes Feb 15, 2019

View reviewed changes

jtibshirani mentioned this pull request Feb 19, 2019

Significant terms aggregations should support breadth_first mode #28652

Closed

jtibshirani self-assigned this Apr 9, 2019

jtibshirani added 5 commits April 10, 2019 14:12

Merge branch 'master' into breadth_first-support-for-significant-term…

86b358a

…-agg

Always use the breadth first collection mode.

ac3136e

Update the docs to clarify that breadth_first is always used.

dbe8398

Merge branch 'master' into breadth_first-support-for-significant-term…

082b61b

…-agg

Minor clean-ups.

3669efe

jimczi approved these changes Apr 11, 2019

View reviewed changes

Remove SignificantTermsAggregatorTests#testShouldDefer.

8d9e246

jtibshirani merged commit badb855 into elastic:master Apr 11, 2019

jtibshirani changed the title ~~SubAggCollectionMode field for SignificantTermAggregation~~ Use the breadth first collection mode for significant terms aggs. Apr 11, 2019

jtibshirani added v7.2.0 v8.0.0 v7.0.1 and removed v7.0.1 labels Apr 11, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021


		Please note that Elasticsearch will ignore this execution hint if it is not applicable.

		==== Collection mode


		import static org.hamcrest.Matchers.equalTo;

		public class SignificantTermsAggregatorFactoryTests extends ESTestCase {

Use the breadth first collection mode for significant terms aggs. #29042

Use the breadth first collection mode for significant terms aggs. #29042

Uh oh!

Conversation

matarrese commented Mar 13, 2018

Uh oh!

elasticmachine commented Mar 13, 2018

Uh oh!

elasticmachine commented Mar 13, 2018

Uh oh!

jtibshirani left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtibshirani commented Oct 18, 2018

Uh oh!

matarrese commented Oct 18, 2018

Uh oh!

matarrese commented Nov 6, 2018

Uh oh!

jtibshirani commented Nov 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

matarrese commented Nov 18, 2018

Uh oh!

jtibshirani left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matarrese commented Feb 13, 2019

Uh oh!

jtibshirani commented Feb 15, 2019

Uh oh!

jtibshirani commented Feb 15, 2019

Uh oh!

jimczi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtibshirani commented Feb 19, 2019

Uh oh!

jtibshirani commented Mar 7, 2019

Uh oh!

jtibshirani commented Apr 8, 2019

Uh oh!

matarrese commented Apr 9, 2019

Uh oh!

jtibshirani commented Apr 11, 2019

Uh oh!

jtibshirani commented Apr 11, 2019

Uh oh!

jimczi left a comment

jtibshirani left a comment •

edited

Loading

jtibshirani commented Nov 10, 2018 •

edited

Loading