Skip to content

Conversation

@matarrese
Copy link
Contributor

Closes #28652

@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

1 similar comment
@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @matarrese for the PR! The approach looks good to me overall. Here are a few higher-level thoughts, in addition to the comments on the code:

  • It would be great to give an example of collect_mode in the significant terms documentation (https://github.com/elastic/elasticsearch/blob/master/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc), with a link to the section of the terms aggregation documentation that describes the different options.
  • We should aim to add a few more tests to cover these changes.
    • First, it’d be good to add a random collection mode in SignificantTermsTests to test parsing, equals and hashcode, etc.
    • It’d also be good to randomize the collection mode in SignificantTermsAggregatorTests, as we do in TermsAggregatorTests. The same goes for the integration test SignificantTermsSignificanceScoreIT.
    • Finally, we should sanity check that the collection mode is actually being propagated. I think this could be done in SignificantTermsAggregatorTests by creating an aggregator, and then checking shouldDefer.

super(in, ValuesSourceType.ANY);
bucketCountThresholds = new BucketCountThresholds(in);
executionHint = in.readOptionalString();
collectMode = in.readOptionalWriteable(SubAggCollectionMode::readFromStream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to add a version check here and ininnerWriteTo, since we may be communicating with older nodes that don't support collectMode. Here's an example of such a version check: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/bucket/range/InternalBinaryRange.java#L77. For now, we'll want to check against the current master version (Version.V_7_0_0_alpha1).


// return the SubAggCollectionMode that this aggregation should use based on the expected size
// and the cardinality of the field
static SubAggCollectionMode subAggCollectionMode(int expectedSize, long maxOrd) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to avoid duplicating the code in these two methods, as they're identical to TermsAggregatorFactory#subAggCollectionMode and getMaxOrd. Perhaps we could move these to ValuesSourceAggregatorFactory to make a common implementation available.

@rjernst rjernst removed the review label Oct 10, 2018
@jtibshirani
Copy link
Contributor

@matarrese is this something you're interested in picking up again? I wanted to check in case you'd prefer for us to take it over. I'm sorry it took so long to review after you first opened it.

@matarrese
Copy link
Contributor Author

Happy to work on this! I will update you soon. Thank you for bringing it up.

…-agg

* master: (528 commits)
  Register Azure max_retries setting (elastic#35286)
  add version 6.4.4
  [Docs] Add painless context details for bucket_script (elastic#35142)
  Upgrade jline to 3.8.2 (elastic#35288)
  SQL: new SQL CLI logo (elastic#35261)
  Logger: Merge ESLoggerFactory into Loggers (elastic#35146)
  Docs: Add section about range query for range type (elastic#35222)
  [ILM] change remove-policy-from-index http method from DELETE to POST (elastic#35268)
  [CCR] Forgot missing return statement,
  SQL: Fix null handling for AND and OR in SELECT (elastic#35277)
  [TEST] Mute ChangePolicyForIndexIT#testChangePolicyForIndex
  Serialize ignore_throttled also to 6.6 after backport
  Check for java 11 in buildSrc (elastic#35260)
  [TEST] increase await timeout in RemoteClusterConnectionTests
  Add missing up-to-date configuration (elastic#35255)
  Adapt Lucene BWC version
  SQL: Introduce Coalesce function (elastic#35253)
  Upgrade to lucene-8.0.0-snapshot-31d7dfe6b1 (elastic#35224)
  Fix failing ICU tests (elastic#35207)
  Prevent throttled indices to be searched through wildcards by default (elastic#34354)
  ...
@matarrese
Copy link
Contributor Author

Hi @jtibshirani,
sorry for this late update. Please have a look at the latest commit.
I've missed adding a test to check shouldefer value, it's not completely clear how I can test this. If you can gently provide more details I'm happy to update this.
Thanks

@jtibshirani
Copy link
Contributor

jtibshirani commented Nov 10, 2018

Thanks @matarrese for the changes, I will take another look shortly.

With respect to shouldDefer, I was thinking we could add a test in SignificantTermsAggregatorTests that does the following:

  1. Creates a SignificantTermsAggregationBuilder and sets collectMode
  2. Creates an aggregator from the builder through createAggregator
  3. Checks that the aggregator produces the right value for TermsAggregator#shouldDefer

This test around remapping global ordinals performs a similar set of steps: https://github.com/elastic/elasticsearch/blob/master/server/src/test/java/org/elasticsearch/search/aggregations/bucket/terms/TermsAggregatorTests.java#L113

@matarrese
Copy link
Contributor Author

Hi @jtibshirani please have a look at the new test introduces to test shouldDefer method.
I had to change its visibility to be accessible from SignificantTermsAggregatorTests, otherwise I could move it inside TermsAggregatorTest

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @matarrese, I'm really sorry for completely dropping this review on the floor. I really appreciate your contribution, and am looking forward to integrating this change!

I left a few small comments -- the PR looks close to being done. After you merge in the latest changes and resolve conflicts, I'll start a full CI build to make sure tests pass.

factory.backgroundFilter(QueryBuilders.termsQuery("foo", "bar"));
}
if (randomBoolean()) {
int collectMode = randomInt(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to use randomFrom(Aggregator.SubAggCollectionMode.values()) here (as you did in SignificantTermsSignificanceScoreIT.java).


Please note that Elasticsearch will ignore this execution hint if it is not applicable.

==== Collection mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of giving a detailed explanation of the breadth vs. depth-first modes here, I think it would be good to give a one-sentence summary and then simply link to the terms aggregation documentation. That way we don't have two separate explanations to try to keep in sync. The way to do this is to add a tag like [[search-aggregations-bucket-terms-aggregation-collect]] to the == Collect Mode == section in terms-aggregation.asciidoc. Then you could link to it here by doing <<search-aggregations-bucket-terms-aggregation-collect, terms aggregation collection mode>>.

Also, a super small comment, but maybe we could move this above the 'execution hint' section and name it 'collect mode' for consistency with the terms aggregation reference?


import static org.hamcrest.Matchers.equalTo;

public class SignificantTermsAggregatorFactoryTests extends ESTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have a single implementation of subAggCollectionMode in ValuesSourceAggregatorFactory, I think these tests are redundant.

* Get the maximum global ordinal value for the provided {@link ValuesSource} or -1
* if the values source is not an instance of {@link ValuesSource.Bytes.WithOrdinals}.
*/
static long getMaxOrd(ValuesSource source, IndexSearcher searcher) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we did for subAggregationMode, this method can also be moved to ValuesSourceAggregatorFactory to avoid duplication.

out.writeOptionalString(executionHint);
if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
out.writeOptionalWriteable(collectMode);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments:

  • Because I've taken so long to review, we'll need to update this version constant to be the latest (V_8_0_0). Once it's backported to previous versions, I'll make sure to adjust it.
  • If the node we're writing to is older than the current version, then writing a collection mode will result in an error, since that node is not expecting a field here. I'm going to check with the rest of the team what would be best to do in this case, and will get back to you. We could either skip the collection mode, or throw an error here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have this option in the builder but if we do I don't think we should throw an error either. This option should be treated as best effort so IMO it is ok to just ignore it for older nodes.

@matarrese
Copy link
Contributor Author

Thank you @jtibshirani for your update. Please have a look at the new commit.

@jtibshirani
Copy link
Contributor

@elasticmachine test this please

@jtibshirani
Copy link
Contributor

@matarrese thanks for the update, the latest round is looking good to me. I just kicked off a build and it looks like there are some checkstyle issues. To debug these locally, you can run ./gradlew checkstyle from the root folder of the repository. You can also try ./gradlew precommit --parallel to run a more extensive set of sanity checks.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments regarding how we want to expose the execution mode and how to test it.
Thanks for tackling this @matarrese and sorry for the delay reviewing it.

`exclude` parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features
described in the <<search-aggregations-bucket-terms-aggregation,terms aggregation>> documentation.

==== Collect Mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to expose this option to users. IMO this is different than the terms aggregation where the choice depends on the cardinality of the field, the significant_terms aggregation runs on high cardinality field so it should always pick the breadth_first mode.

out.writeOptionalString(executionHint);
if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
out.writeOptionalWriteable(collectMode);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have this option in the builder but if we do I don't think we should throw an error either. This option should be treated as best effort so IMO it is ok to just ignore it for older nodes.


private IncludeExclude includeExclude = null;
private String executionHint = null;
private SubAggCollectionMode collectMode = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have this option in the builder

subAggCollectMode, false, pipelineAggregators, metaData);
this.significanceHeuristic = significanceHeuristic;
this.termsAggFactory = termsAggFactory;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to change buildAggregation to call runDeferredCollections in order to run the deferred aggregations on the top buckets. You can check here how it is done for the terms aggregation on strings.

subAggCollectMode, false, includeExclude, pipelineAggregators, metaData);
this.significanceHeuristic = significanceHeuristic;
this.termsAggFactory = termsAggFactory;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to change buildAggregation to call runDeferredCollections in order to run the deferred aggregations on the top buckets. You can check here how it is done for the terms aggregation on strings.

Map<String, Object> metaData) throws IOException {
super(name, factories, valuesSource, null, format, bucketCountThresholds, includeExclude, context, parent,
forceRemapGlobalOrds, SubAggCollectionMode.DEPTH_FIRST, false, pipelineAggregators, metaData);
forceRemapGlobalOrds, subAggCollectMode, false, pipelineAggregators, metaData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go with BREADTH_FIRST all the time ;)

.addAggregation(terms("class").field("class").subAggregation(significantTerms("mySignificantTerms")
.field("text")
.executionHint(randomExecutionHint())
.collectMode(randomFrom(Aggregator.SubAggCollectionMode.values()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a test with sub-aggregations (a terms aggregation under a significant_terms for instance) to test the breadth_first mode. Without sub-aggregations depth_first and breadth_first are no different.

@jtibshirani
Copy link
Contributor

@jimczi’s review raised a few questions about this option that weren’t quite settled. I added them to the original issue (#28652) and asked some questions to start a discussion.

@jtibshirani
Copy link
Contributor

We caught up offline and confirmed @jimczi's recommendations for this PR:

  • For simplicity we should always choose breadth_first instead of dynamically selecting a mode based on the field's cardinality.
  • Unlike terms aggregations, the user should not be able to specify collect_mode. We'll avoid exposing this 'expert' option, and always go with the breadth_first mode.

Unfortunately this requires undoing some of the changes in this PR to select the mode dynamically and expose the collect_mode option. @matarrese would you be up for updating the PR with the simpler approach? My apologies that the approach wasn't more settled before we jumped in.

@jtibshirani
Copy link
Contributor

@matarrese I am happy to pick this up and make the requested changes. If it's okay with you, I could push to your branch so the commit will be 'co-authored' by you in git history.

@matarrese
Copy link
Contributor Author

Hey @jtibshirani, I'm fine with that.
Sorry for not replying sooner.

@jtibshirani jtibshirani self-assigned this Apr 9, 2019
@jtibshirani
Copy link
Contributor

@elasticmachine run elasticsearch-ci/1

@jtibshirani
Copy link
Contributor

@jimczi this is ready for another look whenever you have the chance.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for following on this @jtibshirani , I left one comment regarding the visibility change of shouldDefer, feel free to ignore if you think that we need to keep those tests.


@Override
protected boolean shouldDefer(Aggregator aggregator) {
public boolean shouldDefer(Aggregator aggregator) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another way to test this function without changing the visibility ? I understand it's for unit tests but the first condition is always true for significant terms and we don't verify the other condition in the added uts so is there a value to do this ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll just remove the test and avoid making this public -- the test was more valuable in previous versions of the PR when we were dynamically selecting the collection mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks

@jtibshirani
Copy link
Contributor

@elasticmachine test this please

@jtibshirani jtibshirani merged commit badb855 into elastic:master Apr 11, 2019
@jtibshirani jtibshirani changed the title SubAggCollectionMode field for SignificantTermAggregation Use the breadth first collection mode for significant terms aggs. Apr 11, 2019
jtibshirani pushed a commit that referenced this pull request Apr 11, 2019
…9042)

This helps avoid memory issues when computing deep sub-aggregations. Because it
should be rare to use sub-aggregations with significant terms, we opted to always
choose breadth first as opposed to exposing a `collect_mode` option.

Closes #28652.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Significant terms aggregations should support breadth_first mode

8 participants