Add superset size to Significant Term REST response #24865

tlrx · 2017-05-24T16:22:07Z

This commit adds a new bg_count field to the REST response of SignificantTerms aggregations. Similarly to the bg_count that already exists in significant terms buckets, this new bg_count field is set at the aggregation level and is populated with the superset size value.

The addition of this field allows the High Level REST client to provide implementations of SignificantTerms and SignificantTerms.Bucket with the exact same behavior as the internal implementations. Before this pull request, a parsed significant terms aggregation didn't know about the superset size at all. The same was true of the aggregation's buckets, which threw unsupported operation exceptions because the aggregation's superset size and subset size were unknown. This PR fixes that and adds support for both fields, which are now populated at parsing time.

Note that the subset size field at the bucket could have been implemented before but I didn't know much about how to do that. Thanks to @markharwood it is now supported.

There's a bit of history around these fields (see #5146 (comment)) and the superset size information was not added to the aggregation in the first place. I think the main argument was that it could be retrieved at query time using a Global aggregation. This argument is still valid, but adding this new field provides better support for the Significant Terms aggregation in the High Level REST Client and it might also be useful in the future for a new chart type in Kibana.
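The parsing behavior described in the description above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class names `ParsedSignificantTermsSketch` and `ParsedBucketSketch` are hypothetical, and the sketch assumes that the aggregation-level `doc_count` carries the subset size while the new aggregation-level `bg_count` carries the superset size.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for a parsed bucket; NOT the actual Elasticsearch class.
final class ParsedBucketSketch {
    private final String term;
    private final long subsetDf;    // assumed to come from the bucket's "doc_count"
    private final long supersetDf;  // assumed to come from the bucket's "bg_count"
    private long subsetSize = -1;   // unknown until the parent aggregation is parsed
    private long supersetSize = -1; // unknown until the parent aggregation is parsed

    ParsedBucketSketch(String term, long subsetDf, long supersetDf) {
        this.term = term;
        this.subsetDf = subsetDf;
        this.supersetDf = supersetDf;
    }

    // Called by the parent aggregation once its own sizes are known.
    void setSizes(long subsetSize, long supersetSize) {
        this.subsetSize = subsetSize;
        this.supersetSize = supersetSize;
    }

    String getTerm() { return term; }
    long getDocCount() { return subsetDf; } // by definition, the subset df
    long getSubsetDf() { return subsetDf; }
    long getSupersetDf() { return supersetDf; }
    long getSubsetSize() { return subsetSize; }
    long getSupersetSize() { return supersetSize; }
}

// Hypothetical, simplified stand-in for a parsed aggregation; NOT the actual Elasticsearch class.
final class ParsedSignificantTermsSketch {
    private final long subsetSize;   // assumed: aggregation-level "doc_count"
    private final long supersetSize; // assumed: aggregation-level "bg_count"
    private final List<ParsedBucketSketch> buckets;

    ParsedSignificantTermsSketch(long docCount, long bgCount, List<ParsedBucketSketch> parsedBuckets) {
        this.subsetSize = docCount;
        this.supersetSize = bgCount;
        this.buckets = new ArrayList<>(parsedBuckets);
        // Propagate the aggregation-level sizes so bucket getters no longer need to throw.
        for (ParsedBucketSketch bucket : this.buckets) {
            bucket.setSizes(subsetSize, supersetSize);
        }
    }

    long getSubsetSize() { return subsetSize; }
    long getSupersetSize() { return supersetSize; }
    List<ParsedBucketSketch> getBuckets() { return Collections.unmodifiableList(buckets); }
}
```

With something along these lines, a client that parses a response could read `getSupersetSize()` from either the aggregation or any of its buckets instead of hitting an unsupported operation exception.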

tlrx · 2017-05-24T16:22:49Z

@clintongormley @markharwood I'll be happy to have your opinion on this.

cbuescher

@tlrx I took a look because I was curious and left two questions, however I might not be best qualified to approve the rest of this PR although to me it looks good.

cbuescher · 2017-05-24T16:38:23Z

...g/elasticsearch/search/aggregations/bucket/significant/InternalSignificantTermsTestCase.java

What's happening with expectedSigTerm.getDocCount()? I might be missing something, in which case a comment might be good.

By definition, getDocCount() returns the subset df value. The test used to check that getSubsetDf() returns the same value for both the internal and parsed aggregations, but it didn't check that getDocCount() returns the subset df as the internal implementation does. So I added this assertion.

Thanks for the clarification, that makes sense. Can you maybe add a comparison between either expectedSigTerm.getDocCount() and actualSigTerm.getSubsetDf(), or expectedSigTerm.getDocCount() and expectedSigTerm.getSubsetDf()? Then the explanation you give here would be clearer from the test.
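The comparison suggested in this thread might look something like the sketch below. It is an illustration only: `BucketView` and `DocCountAssertionSketch` are hypothetical names rather than the types used in InternalSignificantTermsTestCase, and plain `AssertionError`s stand in for the test framework's assertions.

```java
// Hypothetical minimal view of a significant terms bucket; NOT the real Elasticsearch interface.
interface BucketView {
    long getDocCount();
    long getSubsetDf();
}

final class DocCountAssertionSketch {

    // Checks that the doc count and the subset df agree, within and across implementations.
    static void assertDocCountMatchesSubsetDf(BucketView expected, BucketView actual) {
        // Makes explicit that getDocCount() is, by definition, the subset df.
        check(expected.getDocCount() == expected.getSubsetDf(), "expected: doc count != subset df");
        // The parsed bucket must agree with the internal one on both getters.
        check(expected.getSubsetDf() == actual.getSubsetDf(), "subset df differs");
        check(expected.getDocCount() == actual.getDocCount(), "doc count differs");
    }

    private static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    // Small factory used only to build example buckets for this sketch.
    static BucketView bucket(final long docCount, final long subsetDf) {
        return new BucketView() {
            @Override public long getDocCount() { return docCount; }
            @Override public long getSubsetDf() { return subsetDf; }
        };
    }
}
```

Putting the first check next to the other two would document the relationship between the two getters directly in the test, which is the clarity the review comment asks for.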

cbuescher · 2017-05-24T16:41:14Z

docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc

I don't understand this explanation anymore. How is 151 related to a maximum of 200 documents and 5 shards?

Indeed, it does not mean anything. I removed the "because we asked for a maximum of 200 from an index with 5 shards" part because I think it is wrong. The returned response has doc_count: 151 and not 1000 (see the now removed // TESTRESPONSE[s/1000/151/]).

Makes sense.

tlrx · 2017-05-30T08:56:15Z

@cbuescher Thanks! I rebased and changed the documentation. Can you please have another look?

cbuescher

LGTM. As I said I'm not too familiar with the underlying aggregation, so you need to decide if you want a second pair of eyes to look at this.

markharwood · 2017-05-30T12:17:54Z

...src/main/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTerms.java

+        /**
+         * @return The numbers of docs in the superset (also known as the background count
+         * of the containing aggregation).
+         */


Maybe change "also known as" to "ordinarily". It is possible to use a background_filter to redefine the scope of the background set you want to "diff" against.

Done, thanks!

tlrx · 2017-06-01T07:59:57Z

@markharwood Do you think that this can be merged now?

markharwood · 2017-06-01T11:23:50Z

LGTM!

tlrx · 2017-06-02T07:52:49Z

Thanks @cbuescher and @markharwood !

* master: (62 commits)
  Handle already closed while filling gaps
  [DOCS] Clarify behaviour of scripted-metric arg with empty parent buckets
  [DOCS] Clarify connections and gateway nodes selection in cross cluster search docs (elastic#24859)
  Java api: Remove unneeded getTookInMillis method (elastic#23923)
  Adds nodes usage API to monitor usages of actions (elastic#24169)
  Add superset size to Significant Term REST response (elastic#24865)
  Disallow multiple parent-join fields per mapping (elastic#25002)
  [Test] Remove unused test resources in core (elastic#25011)
  Scripting: Add optional context parameter to put stored script requests (elastic#25014)
  Extract a common base class for scroll executions (elastic#24979)
  Build: fix version sorting
  Build: Move verifyVersions to new branchConsistency task (elastic#25009)
  Add backwards compatibility indices
  Build: improve verifyVersions error message (elastic#25006)
  Add version 5.4.2 constant
  Docs: More search speed advices. (elastic#24802)
  Add version 5.3.3 constant
  Reorganize docs of global ordinals. (elastic#24982)
  Provide the TransportRequest during validation of a search context (elastic#24985)
  [TEST] fix SearchIT assertion to also accept took set to 0
  ...

tlrx added :Analytics/Aggregations Aggregations >enhancement review v5.5.0 v6.0.0 labels May 24, 2017

tlrx mentioned this pull request May 24, 2017

Add parsing to Significant Terms aggregations #24682

Merged

cbuescher reviewed May 24, 2017

tlrx force-pushed the add-superset-size branch from e6c5ce4 to 82f2b13 May 30, 2017 08:55

cbuescher approved these changes May 30, 2017

tlrx added 3 commits May 30, 2017 11:54

Apply feedback

ab6ff6f

change doc count assertion

149782f

tlrx force-pushed the add-superset-size branch from 82f2b13 to 149782f May 30, 2017 09:59

markharwood reviewed May 30, 2017

Change to "ordinarily"

7832026

tlrx merged commit 528bd25 into elastic:master Jun 2, 2017

tlrx deleted the add-superset-size branch June 2, 2017 07:52

clintongormley added v6.0.0-alpha2 v6.0.0 and removed v6.0.0 v6.0.0-alpha2 labels Jun 6, 2017

colings86 added v6.0.0-beta1 and removed v6.0.0 labels Jul 31, 2017

5 participants