Skip to content

Conversation

@tlrx
Copy link
Member

@tlrx tlrx commented May 24, 2017

This commit adds a new bg_count field to the REST response of SignificantTerms aggregations. Similarly to the bg_count that already exists in significant terms buckets, this new bg_count field is set at the aggregation level and is populated with the superset size value.

The addition of this field allows the High Level REST client to provide implementations of SignificantTermsand SignificantTerms.Bucket with the exact same behavior as the internal implementations. Before this pull request, a significant term aggregation didn't know about the superset size at all. Same thing for the aggregation's buckets that throw unsupported operation exceptions because the aggregation's superset size and subset size were unkown. This PR fixes that and adds support for both fields that are now populated at parsing time.

Note that the subset size field at the bucket could have been implemented before but I didn't know much about how to do that. Thanks to @markharwood it is now supported.

There's a bit of history around these fields (see #5146 (comment)) and the superset size information was not added to the aggregation at the first place. I think the main argument was that it could be retrieved at query time using a Global aggregation. This argument is still valid, but adding this new field provides a better support of Significant Terms aggregation in the High Level REST Client and it might also be useful in the future for a new chart type in Kibana.

@tlrx
Copy link
Member Author

tlrx commented May 24, 2017

@clintongormley @markharwood I'll be happy to have your opinion on this.

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlrx I took a look because I was curious and left two questions, however I might not be best qualified to approve the rest of this PR although to me it looks good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening with expectedSigTerm.getDocCount()? I might miss something, in that case a comment might be good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By definition, getDocCount() returns the subset df value. The test used to test that getSubsetDf() return the same value for both the internal and parsed aggregations, but it didn't test that getDocCount() returns the susbset df as the internal implementation does. So I added this assertion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, that makes sense. Can you maybe add a comparison between either expectedSigTerm.getDocCount(), actualSigTerm.getSubsetDf() or expectedSigTerm.getDocCount(), expectedSigTerm.getSubsetDf(), then the explanation you give here would be clearer from the test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this explanation anymore, how is 151 related to a maximum on 200 documetns and 5 shards?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it does not mean anything. I remove the "because we asked for a maximum of 200 from an index with 5 shards" because I think it is wrong. The returned response has doc_count: 151 and not 1000 (see the now removed // TESTRESPONSE[s/1000/151/] )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@tlrx tlrx force-pushed the add-superset-size branch from e6c5ce4 to 82f2b13 Compare May 30, 2017 08:55
@tlrx
Copy link
Member Author

tlrx commented May 30, 2017

@cbuescher Thanks! I rebased and changed the documentation. Can you please have another look?

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. As I said I'm not too familiar with the underlying aggregation, so you need to decide if you want a second pair of eyes to look at this.

tlrx added 3 commits May 30, 2017 11:54
This commit adds a new bg_count field to the REST response of
SignificantTerms aggregations. Similarly to the bg_count that already
exists in significant terms buckets, this new bg_count field is set at
the aggregation level and is populated with the superset size value.
@tlrx tlrx force-pushed the add-superset-size branch from 82f2b13 to 149782f Compare May 30, 2017 09:59
/**
* @return The numbers of docs in the superset (also known as the background count
* of the containing aggregation).
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change "also known as" to "ordinarily". It is possible to use a background_filter to redefine the scope of the background set you want to "diff" against.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@tlrx
Copy link
Member Author

tlrx commented Jun 1, 2017

@markharwood Do you think that this can be merged now?

@markharwood
Copy link
Contributor

LGTM!

@tlrx tlrx merged commit 528bd25 into elastic:master Jun 2, 2017
@tlrx tlrx deleted the add-superset-size branch June 2, 2017 07:52
@tlrx
Copy link
Member Author

tlrx commented Jun 2, 2017

Thanks @cbuescher and @markharwood !

tlrx added a commit that referenced this pull request Jun 2, 2017
This commit adds a new bg_count field to the REST response of
SignificantTerms aggregations. Similarly to the bg_count that already
exists in significant terms buckets, this new bg_count field is set at
the aggregation level and is populated with the superset size value.
jasontedor added a commit to s12v/elasticsearch that referenced this pull request Jun 2, 2017
* master: (62 commits)
  Handle already closed while filling gaps
  [DOCS] Clarify behaviour of scripted-metric arg with empty parent buckets
  [DOCS] Clarify connections and gateway nodes selection in cross cluster search docs (elastic#24859)
  Java api: Remove unneeded getTookInMillis method (elastic#23923)
  Adds nodes usage API to monitor usages of actions (elastic#24169)
  Add superset size to Significant Term REST response (elastic#24865)
  Disallow multiple parent-join fields per mapping (elastic#25002)
  [Test] Remove unused test resources in core (elastic#25011)
  Scripting: Add optional context parameter to put stored script requests (elastic#25014)
  Extract a common base class for scroll executions (elastic#24979)
  Build: fix version sorting
  Build: Move verifyVersions to new branchConsistency task (elastic#25009)
  Add backwards compatibility indices
  Build: improve verifyVersions error message (elastic#25006)
  Add version 5.4.2 constant
  Docs: More search speed advices. (elastic#24802)
  Add version 5.3.3 constant
  Reorganize docs of global ordinals. (elastic#24982)
  Provide the TransportRequest during validation of a search context (elastic#24985)
  [TEST] fix SearchIT assertion to also accept took set to 0
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants