Add superset size to Significant Term REST response #24865
Conversation
@clintongormley @markharwood I'll be happy to have your opinion on this.
cbuescher left a comment
@tlrx I took a look because I was curious and left two questions. However, I might not be the best qualified to approve the rest of this PR, although it looks good to me.
What's happening with `expectedSigTerm.getDocCount()`? I might be missing something; in that case a comment might be good.
By definition, `getDocCount()` returns the subset df value. The test used to check that `getSubsetDf()` returns the same value for both the internal and parsed aggregations, but it didn't test that `getDocCount()` returns the subset df as the internal implementation does. So I added this assertion.
Thanks for the clarification, that makes sense. Can you maybe add a comparison between either `expectedSigTerm.getDocCount()` and `actualSigTerm.getSubsetDf()`, or `expectedSigTerm.getDocCount()` and `expectedSigTerm.getSubsetDf()`? Then the explanation you give here would be clearer from the test.
Sure!
I don't understand this explanation anymore. How is 151 related to a maximum of 200 documents and 5 shards?
Indeed, it does not mean anything. I removed the "because we asked for a maximum of 200 from an index with 5 shards" part because I think it is wrong. The returned response has `doc_count: 151` and not 1000 (see the now removed `// TESTRESPONSE[s/1000/151/]`).
Makes sense.
@cbuescher Thanks! I rebased and changed the documentation. Can you please have another look?
cbuescher left a comment
LGTM. As I said I'm not too familiar with the underlying aggregation, so you need to decide if you want a second pair of eyes to look at this.
This commit adds a new bg_count field to the REST response of SignificantTerms aggregations. Similarly to the bg_count that already exists in significant terms buckets, this new bg_count field is set at the aggregation level and is populated with the superset size value.
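For illustration, with this change a significant terms response would carry the superset size at the aggregation level, roughly like the sketch below (the aggregation name, keys, and counts are made up for this example; only the top-level `bg_count` is the new field):

```json
{
  "aggregations": {
    "significant_crime_types": {
      "doc_count": 47347,
      "bg_count": 5064554,
      "buckets": [
        {
          "key": "Bicycle theft",
          "doc_count": 3640,
          "score": 0.371,
          "bg_count": 66799
        }
      ]
    }
  }
}
```

Here the aggregation-level `doc_count` is the subset size (the foreground set matched by the query) and `bg_count` is the superset size, mirroring the fields that already exist per bucket.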
```java
/**
 * @return The numbers of docs in the superset (also known as the background count
 * of the containing aggregation).
 */
```
Maybe change "also known as" to "ordinarily". It is possible to use a background_filter to redefine the scope of the background set you want to "diff" against.
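To illustrate the point above, a request like the following (index, field, and values are hypothetical) uses a `background_filter` to narrow the background set, so the superset is no longer the whole index:

```json
{
  "query": { "match": { "city": "madrid" } },
  "aggs": {
    "tags": {
      "significant_terms": {
        "field": "tag",
        "background_filter": {
          "term": { "text": "spain" }
        }
      }
    }
  }
}
```

With such a filter, the reported superset size would be the count of documents matching the background filter rather than the index-wide count, which is why "also known as" overstates the equivalence.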
Done, thanks!
@markharwood Do you think that this can be merged now?

LGTM!

Thanks @cbuescher and @markharwood!
* master: (62 commits)
  * Handle already closed while filling gaps
  * [DOCS] Clarify behaviour of scripted-metric arg with empty parent buckets
  * [DOCS] Clarify connections and gateway nodes selection in cross cluster search docs (elastic#24859)
  * Java api: Remove unneeded getTookInMillis method (elastic#23923)
  * Adds nodes usage API to monitor usages of actions (elastic#24169)
  * Add superset size to Significant Term REST response (elastic#24865)
  * Disallow multiple parent-join fields per mapping (elastic#25002)
  * [Test] Remove unused test resources in core (elastic#25011)
  * Scripting: Add optional context parameter to put stored script requests (elastic#25014)
  * Extract a common base class for scroll executions (elastic#24979)
  * Build: fix version sorting
  * Build: Move verifyVersions to new branchConsistency task (elastic#25009)
  * Add backwards compatibility indices
  * Build: improve verifyVersions error message (elastic#25006)
  * Add version 5.4.2 constant
  * Docs: More search speed advices. (elastic#24802)
  * Add version 5.3.3 constant
  * Reorganize docs of global ordinals. (elastic#24982)
  * Provide the TransportRequest during validation of a search context (elastic#24985)
  * [TEST] fix SearchIT assertion to also accept took set to 0
  * ...
This commit adds a new `bg_count` field to the REST response of `SignificantTerms` aggregations. Similarly to the `bg_count` that already exists in significant terms buckets, this new `bg_count` field is set at the aggregation level and is populated with the superset size value.

The addition of this field allows the High Level REST client to provide implementations of `SignificantTerms` and `SignificantTerms.Bucket` with the exact same behavior as the internal implementations. Before this pull request, a significant terms aggregation didn't know about the superset size at all. The same goes for the aggregation's buckets, which threw unsupported operation exceptions because the aggregation's superset size and subset size were unknown. This PR fixes that and adds support for both fields, which are now populated at parsing time. Note that the subset size field at the bucket level could have been implemented before, but I didn't know much about how to do that. Thanks to @markharwood it is now supported.
There's a bit of history around these fields (see #5146 (comment)), and the superset size information was not added to the aggregation in the first place. I think the main argument was that it could be retrieved at query time using a Global aggregation. This argument is still valid, but adding this new field provides better support for the Significant Terms aggregation in the High Level REST Client, and it might also be useful in the future for a new chart type in Kibana.
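As a sketch of that query-time alternative (index and field names are hypothetical), a `global` aggregation placed alongside the significant terms aggregation ignores the query scope, so its `doc_count` gives the superset size:

```json
{
  "query": { "match": { "force": "transport_police" } },
  "aggs": {
    "background": { "global": {} },
    "significant_crime_types": {
      "significant_terms": { "field": "crime_type" }
    }
  }
}
```

The trade-off is an extra aggregation in every request, whereas the new `bg_count` field exposes the same value directly on the parsed response.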