-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Significant_terms aggregation #5146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our examples shouldn't encourage using query_string for this kind of queries but rely on the query DSL instead?
|
This looks good to me but I think we should try to share more code with terms aggregations before merging this in. I'm wondering if we could just remove the long aggregator (it would still work on longs but through their string representation) and make the significant terms aggregator extend the string terms aggregator and just override |
|
Would swapping longs for their string representations mean a lot more RAM/net traffic? There can be a lot of "candidate" buckets generated before final reductions are made. |
|
Indeed it would. As a trade-off, maybe we could try to share code with the long terms aggregator in a similar way to what I described for the string terms aggregator? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks identical to the doRelease impl of the parent class?
|
@markharwood It looks good to me in general, I think there is an hppc hash table that should be replaced with a BytesRefHash to save object creations (we don't have anything against hppc structures but try to avoid to have numbers of object creations that are linear with the number of unique values as the latter can be quite high). Other than that, there are a few lines that are missing spaces around equals signs or at the beginning of single-line comments, it would be nice if you could try to clean it up. |
|
Thanks for review, Adrien. |
…ather than merely popular in a set. Significance is related to the changes in document frequency observed between everyday use in the corpus and frequency observed in the result set. The asciidocs include extensive details on the applications of this feature.
…ificantTerms use new readSize and writeSize methods in base class. Also added support and tests for unmapped indices,
…om (Long/String)TermsAggregators, changed visibility of member variables to allow for this. Some minor documentation changes
…efHash + IntArray instead of hpcc collection. Code formatting changes.
|
Thanks Mark, the fix looks good. My understanding is that this cache is useful when using the significant terms aggregation as a sub-aggregation, maybe it should be disabled when there is no parent aggregation? Or would it still be useful? Another thought I had while reading this PR is that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change this to a long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be getBucketByKey(String key) (overriding the one in MultiBucketsAggregation)
|
Done... well.. first of all... this is just awesome!! I left some comments, but overall it looks good! |
|
Is there a circumstance where that would mask a release failure if an exception is thrown by Releasables.release()? |
|
I don't think so, Releasables.release() will throw the first exception that it got while trying to release the provided Releasables. |
|
OK. |
…xamples changed to lowercase, base class change to SignificantTerms, code formatting, parser parses “format” field. I’ve added a “TODO” comment for the refactoring suggestion here: #5146 (comment) - as this should be considered as part of future changes
|
+1 to push |
…ather than merely popular in a set. Significance is related to the changes in document frequency observed between everyday use in the corpus and frequency observed in the result set. The asciidocs include extensive details on the applications of this feature. Closes #5146
A new aggregation that identifies terms that are significant rather than merely popular in a result set.
Significance is related to the changes in document frequency observed between everyday use in the corpus and frequency observed in the result set. The asciidocs include extensive details on the various applications of this feature.