Significant_terms aggregation #5146

markharwood · 2014-02-17T15:58:52Z

A new aggregation that identifies terms that are significant rather than merely popular in a result set.

Significance is related to the changes in document frequency observed between everyday use in the corpus and frequency observed in the result set. The asciidocs include extensive details on the various applications of this feature.
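To make the idea of "change in document frequency" concrete, here is a minimal sketch of one possible scoring heuristic. The thread does not show the scoring formula this PR uses, so the formula (absolute change in frequency multiplied by relative change) and every name below are illustrative assumptions.

```java
// Illustrative sketch only: names and formula are assumptions, not the PR's code.
final class SignificanceSketch {

    /**
     * subsetFreq   = documents in the result set containing the term
     * subsetSize   = documents in the result set
     * supersetFreq = documents in the whole corpus containing the term
     * supersetSize = documents in the whole corpus
     */
    static double score(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
        if (subsetFreq == 0 || subsetSize == 0 || supersetFreq == 0 || supersetSize == 0) {
            return 0.0; // nothing to compare
        }
        double foreground = (double) subsetFreq / subsetSize;     // frequency in the result set
        double background = (double) supersetFreq / supersetSize; // everyday frequency in the corpus
        double absoluteChange = foreground - background;
        if (absoluteChange <= 0) {
            return 0.0; // no more common in the results than in the corpus
        }
        double relativeChange = foreground / background;
        return absoluteChange * relativeChange;
    }

    public static void main(String[] args) {
        // A rare term that is concentrated in the result set scores high.
        System.out.println(score(50, 100, 100, 1_000_000));
        // A term that is merely popular everywhere scores zero.
        System.out.println(score(90, 100, 900_000, 1_000_000));
    }
}
```

A term that appears in 90% of the results but also in 90% of the corpus is popular without being significant, which is the distinction the description draws.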

jpountz · 2014-02-17T16:05:58Z

docs/reference/search/aggregations/bucket/significantterms-aggregation.asciidoc

I think our examples shouldn't encourage using query_string for this kind of query but should rely on the query DSL instead?

jpountz · 2014-02-19T11:28:25Z

This looks good to me but I think we should try to share more code with terms aggregations before merging this in. I'm wondering if we could just remove the long aggregator (it would still work on longs but through their string representation) and make the significant terms aggregator extend the string terms aggregator and just override build(empty)Aggregation.

markharwood · 2014-02-19T11:36:07Z

Would swapping longs for their string representations mean a lot more RAM/net traffic? There can be a lot of "candidate" buckets generated before final reductions are made.
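As a rough illustration of the concern, the sketch below compares the payload size of a long key with the size of its decimal string form. It counts raw bytes only and ignores object headers and whatever wire format the aggregation actually uses, neither of which is described in this thread. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: payload bytes, not a measurement of the real aggregator.
final class KeySizeSketch {

    /** Bytes needed to hold the decimal string form of a long as UTF-8. */
    static int stringKeyBytes(long value) {
        return Long.toString(value).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Total payload bytes if every key is kept as a primitive long. */
    static long totalLongBytes(long[] keys) {
        return (long) keys.length * Long.BYTES;
    }

    /** Total payload bytes if every key is kept as its string representation. */
    static long totalStringBytes(long[] keys) {
        long total = 0;
        for (long key : keys) {
            total += stringKeyBytes(key);
        }
        return total;
    }

    public static void main(String[] args) {
        // Millisecond timestamps are a typical wide numeric key.
        long[] keys = {1392652732000L, 1392652733000L, 1392652734000L};
        System.out.println("as longs:   " + totalLongBytes(keys) + " bytes");
        System.out.println("as strings: " + totalStringBytes(keys) + " bytes");
    }
}
```

Small values are shorter as strings than as longs, so the real cost depends on the data. With many candidate buckets of wide values, the string form costs more even before per-object overhead is counted.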

jpountz · 2014-02-19T13:40:22Z

Indeed it would. As a trade-off, maybe we could try to share code with the long terms aggregator in a similar way to what I described for the string terms aggregator?

jpountz · 2014-02-20T08:14:04Z

...g/elasticsearch/search/aggregations/bucket/significant/SignificantStringTermsAggregator.java

This looks identical to the doRelease impl of the parent class?

jpountz · 2014-03-10T19:38:34Z

@markharwood It looks good to me in general. I think there is an hppc hash table that should be replaced with a BytesRefHash to save object creations (we don't have anything against hppc structures, but we try to avoid a number of object creations that is linear in the number of unique values, as the latter can be quite high). Other than that, a few lines are missing spaces around equals signs or at the beginning of single-line comments; it would be nice if you could clean those up.
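The idea behind the suggestion can be sketched with a toy structure: all term bytes live in one shared byte array, each term gets a dense integer id, and per-term values sit in a parallel int array. This is not Lucene's BytesRefHash or the code in this PR. It is a simplified, hypothetical stand-in meant only to show why the number of allocated objects stays roughly constant as the number of unique terms grows.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: a toy byte-pool hash, not Lucene's BytesRefHash.
final class PooledTermCounts {
    private byte[] pool = new byte[64];  // every term's bytes, stored back to back
    private int poolUsed = 0;
    private int[] offsets = new int[8];  // where term id i starts in the pool
    private int[] lengths = new int[8];  // how many bytes term id i occupies
    private int[] counts = new int[8];   // value associated with term id i
    private int[] slots = new int[16];   // open-addressing table holding id + 1 (0 = empty)
    private int size = 0;

    /** Adds delta to the term's counter and returns the term's dense id. */
    int add(byte[] term, int delta) {
        int id = find(term);
        if (id < 0) {
            id = insert(term);
        }
        counts[id] += delta;
        return id;
    }

    /** Returns the term's counter, or 0 if the term was never added. */
    int get(byte[] term) {
        int id = find(term);
        return id < 0 ? 0 : counts[id];
    }

    int size() {
        return size;
    }

    private int find(byte[] term) {
        int mask = slots.length - 1;
        for (int s = hash(term, 0, term.length) & mask; slots[s] != 0; s = (s + 1) & mask) {
            int id = slots[s] - 1;
            if (sameBytes(id, term)) {
                return id;
            }
        }
        return -1;
    }

    private int insert(byte[] term) {
        if ((size + 1) * 2 > slots.length) { // keep the table at most half full
            slots = new int[slots.length * 2];
            for (int id = 0; id < size; id++) {
                place(id);
            }
        }
        if (size == offsets.length) { // grow the parallel arrays together
            offsets = Arrays.copyOf(offsets, size * 2);
            lengths = Arrays.copyOf(lengths, size * 2);
            counts = Arrays.copyOf(counts, size * 2);
        }
        if (poolUsed + term.length > pool.length) {
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, poolUsed + term.length));
        }
        System.arraycopy(term, 0, pool, poolUsed, term.length);
        int id = size++;
        offsets[id] = poolUsed;
        lengths[id] = term.length;
        poolUsed += term.length;
        place(id);
        return id;
    }

    private void place(int id) {
        int mask = slots.length - 1;
        int s = hash(pool, offsets[id], lengths[id]) & mask;
        while (slots[s] != 0) {
            s = (s + 1) & mask;
        }
        slots[s] = id + 1;
    }

    private boolean sameBytes(int id, byte[] term) {
        if (lengths[id] != term.length) {
            return false;
        }
        for (int i = 0; i < term.length; i++) {
            if (pool[offsets[id] + i] != term[i]) {
                return false;
            }
        }
        return true;
    }

    private static int hash(byte[] buf, int off, int len) {
        int h = 1;
        for (int i = 0; i < len; i++) {
            h = 31 * h + buf[off + i];
        }
        return h;
    }

    public static void main(String[] args) {
        PooledTermCounts table = new PooledTermCounts();
        table.add("flu".getBytes(StandardCharsets.UTF_8), 3);
        table.add("flu".getBytes(StandardCharsets.UTF_8), 2);
        System.out.println(table.get("flu".getBytes(StandardCharsets.UTF_8)));
    }
}
```

By contrast, a map keyed by String or boxed objects allocates at least one object per unique term, which is the linear growth the review comment wants to avoid.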

markharwood · 2014-03-11T11:44:16Z

Thanks for the review, Adrien.

jpountz · 2014-03-11T21:18:00Z

Thanks Mark, the fix looks good. My understanding is that this cache is useful when using the significant terms aggregation as a sub-aggregation. Maybe it should be disabled when there is no parent aggregation? Or would it still be useful?

Another thought I had while reading this PR is that buildAggregation can do lots of random seeks in the terms dictionary. It might be interesting to explore how we can make it more sequential in a future pull request (no need to delay this change).

uboness · 2014-03-11T23:25:35Z

...org/elasticsearch/search/aggregations/bucket/significant/SignificantLongTermsAggregator.java

can we change this to a long?

uboness · 2014-03-11T23:35:33Z

src/main/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTerms.java

should be getBucketByKey(String key) (overriding the one in MultiBucketsAggregation)

uboness · 2014-03-12T00:10:19Z

Done... well.. first of all... this is just awesome!! I left some comments, but overall it looks good!

markharwood · 2014-03-12T10:27:26Z

Is there a circumstance where that would mask a release failure if an exception is thrown by Releasables.release()?

jpountz · 2014-03-12T10:29:45Z

I don't think so: Releasables.release() will throw the first exception that it got while trying to release the provided Releasables.
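The behaviour described here can be sketched as a "release everything, then rethrow the first failure" helper. This is not the Elasticsearch implementation of Releasables.release(). The Resource interface and releaseAll method are hypothetical names, and the assumption that later resources are still released after an earlier failure is a reading of the comment above, not something the thread states outright.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, not Elasticsearch's Releasables.
final class ReleaseSketch {

    /** Hypothetical stand-in for a releasable resource. */
    interface Resource {
        void release();
    }

    /** Attempts to release every resource, then rethrows the first failure, if any. */
    static void releaseAll(Resource... resources) {
        RuntimeException first = null;
        for (Resource resource : resources) {
            if (resource == null) {
                continue;
            }
            try {
                resource.release();
            } catch (RuntimeException e) {
                if (first == null) {
                    first = e; // remember only the first failure
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        List<String> released = new ArrayList<>();
        try {
            releaseAll(
                () -> released.add("a"),
                () -> { throw new IllegalStateException("first failure"); },
                () -> released.add("c"));
        } catch (IllegalStateException e) {
            System.out.println("rethrown: " + e.getMessage());
        }
        System.out.println("released: " + released);
    }
}
```

Under this pattern a failure is surfaced to the caller after the remaining resources have had their chance to release.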

markharwood · 2014-03-12T10:32:48Z

OK.
Thanks for the review, @uboness, starting work on your changes.

…xamples changed to lowercase, base class change to SignificantTerms, code formatting, parser parses “format” field. I’ve added a “TODO” comment for the refactoring suggestion here: #5146 (comment) - as this should be considered as part of future changes

jpountz · 2014-03-13T14:39:38Z

+1 to push

…ather than merely popular in a set. Significance is related to the changes in document frequency observed between everyday use in the corpus and frequency observed in the result set. The asciidocs include extensive details on the applications of this feature. Closes #5146

jpountz reviewed Feb 17, 2014

jpountz reviewed Feb 20, 2014

markharwood added 5 commits March 11, 2014 17:22

Updated following @jpountz review: docs tidy up and made InternalSign…

08efb7e

…ificantTerms use new readSize and writeSize methods in base class. Also added support and tests for unmapped indices,

Changed Significant(Long/String)TermsAggregator classes to inherit fr…

b224a17

…om (Long/String)TermsAggregators, changed visibility of member variables to allow for this. Some minor documentation changes

Removed redundant code, name changes to fields in results JSON

fae1f42

Switched docFreq cache in SignificantTermsAggregatorFactory to BytesR…

9387416

…efHash + IntArray instead of hpcc collection. Code formatting changes.

uboness reviewed Mar 11, 2014

Rebased on latest master and added related changes to memory management

ccc9614

markharwood added 2 commits March 12, 2014 17:41

Added “experimental” notices to documentation

bcf27ad

markharwood added feature labels Mar 14, 2014

markharwood closed this in 767bef0 Mar 14, 2014

clintongormley added the :Analytics/Aggregations Aggregations label Jun 6, 2015

This was referenced May 23, 2017

Add parsing to Significant Terms aggregations #24682

Merged

Add superset size to Significant Term REST response #24865

Merged

4 participants