Skip to content

Conversation

@markharwood
Copy link
Contributor

A new aggregation that identifies terms that are significant rather than merely popular in a result set.

Significance is related to the changes in document frequency observed between everyday use in the corpus and frequency observed in the result set. The asciidocs include extensive details on the various applications of this feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our examples shouldn't encourage using query_string for this kind of queries but rely on the query DSL instead?

@jpountz
Copy link
Contributor

jpountz commented Feb 19, 2014

This looks good to me but I think we should try to share more code with terms aggregations before merging this in. I'm wondering if we could just remove the long aggregator (it would still work on longs but through their string representation) and make the significant terms aggregator extend the string terms aggregator and just override build(empty)Aggregation.

@markharwood
Copy link
Contributor Author

Would swapping longs for their string representations mean a lot more RAM/net traffic? There can be a lot of "candidate" buckets generated before final reductions are made.

@jpountz
Copy link
Contributor

jpountz commented Feb 19, 2014

Indeed it would. As a trade-off, maybe we could try to share code with the long terms aggregator in a similar way to what I described for the string terms aggregator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks identical to the doRelease impl of the parent class?

@jpountz
Copy link
Contributor

jpountz commented Mar 10, 2014

@markharwood It looks good to me in general, I think there is an hppc hash table that should be replaced with a BytesRefHash to save object creations (we don't have anything against hppc structures but try to avoid to have numbers of object creations that are linear with the number of unique values as the latter can be quite high). Other than that, there are a few lines that are missing spaces around equals signs or at the beginning of single-line comments, it would be nice if you could try to clean it up.

@markharwood
Copy link
Contributor Author

Thanks for review, Adrien.

…ather than merely popular in a set.

Significance is related to the changes in document frequency observed between everyday use in the corpus and
frequency observed in the result set. The asciidocs include extensive details on the applications of this feature.
…ificantTerms use new readSize and writeSize methods in base class. Also added support and tests for unmapped indices,
…om (Long/String)TermsAggregators, changed visibility of member variables to allow for this. Some minor documentation changes
…efHash + IntArray instead of hpcc collection. Code formatting changes.
@jpountz
Copy link
Contributor

jpountz commented Mar 11, 2014

Thanks Mark, the fix looks good. My understanding is that this cache is useful when using the significant terms aggregation as a sub-aggregation, maybe it should be disabled when there is no parent aggregation? Or would it still be useful?

Another thought I had while reading this PR is that buildAggregation can do lots of random seeks in the terms dictionary. It might be interesting to explore how we can make it more sequential in a future pull request (no need to delay this change).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we change this to a long?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be getBucketByKey(String key) (overriding the one in MultiBucketsAggregation)

@uboness
Copy link
Contributor

uboness commented Mar 12, 2014

Done... well.. first of all... this is just awesome!! I left some comments, but overall it looks good!

@markharwood
Copy link
Contributor Author

Is there a circumstance where that would mask a release failure if an exception is thrown by Releasables.release()?

@jpountz
Copy link
Contributor

jpountz commented Mar 12, 2014

I don't think so, Releasables.release() will throw the first exception that it got while trying to release the provided Releasables.

@markharwood
Copy link
Contributor Author

OK.
Thanks for the review, @uboness, starting work on your changes.

…xamples changed to lowercase, base class change to SignificantTerms, code formatting, parser parses “format” field.

I’ve added a “TODO” comment for the refactoring suggestion here: #5146 (comment) - as this should be considered as part of future changes
@jpountz
Copy link
Contributor

jpountz commented Mar 13, 2014

+1 to push

markharwood added a commit that referenced this pull request Mar 14, 2014
…ather than merely popular in a set.

Significance is related to the changes in document frequency observed between everyday use in the corpus and
frequency observed in the result set. The asciidocs include extensive details on the applications of this feature.

Closes #5146
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants