Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Jun 9, 2020

This moves the code to look up significance heuristics information like
background frequency and superset size out of
SignificantTermsAggregatorFactory and into its own home so that it is
easier to pass around. This will:

  1. Make us feel better about ourselves for not passing around the
    factory, which is really supposed to be a throw away thing.
  2. Abstract the significance lookup logic so we can reuse it for the
    significant_text aggregation.
  3. Make if very simple to cache the background frequencies which should
    speed up when the agg is a sub-agg. We had done this for numerics
    but not string-shaped significant terms.

This moves the code to look up significance heuristics information like
background frequency and superset size out of
`SignificantTermsAggregatorFactory` and into its own home so that it is
easier to pass around. This will:
1. Make us feel better about ourselves for not passing around the
   factory, which is really *supposed* to be a throw away thing.
2. Abstract the significance lookup logic so we can reuse it for the
   `significant_text` aggregation.
3. Make if very simple to cache the background frequencies which should
   speed up when the agg is a sub-agg. We had done this for numerics
   but not string-shaped significant terms.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 9, 2020
@nik9000 nik9000 requested a review from polyfractal June 9, 2020 20:12
@nik9000
Copy link
Member Author

nik9000 commented Jun 10, 2020

This improves performance of string significant_terms by about 15%. before after

@nik9000
Copy link
Member Author

nik9000 commented Jun 10, 2020

This improves performance of string significant_terms by about 15%. before after

And for that reason I've marked this an >enhancement

Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

}

/**
* Get the background frequency of a {@link BytesRef} term.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this javadoc correct? Looks like it's operating on a long term below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't! I fixed it.

@nik9000 nik9000 merged commit 62e2d85 into elastic:master Jun 10, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jun 10, 2020
This moves the code to look up significance heuristics information like
background frequency and superset size out of
`SignificantTermsAggregatorFactory` and into its own home so that it is
easier to pass around. This will:
1. Make us feel better about ourselves for not passing around the
   factory, which is really *supposed* to be a throw away thing.
2. Abstract the significance lookup logic so we can reuse it for the
   `significant_text` aggregation.
3. Make if very simple to cache the background frequencies which should
   speed up when the agg is a sub-agg. We had done this for numerics
   but not string-shaped significant terms.
nik9000 added a commit that referenced this pull request Jun 12, 2020
This moves the code to look up significance heuristics information like
background frequency and superset size out of
`SignificantTermsAggregatorFactory` and into its own home so that it is
easier to pass around. This will:
1. Make us feel better about ourselves for not passing around the
   factory, which is really *supposed* to be a throw away thing.
2. Abstract the significance lookup logic so we can reuse it for the
   `significant_text` aggregation.
3. Make if very simple to cache the background frequencies which should
   speed up when the agg is a sub-agg. We had done this for numerics
   but not string-shaped significant terms.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.9.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants