Skip to content

Conversation

@csoulios
Copy link
Contributor

@csoulios csoulios commented Dec 2, 2020

A while back, Lucene introduced the ability to index custom term frequencies, ie. giving users the ability to provide a numeric value that should be indexed as a term frequency rather than letting Lucene compute the term frequency by itself based on the number of occurrences of a term.

This PR modifies the _doc_count field so that it is stored as Lucene custom term frequency.

A benefit of moving to custom term frequencies is that Lucene will automatically compute global term statistics like totalTermFreq which will let us know the sum of the values of the _doc_count field across an entire shard. This could in-turn be useful to generalize optimizations to rollup indices, e.g. buckets aggregations where all documents fall into the same bucket.

Relates to #64503

@csoulios csoulios added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 v7.11.0 labels Dec 2, 2020
@csoulios csoulios requested review from jimczi and jpountz December 2, 2020 20:28
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 2, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

I wonder if we should have a special field, e.g. _term_freq that we could use whenever we want to index a field as a term frequency, like rank_feature does by always indexing into the _feature Lucene field. And this _term_freq field could have several terms including _doc_count. This way we would only use one Lucene field under the hood?

@Override
public Query existsQuery(QueryShardContext context) {
return new DocValuesFieldExistsQuery(NAME);
throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() + "] does not support exists queries");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use the same exception type as termQuery?

public class DocCountProvider {

private NumericDocValues docCountValues;
public static final int defaultValue = DocCountFieldMapper.DocCountFieldType.defaultValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be DEFAULT_VALUE since it is a constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I had it DEFAULT_VALUE, but then changed it so that is inline with DocCountFieldMapper.DocCountFieldType.defaultValue which is also a constant.

Changing it for both fields

@jpountz
Copy link
Contributor

jpountz commented Dec 2, 2020

Maybe another question: by using NumericDocValuesField, Lucene would enforce for us that this _doc_count field be single-valued. Do we need to check this explicitly now?

@csoulios
Copy link
Contributor Author

csoulios commented Dec 3, 2020

Maybe another question: by using NumericDocValuesField, Lucene would enforce for us that this _doc_count field be single-valued. Do we need to check this explicitly now?

Good catch. I added a check in the parseCreateField() method and a test.

@csoulios
Copy link
Contributor Author

csoulios commented Dec 3, 2020

I wonder if we should have a special field, e.g. _term_freq that we could use whenever we want to index a field as a term frequency, like rank_feature does by always indexing into the _feature Lucene field. And this _term_freq field could have several terms including _doc_count. This way we would only use one Lucene field under the hood?

For a start, I want to say that I don't know if having a single special field (e.g. _term_freq) for all term freq stored fields would result in any Lucene optimization, as I am not aware of the internals.

Conceptually, I would prefer _doc_count field being stored in Lucene as _doc_count and every other field should have their own Lucene field, as I believe this is more intuitive. Moreover, it would be simpler to add another term under the _doc_count field if we ever wanted to store another value for _doc_count as term freq.

@jpountz
Copy link
Contributor

jpountz commented Dec 3, 2020

Yeah the bw compat logic shouldn't be too complicated anyway if we wanted to move to a shared field in the future.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two minor comments, otherwise LGTM.


public CustomTermFreqField(String fieldName, int fieldValue) {
this(fieldName, fieldName, fieldValue);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd rather drop this constructor and require callers of this class to think about what is the right field name and term they should use


long value = parser.longValue(false);
// Check that _doc_count is a single value and not an array
if (context.doc().getField(NAME) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is a bit trappy as it does a linear scan of the fields of the document. Can you use addWithKey like BinaryFieldMapper does?

@csoulios csoulios merged commit b554d5b into elastic:master Dec 3, 2020
@csoulios csoulios deleted the doc_count_term_freq branch December 3, 2020 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v7.11.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants