-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Introduce combined_fields query
#71213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce combined_fields query
#71213
Conversation
combined_fields query for searching multiple text fieldscombined_fields query
|
Some restrictions on the query (these are explained in more detail in the docs):
The PR is large because it also includes a refactor around |
8e23cd8 to
86e09d2
Compare
* Factor out query creation to this class. * Rename to ZeroTermsQueryOption to clarify it isn't a query.
86e09d2 to
acbba64
Compare
|
Pinging @elastic/es-search (Team:Search) |
| accepts text fields that do not share the same analyzer. | ||
|
|
||
| `multi_match` takes a field-centric view of the query by default. In contrast, | ||
| `combined_fields` is term-centric: `operator` and `minimum_should_match` are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It per term application of operator and minimum_should_match is how combined_fields is superior to cross_fields which is also considered term-centric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, cross_fields is also able to apply operator and minimum_should_match in a term-centric way. The main benefit of combined_fields is the robust and understandable scoring algorithm.
I tried to update this section to make it clearer:
- When mentioning field-centric scoring, make it clear that we're referring to
best_fieldsandmost_fields - Add a sentence comparing to
cross_fieldsand mentioning the benefit
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/360_combined_fields.yml
Outdated
Show resolved
Hide resolved
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/360_combined_fields.yml
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/CombinedFieldsQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/CombinedFieldsQueryBuilder.java
Outdated
Show resolved
Hide resolved
...lClusterTest/java/org/elasticsearch/search/fetch/subphase/highlight/HighlighterSearchIT.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/query/CombinedFieldsQueryParsingTests.java
Show resolved
Hide resolved
mayya-sharipova
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani Thanks, this a great addition of a new query combined_fields. I left a couple of comments with a main comment to make a little bit clear in documentation what is an advantage of combined_fields query over cross_fields and other multi_match.
But overall this PR LGTM!
romseygeek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall. I wonder if we need to implement auto_generate_synonyms_phrase_query and zero_terms_query for the new query? Auto-generate in particular feels like a leftover from when we didn't really handle query-time token graphs and I think we should consider deprecating it elsewhere.
| @@ -0,0 +1,466 @@ | |||
| /* @notice | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about the @notice annotation? This is required by our build for non-Elastic licensed code so that the notices are copied: #57017. I based this off other Lucene classes that we've copied like XMoreLikeThis and IndexableBinaryStringTools.
| @@ -0,0 +1,161 @@ | |||
| /* @notice | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here?
| if (terms.isEmpty()) { | ||
| extractWeightedTerms(terms, query, 1F); | ||
| } | ||
| extractWeightedTerms(terms, query, 1F); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this has changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky, I meant to call it out explicitly! Without this fix, the plain highlighter does not work on combined_field queries. Specifically, it will only highlight the first term, and fail to highlight subsequent terms because of the terms.isEmpty() check.
My understanding is that this was added in #15516 to work around a Lucene issue. The Lucene issue was later fixed: https://issues.apache.org/jira/browse/LUCENE-7112. All the tests added in #15516 still pass with this change.
* Avoid using rest_total_hits_as_int * Streamline check in HighlighterSearchIT
|
Thank you @mayya-sharipova and @romseygeek for the reviews! I pushed commits addressing your comments.
I added some improvements to the docs section "Comparison to
To me |
I caught up with @jimczi and now have a better understanding. @romseygeek I think you're suggesting just using the default of |
romseygeek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me, thanks @jtibshirani
|
@jtibshirani Thanks for explanations and new changes, they all LGTM. |
This PR introduces a new query called `combined_fields` for searching multiple text fields. It takes a term-centric view, first analyzing the query string into individual terms, then searching for each term any of the fields as though they were one combined field. It is based on Lucene's `CombinedFieldQuery`, which takes a principled approach to scoring based on the BM25F formula. This query provides an alternative to the `cross_fields` `multi_match` mode. It has simpler behavior and a more robust approach to scoring. Addresses elastic#41106.
This PR introduces a new query called `combined_fields` for searching multiple text fields. It takes a term-centric view, first analyzing the query string into individual terms, then searching for each term any of the fields as though they were one combined field. It is based on Lucene's `CombinedFieldQuery`, which takes a principled approach to scoring based on the BM25F formula. This query provides an alternative to the `cross_fields` `multi_match` mode. It has simpler behavior and a more robust approach to scoring. Addresses #41106.
This PR introduces a new query called
combined_fieldsfor searching multipletext fields. It takes a term-centric view, first analyzing the query string
into individual terms, then searching for each term any of the fields as though
they were one combined field. It is based on Lucene's
CombinedFieldQuery,which takes a principled approach to scoring based on the BM25F formula.
This query provides an alternative to the
cross_fieldsmulti_matchmode. Ithas simpler behavior and a more robust approach to scoring.
Addresses #41106.