Introduce combined_fields query
#71213
Merged: jtibshirani merged 10 commits into elastic:master from jtibshirani:combined-fields-query on Apr 14, 2021
924057c Copy CombinedFieldQuery from Lucene 8.x. (jtibshirani)
6f00796 Move ZeroTermsQuery to its own class. (jtibshirani)
dc3540c Add new combined_fields query. (jtibshirani)
bad2253 Make sure highlighting works. (jtibshirani)
acbba64 Add documentation. (jtibshirani)
8a7cd01 Improve test cases. (jtibshirani)
2c58a05 Correct field boost validation. (jtibshirani)
4a73c79 Improve documentation. (jtibshirani)
d8cdf79 Merge remote-tracking branch 'upstream/master' into combined-fields-q… (jtibshirani)
38dc1de Fix documentation. (jtibshirani)
185 changes: 185 additions & 0 deletions
docs/reference/query-dsl/combined-fields-query.asciidoc
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,185 @@ | ||
| [[query-dsl-combined-fields-query]] | ||
| === Combined fields | ||
| ++++ | ||
| <titleabbrev>Combined fields</titleabbrev> | ||
| ++++ | ||
|
|
||
| The `combined_fields` query supports searching multiple text fields as if their | ||
| contents had been indexed into one combined field. It takes a term-centric | ||
| view of the query: first it analyzes the query string into individual terms, | ||
| then looks for each term in any of the fields. This query is particularly | ||
| useful when a match could span multiple text fields, for example the `title`, | ||
| `abstract` and `body` of an article: | ||
|
|
||
| [source,console] | ||
| -------------------------------------------------- | ||
| GET /_search | ||
| { | ||
| "query": { | ||
| "combined_fields" : { | ||
| "query": "database systems", | ||
| "fields": [ "title", "abstract", "body"], | ||
| "operator": "and" | ||
| } | ||
| } | ||
| } | ||
| -------------------------------------------------- | ||
|
|
||
| The `combined_fields` query takes a principled approach to scoring based on the | ||
| simple BM25F formula described in | ||
| http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf[The Probabilistic Relevance Framework: BM25 and Beyond]. | ||
| When scoring matches, the query combines term and collection statistics across | ||
| fields. This allows it to score each match as if the specified fields had been | ||
| indexed into a single combined field. (Note that this is a best attempt -- | ||
| `combined_fields` makes some approximations and scores will not obey this | ||
| model perfectly.) | ||
|
|
||
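To make the combined-field scoring model concrete, here is a minimal Python sketch. It is illustrative only: it is not part of this change and is not Lucene's `CombinedFieldQuery` implementation. It assumes fields that are already tokenized, the common BM25 defaults `k1 = 1.2` and `b = 0.75`, and a hypothetical helper named `bm25f_score`. Term frequency, document length, and document frequency are all computed over a synthetic combined field, which is the idea the paragraph above describes.

```python
import math


def bm25f_score(term, doc, docs, weights, k1=1.2, b=0.75):
    """Score one term for one document under a simplified BM25F model.

    doc and every entry of docs map field name -> list of tokens.
    weights maps field name -> boost. All names here are hypothetical.
    """

    def combined_tf(d):
        # Term frequency in the synthetic combined field.
        return sum(w * d.get(f, []).count(term) for f, w in weights.items())

    def combined_len(d):
        # Length of the synthetic combined field.
        return sum(w * len(d.get(f, [])) for f, w in weights.items())

    tf = combined_tf(doc)
    if tf == 0:
        return 0.0

    n_docs = len(docs)
    # A document counts toward document frequency if any field has the term.
    doc_freq = sum(1 for d in docs if combined_tf(d) > 0)
    avg_len = sum(combined_len(d) for d in docs) / n_docs

    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * combined_len(doc) / avg_len)
    return idf * tf / (tf + norm)


# Toy collection (made-up data).
docs = [
    {"title": ["database", "systems"], "body": ["intro"]},
    {"title": ["networks"], "body": ["database"]},
]
weights = {"title": 2, "body": 1}
scores = [bm25f_score("database", d, docs, weights) for d in docs]
```

In this toy collection the first document scores higher for `database` because the term sits in the boosted `title` field. The real query makes further approximations that this sketch does not model.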
| [WARNING] | ||
| .Field number limit | ||
| =================================================== | ||
| There is a limit on the number of fields that can be queried at once. It is | ||
| defined by the `indices.query.bool.max_clause_count` <<search-settings,search setting>>, | ||
| which defaults to 1024. | ||
| =================================================== | ||
|
|
||
| ==== Per-field boosting | ||
|
|
||
| Individual fields can be boosted with the caret (`^`) notation: | ||
|
|
||
| [source,console] | ||
| -------------------------------------------------- | ||
| GET /_search | ||
| { | ||
| "query": { | ||
| "combined_fields" : { | ||
| "query" : "distributed consensus", | ||
| "fields" : [ "title^2", "body" ] | ||
| } | ||
| } | ||
| } | ||
| -------------------------------------------------- | ||
|
|
||
| Field boosts are interpreted according to the combined field model. For example, | ||
| if the `title` field has a boost of 2, the score is calculated as if each term | ||
| in the title appeared twice in the synthetic combined field. | ||
|
|
||
| NOTE: The `combined_fields` query requires that field boosts are greater than | ||
| or equal to 1.0. Field boosts are allowed to be fractional. | ||
|
|
||
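The caret notation and the lower bound on boosts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the parser or validation code added in this change: `parse_field_boosts` is a hypothetical helper, and the error message is made up.

```python
def parse_field_boosts(fields):
    """Split entries such as "title^2" into {field_name: boost}.

    Hypothetical helper for illustration; not Elasticsearch's parser.
    """
    boosts = {}
    for entry in fields:
        name, sep, raw = entry.partition("^")
        # A field without a caret gets the neutral boost of 1.0.
        boost = float(raw) if sep else 1.0
        if boost < 1.0:
            raise ValueError(
                f"field boost for [{name}] must be >= 1.0, got {boost}"
            )
        boosts[name] = boost
    return boosts


boosts = parse_field_boosts(["title^2", "body"])
```

Fractional boosts such as `title^1.5` pass the check, while `title^0.5` is rejected.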
| [[combined-field-top-level-params]] | ||
| ==== Top-level parameters for `combined_fields` | ||
|
|
||
| `fields`:: | ||
| (Required, array of strings) List of fields to search. Field wildcard patterns | ||
| are allowed. Only <<text,`text`>> fields are supported, and they must all have | ||
| the same search <<analyzer,`analyzer`>>. | ||
|
|
||
| `query`:: | ||
| + | ||
| -- | ||
| (Required, string) Text to search for in the provided `fields`. | ||
|
|
||
| The `combined_fields` query <<analysis,analyzes>> the provided text before | ||
| performing a search. | ||
| -- | ||
|
|
||
| `auto_generate_synonyms_phrase_query`:: | ||
| + | ||
| -- | ||
| (Optional, Boolean) If `true`, <<query-dsl-match-query-phrase,match phrase>> | ||
| queries are automatically created for multi-term synonyms. Defaults to `true`. | ||
|
|
||
| See <<query-dsl-match-query-synonyms,Use synonyms with match query>> for an | ||
| example. | ||
| -- | ||
|
|
||
| `operator`:: | ||
| + | ||
| -- | ||
| (Optional, string) Boolean logic used to interpret text in the `query` value. | ||
| Valid values are: | ||
|
|
||
| `or` (Default):: | ||
| For example, a `query` value of `database systems` is interpreted as `database | ||
| OR systems`. | ||
|
|
||
| `and`:: | ||
| For example, a `query` value of `database systems` is interpreted as `database | ||
| AND systems`. | ||
| -- | ||
|
|
||
| `minimum_should_match`:: | ||
| + | ||
| -- | ||
| (Optional, string) Minimum number of clauses that must match for a document to | ||
| be returned. See the <<query-dsl-minimum-should-match, `minimum_should_match` | ||
| parameter>> for valid values and more information. | ||
| -- | ||
|
|
||
| `zero_terms_query`:: | ||
| + | ||
| -- | ||
| (Optional, string) Indicates whether no documents are returned if the `analyzer` | ||
| removes all tokens, such as when using a `stop` filter. Valid values are: | ||
|
|
||
| `none` (Default):: | ||
| No documents are returned if the `analyzer` removes all tokens. | ||
|
|
||
| `all`:: | ||
| Returns all documents, similar to a <<query-dsl-match-all-query,`match_all`>> | ||
| query. | ||
|
|
||
| See <<query-dsl-match-query-zero>> for an example. | ||
| -- | ||
|
|
||
| ==== Comparison to `multi_match` query | ||
|
|
||
| The `combined_fields` query provides a principled way of matching and scoring | ||
| across multiple <<text, `text`>> fields. To support this, it requires that all | ||
| fields have the same search <<analyzer,`analyzer`>>. | ||
|
|
||
| If you want a single query that handles fields of different types like | ||
| keywords or numbers, then the <<query-dsl-multi-match-query,`multi_match`>> | ||
| query may be a better fit. It supports both text and non-text fields, and | ||
| accepts text fields that do not share the same analyzer. | ||
|
|
||
| The main `multi_match` modes `best_fields` and `most_fields` take a | ||
| field-centric view of the query. In contrast, `combined_fields` is | ||
| term-centric: `operator` and `minimum_should_match` are applied per-term, | ||
| instead of per-field. Concretely, a query like | ||
|
|
||
| [source,console] | ||
| -------------------------------------------------- | ||
| GET /_search | ||
| { | ||
| "query": { | ||
| "combined_fields" : { | ||
| "query": "database systems", | ||
| "fields": [ "title", "abstract"], | ||
| "operator": "and" | ||
| } | ||
| } | ||
| } | ||
| -------------------------------------------------- | ||
|
|
||
| is executed as | ||
|
|
||
| +(combined("database", fields:["title", "abstract"])) | ||
| +(combined("systems", fields:["title", "abstract"])) | ||
|
|
||
| In other words, each term must be present in at least one field for a | ||
| document to match. | ||
|
|
||
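The difference between term-centric and field-centric matching can be shown with a small Python sketch. It is a simplification for illustration, not how either query is executed: `analyze` stands in for a real analyzer by lowercasing and splitting on whitespace, and the function names and sample document are hypothetical.

```python
def analyze(text):
    # Stand-in for a real analyzer: lowercase and split on whitespace.
    return text.lower().split()


def term_centric_match(query, doc, fields, operator="or"):
    """combined_fields-style: each term may be found in any of the fields."""
    hits = [
        any(term in analyze(doc.get(f, "")) for f in fields)
        for term in analyze(query)
    ]
    return all(hits) if operator == "and" else any(hits)


def field_centric_match(query, doc, fields, operator="or"):
    """best_fields/most_fields-style: the operator applies within one field."""
    terms = analyze(query)
    combine = all if operator == "and" else any
    return any(
        combine(term in analyze(doc.get(f, "")) for term in terms)
        for f in fields
    )


# Made-up document: the two query terms live in different fields.
doc = {"title": "Database internals", "abstract": "A survey of storage systems"}
fields = ["title", "abstract"]

term_centric = term_centric_match("database systems", doc, fields, "and")
field_centric = field_centric_match("database systems", doc, fields, "and")
```

With `operator` set to `and`, the term-centric check matches this document because `database` is in `title` and `systems` is in `abstract`. The field-centric check does not match, since no single field contains both terms.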
| The `cross_fields` `multi_match` mode also takes a term-centric approach and | ||
| applies `operator` and `minimum_should_match` per-term. The main advantage of | ||
| `combined_fields` over `cross_fields` is its robust and interpretable approach | ||
| to scoring based on the BM25F algorithm. | ||
|
|
||
| [NOTE] | ||
| .Custom similarities | ||
| =================================================== | ||
| The `combined_fields` query currently only supports the `BM25` similarity | ||
| (which is the default unless a <<index-modules-similarity, custom similarity>> | ||
| is configured). <<similarity, Per-field similarities>> are also not allowed. | ||
| Using `combined_fields` in either of these cases will result in an error. | ||
| =================================================== | ||
42 changes: 42 additions & 0 deletions
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/360_combined_fields.yml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| setup: | ||
| - do: | ||
| indices.create: | ||
| index: test | ||
| body: | ||
| mappings: | ||
| properties: | ||
| title: | ||
| type: text | ||
| abstract: | ||
| type: text | ||
| body: | ||
| type: text | ||
|
|
||
| - do: | ||
| index: | ||
| index: test | ||
| id: 1 | ||
| body: | ||
| title: "Time, Clocks and the Ordering of Events in a Distributed System" | ||
| abstract: "The concept of one event happening before another..." | ||
| body: "The concept of time is fundamental to our way of thinking..." | ||
| refresh: true | ||
|
|
||
| --- | ||
| "Test combined_fields query": | ||
| - skip: | ||
| version: " - 7.99.99" | ||
| reason: "combined fields query is not yet backported" | ||
| - do: | ||
| search: | ||
| index: test | ||
| body: | ||
| query: | ||
| combined_fields: | ||
| query: "time event" | ||
| fields: ["abstract", "body"] | ||
| operator: "and" | ||
|
|
||
| - match: { hits.total.value: 1 } | ||
| - match: { hits.hits.0._id: "1" } | ||
|
|