From b8f778ca992e975bc7b9e58e3fa569103a13a4f3 Mon Sep 17 00:00:00 2001
From: James Rodewig
Date: Tue, 12 Nov 2019 15:14:13 -0500
Subject: [PATCH 1/4] [DOCS] Add `truncate` filter to `edge_ngram` tokenizer ex

The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.

To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length.

This adds some advisory text and updates an example snippet in the edge
ngram docs.

Closes #48956.
---
 .../tokenizers/edgengram-tokenizer.asciidoc        | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
index 8d737a5995952..8bd116d56dd75 100644
--- a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
+++ b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
@@ -209,11 +209,15 @@ The above example produces the following terms:
 ---------------------------
 
 Usually we recommend using the same `analyzer` at index time and at search
-time. In the case of the `edge_ngram` tokenizer, the advice is different.  It
+time. In the case of the `edge_ngram` tokenizer, the advice is different. It
 only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
-that partial words are available for matching in the index.  At search time,
+that partial words are available for matching in the index. At search time,
 just search for the terms the user has typed in, for instance: `Quick Fo`.
 
+We also recommend using the <<analysis-truncate-tokenfilter,truncate
+token filter>> with a custom search analyzer to truncate tokens to the `max_gram`
+character length. Otherwise, searches for longer terms could return no results.
+
 Below is an example of how to set up a field for _search-as-you-type_:
 
 [source,console]
@@ -222,6 +226,12 @@ PUT my_index
 {
   "settings": {
     "analysis": {
+      "filter": {
+        "truncate_search": {
+          "type": "truncate",
+          "length": 10
+        }
+      },
       "analyzer": {
         "autocomplete": {
           "tokenizer": "autocomplete",
@@ -230,7 +240,8 @@ PUT my_index
           ]
         },
         "autocomplete_search": {
-          "tokenizer": "lowercase"
+          "tokenizer": "lowercase",
+          "filter": "truncate_search"
         }
       },
       "tokenizer": {

From 0a7c01b2e1728da754c223b9252ffc7920678ef6 Mon Sep 17 00:00:00 2001
From: James Rodewig
Date: Wed, 13 Nov 2019 12:28:31 -0500
Subject: [PATCH 2/4] Revert "[DOCS] Add `truncate` filter to `edge_ngram` tokenizer ex"

This reverts commit b8f778ca992e975bc7b9e58e3fa569103a13a4f3.
---
 .../tokenizers/edgengram-tokenizer.asciidoc        | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
index 8bd116d56dd75..8d737a5995952 100644
--- a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
+++ b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
@@ -209,15 +209,11 @@ The above example produces the following terms:
 ---------------------------
 
 Usually we recommend using the same `analyzer` at index time and at search
-time. In the case of the `edge_ngram` tokenizer, the advice is different. It
+time. In the case of the `edge_ngram` tokenizer, the advice is different.  It
 only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
-that partial words are available for matching in the index. At search time,
+that partial words are available for matching in the index.  At search time,
 just search for the terms the user has typed in, for instance: `Quick Fo`.
 
-We also recommend using the <<analysis-truncate-tokenfilter,truncate
-token filter>> with a custom search analyzer to truncate tokens to the `max_gram`
-character length. Otherwise, searches for longer terms could return no results.
-
 Below is an example of how to set up a field for _search-as-you-type_:
 
 [source,console]
@@ -226,12 +222,6 @@ PUT my_index
 {
   "settings": {
     "analysis": {
-      "filter": {
-        "truncate_search": {
-          "type": "truncate",
-          "length": 10
-        }
-      },
       "analyzer": {
         "autocomplete": {
           "tokenizer": "autocomplete",
@@ -240,8 +230,7 @@ PUT my_index
           ]
         },
         "autocomplete_search": {
-          "tokenizer": "lowercase",
-          "filter": "truncate_search"
+          "tokenizer": "lowercase"
         }
       },
       "tokenizer": {

From 0965a9dd3ad35d4ca0c39d7c84ec851c37ed8bd7 Mon Sep 17 00:00:00 2001
From: James Rodewig
Date: Wed, 13 Nov 2019 12:37:22 -0500
Subject: [PATCH 3/4] [DOCS] Note limitations of `max_gram` for index analyzers

---
 .../tokenizers/edgengram-tokenizer.asciidoc        | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
index 8d737a5995952..6b49c0d9200c8 100644
--- a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
+++ b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
@@ -72,12 +72,19 @@ configure the `edge_ngram` before using it.
 
 The `edge_ngram` tokenizer accepts the following parameters:
 
-[horizontal]
 `min_gram`::
     Minimum length of characters in a gram. Defaults to `1`.
 
 `max_gram`::
-    Maximum length of characters in a gram. Defaults to `2`.
++
+--
+Maximum length of characters in a gram. Defaults to `2`.
+
+[NOTE]
+======
+
+======
+--
 
 `token_chars`::
 
@@ -93,6 +100,27 @@ Character classes may be any of the following:
 * `punctuation` -- for example `!` or `"`
 * `symbol` -- for example `$` or `√`
 
+
+=== Limitations of the `max_gram` value
+
+The `edge_ngram` tokenizer's `max_gram` value limits the character length of
+tokens. When the `edge_ngram` tokenizer is used as an index analyzer, this means
+search terms longer than the `max_gram` length may not match any indexed terms.
+
+For example, if the `max_gram` is `3`, searches for `apple` won't match the
+indexed term `app`.
+
+To account for this, you can use the <<analysis-truncate-tokenfilter,truncate
+token filter>> to truncate search terms to the `max_gram` character
+length in a search analyzer. However, this could return irrelevant results.
+
+For example, if the `max_gram` is `3` and search terms are truncated to three
+characters, searches for `apple` would return any indexed term beginning with
+`app`, including `apply`, `applause`, and `apple`.
+
+As a result, we recommend testing both approaches to see which best fits your
+use case.
+
 [float]
 === Example configuration
 
@@ -209,9 +237,9 @@ The above example produces the following terms:
 ---------------------------
 
 Usually we recommend using the same `analyzer` at index time and at search
-time. In the case of the `edge_ngram` tokenizer, the advice is different.  It
+time. In the case of the `edge_ngram` tokenizer, the advice is different. It
 only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
-that partial words are available for matching in the index.  At search time,
+that partial words are available for matching in the index. At search time,
 just search for the terms the user has typed in, for instance: `Quick Fo`.
 
 Below is an example of how to set up a field for _search-as-you-type_:

From a28b46999a7d075e9eff6fdfd6918033b0528e11 Mon Sep 17 00:00:00 2001
From: James Rodewig
Date: Wed, 13 Nov 2019 12:40:04 -0500
Subject: [PATCH 4/4] replace note. add anchor.

---
 .../tokenizers/edgengram-tokenizer.asciidoc        | 33 ++++++++++++++------------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
index 6b49c0d9200c8..814a1bb633edb 100644
--- a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
+++ b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
@@ -80,10 +80,7 @@ The `edge_ngram` tokenizer accepts the following parameters:
 --
 Maximum length of characters in a gram. Defaults to `2`.
 
-[NOTE]
-======
-
-======
+See <<max-gram-limits>>.
 --
 
 `token_chars`::
@@ -100,26 +97,28 @@ Character classes may be any of the following:
 * `punctuation` -- for example `!` or `"`
 * `symbol` -- for example `$` or `√`
 
-
-=== Limitations of the `max_gram` value
+[[max-gram-limits]]
+=== Limitations of the `max_gram` parameter
 
 The `edge_ngram` tokenizer's `max_gram` value limits the character length of
-tokens. When the `edge_ngram` tokenizer is used as an index analyzer, this means
-search terms longer than the `max_gram` length may not match any indexed terms.
+tokens. When the `edge_ngram` tokenizer is used with an index analyzer, this
+means search terms longer than the `max_gram` length may not match any indexed
+terms.
 
 For example, if the `max_gram` is `3`, searches for `apple` won't match the
 indexed term `app`.
 
 To account for this, you can use the <<analysis-truncate-tokenfilter,truncate
-token filter>> to truncate search terms to the `max_gram` character
-length in a search analyzer. However, this could return irrelevant results.
+token filter>> with a search analyzer to shorten search terms to
+the `max_gram` character length. However, this could return irrelevant results.
 
 For example, if the `max_gram` is `3` and search terms are truncated to three
-characters, searches for `apple` would return any indexed term beginning with
-`app`, including `apply`, `applause`, and `apple`.
+characters, the search term `apple` is shortened to `app`. This means searches
+for `apple` return any indexed terms matching `app`, such as `apply`, `applause`,
+and `apple`.
 
-As a result, we recommend testing both approaches to see which best fits your
-use case.
+We recommend testing both approaches to see which best fits your
+use case and desired search experience.
 
 [float]
 === Example configuration
@@ -242,7 +241,11 @@ only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
 that partial words are available for matching in the index. At search time,
 just search for the terms the user has typed in, for instance: `Quick Fo`.
 
-Below is an example of how to set up a field for _search-as-you-type_:
+Below is an example of how to set up a field for _search-as-you-type_.
+
+Note that the `max_gram` value for the index analyzer is `10`, which limits
+indexed terms to 10 characters. Search terms are not truncated, meaning that
+search terms longer than 10 characters may not match any indexed terms.
 
 [source,console]
 -----------------------------------
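
For reviewers who want to reproduce the `apple`/`app` limitation these patches
document, here is a minimal sketch. It is not part of the patches; the index
and analyzer names (`test_index`, `trigram_tokenizer`, `trigram_autocomplete`)
are hypothetical, and the `max_gram` of `3` mirrors the hypothetical value in
the new "Limitations" section:

[source,console]
--------------------------------------------------
PUT test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3,
          "token_chars": [ "letter" ]
        }
      },
      "analyzer": {
        "trigram_autocomplete": {
          "tokenizer": "trigram_tokenizer"
        }
      }
    }
  }
}

POST test_index/_analyze
{
  "analyzer": "trigram_autocomplete",
  "text": "apple"
}
--------------------------------------------------

The `_analyze` call returns only the tokens `a`, `ap`, and `app`, so a query
for the full term `apple` has no indexed term to match unless the search
analyzer truncates it.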
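The `truncate` filter behavior that patch 1 wires into `autocomplete_search`
can also be checked on its own. A sketch using the `_analyze` API with an
inline filter definition; the `length` of `10` mirrors the example's
`max_gram`, and the sample text is arbitrary:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "lowercase",
  "filter": [
    {
      "type": "truncate",
      "length": 10
    }
  ],
  "text": "Internationalization"
}
--------------------------------------------------

This returns the single token `internatio`: the `lowercase` tokenizer used by
`autocomplete_search` lowercases the input, and the filter cuts it to ten
characters, the same length as the longest indexed edge n-gram.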
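Finally, the `Quick Fo` behavior described in the prose can be sketched end to
end. This assumes `my_index` also maps a field (here called `title`, a
hypothetical name; the hunks above do not show the mappings section) with
`analyzer: autocomplete` and `search_analyzer: autocomplete_search`:

[source,console]
--------------------------------------------------
PUT my_index/_doc/1?refresh
{
  "title": "Quick Foxes"
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo",
        "operator": "and"
      }
    }
  }
}
--------------------------------------------------

At index time, `Quick Foxes` is stored as lowercased edge n-grams (`q`, `qu`,
... `quick`, `f`, `fo`, ... `foxes`); at search time, `Quick Fo` is tokenized
to `quick` and `fo`, both of which are present in the index, so the document
matches.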