
Conversation

@jrodewig (Contributor)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character length. Autocomplete searches for terms longer than this limit return no results.

To prevent this, you can use the `truncate` token filter to truncate tokens to the `max_gram` character length.

This adds some advisory text and updates an example snippet in the edge ngram docs.

Closes #48956.
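
For readers coming from the issue, here is a minimal sketch of the failure mode being described, assuming an `edge_ngram` analyzer at index time paired with the default `standard` analyzer at search time. All index, analyzer, and field names are illustrative, not taken from the docs change:

```console
# Hypothetical index: edge_ngram analyzer at index time, default
# standard analyzer at search time. All names are illustrative.
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_tok": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [ "letter" ]
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "edge_tok",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

# The longest gram indexed for "abcdefgh" is the 5-character "abcde".
PUT my-index/_doc/1?refresh
{ "title": "abcdefgh" }

# The 7-character query term is left intact by the standard search
# analyzer, so it matches nothing.
GET my-index/_search
{ "query": { "match": { "title": "abcdefg" } } }
```

Because `max_gram` is 5, the index only ever contains edge grams up to five characters, so the untruncated seven-character query term has nothing to match against.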

@jrodewig jrodewig added the >docs (General docs changes), :Search Relevance/Analysis (How text is split into tokens), v8.0.0, v7.5.0, v7.6.0, and v7.4.3 labels on Nov 12, 2019
@jrodewig jrodewig requested a review from jtibshirani November 12, 2019 20:23
@elasticmachine (Collaborator)

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine (Collaborator)

Pinging @elastic/es-docs (>docs)

@cbuescher cbuescher (Member) left a comment

@jrodewig I took a quick look at the issue yesterday; there is a downside to the suggested approach as well, which I pointed out in the comment. I'm afraid there might not be a perfect, simple solution here, so I don't want to block this docs change. Maybe we should just clearly point out the caveats: this documentation should really just explain the filter's functionality in a clear way. Using it as a drop-in search-as-you-type replacement has its own issues, which might be better treated somewhere else in more depth.


We also recommend using the <<analysis-truncate-tokenfilter,`truncate` token
filter>> with a custom search analyzer to truncate tokens to the `max_gram`
character length. Otherwise, searches for longer terms could return no results.
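
For context, a rough sketch of the kind of setup this added text recommends: a search analyzer whose `truncate` filter length matches `max_gram`. The analyzer and filter names here are illustrative, not part of the docs change:

```console
# Hypothetical fix: a search analyzer that truncates query terms to the
# same length as max_gram (5 here). Names are illustrative.
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_tok": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [ "letter" ]
        }
      },
      "filter": {
        "truncate_to_gram": {
          "type": "truncate",
          "length": 5
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "edge_tok",
          "filter": [ "lowercase" ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase",
          "filter": [ "truncate_to_gram" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
```

With the `truncate` filter's `length` matching `max_gram`, query terms longer than five characters are cut down to a prefix that can still match an indexed edge gram.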
@cbuescher cbuescher (Member) left a comment

I looked at the issue briefly yesterday and think that while the `truncate` filter helps avoid empty results, it creates other problems. With this setup, we now get the same result set for every search term that exceeds the truncate length, as long as the prefix matches. This might also be unexpected for users looking for true search-as-you-type functionality.

More generally speaking: if `max_gram` and the truncate length are set to 5 and we have the terms "abcdefgh" and "abcdexyz" in the index, searching for anything up to "abcde" will return both docs as expected, but searching for e.g. "abcdexy" will also return both documents, which might be a bit odd. I'm not against adding this filter here, but we should probably explain this trade-off. I'm not sure what a better solution would be, to be honest, other than pointing to the new `search_as_you_type` datatype, which hopefully does a better job with this.
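
To make that trade-off concrete, here is a hypothetical walk-through using the illustrative five-character index and analyzers sketched earlier in this thread (all names and documents are made up):

```console
# Both documents index the edge grams a, ab, abc, abcd, abcde.
PUT my-index/_doc/1?refresh
{ "title": "abcdefgh" }

PUT my-index/_doc/2?refresh
{ "title": "abcdexyz" }

# The search analyzer truncates "abcdexy" to the single token "abcde".
GET my-index/_analyze
{ "analyzer": "autocomplete_search", "text": "abcdexy" }

# Because both documents contain the gram "abcde", this query returns
# both of them, even though only document 2 starts with "abcdexy".
GET my-index/_search
{ "query": { "match": { "title": "abcdexy" } } }
```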

@jrodewig (Contributor, Author)

Thanks for taking a look @cbuescher. I'll amend the docs further to better explain the tradeoffs.

@jrodewig jrodewig added the WIP label Nov 13, 2019
@cbuescher cbuescher (Member) left a comment

Absolutely great wording, thanks for the update. LGTM

@jrodewig jrodewig requested a review from cbuescher November 13, 2019 17:41
@jrodewig (Contributor, Author)

@cbuescher I added a new section to outline some of the limitations of `max_gram` and to better document the tradeoffs you mentioned.

I also decided to revert my changes to the example and add an advisory paragraph instead.

Let me know if this fits what you had in mind. Thanks again for looking this over!

@jrodewig jrodewig changed the title from "[DOCS] Add truncate filter to edge_ngram tokenizer example" to "[DOCS] Note limitations of max_gram parm in edge_ngram tokenizer for index analyzers" on Nov 13, 2019
@jrodewig jrodewig merged commit 2fe9ba5 into elastic:master Nov 13, 2019
@jrodewig jrodewig deleted the add-truncate-filter-to-ngram-ex branch November 13, 2019 19:27
jrodewig added a commit that referenced this pull request Nov 13, 2019
…for index analyzers (#49007)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.

To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.

This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.

Closes #48956.
@jrodewig jrodewig added v6.8.5 and removed WIP labels Nov 13, 2019