[DOCS] Note limitations of max_gram parm in edge_ngram tokenizer for index analyzers
#49007
Conversation
The `edge_ngram` tokenizer limits tokens to the `max_gram` character length. Autocomplete searches for terms longer than this limit return no results. To prevent this, you can use the `truncate` token filter to truncate tokens to the `max_gram` character length.

This adds some advisory text and updates an example snippet in the edge ngram docs.

Closes #48956.
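As a concrete illustration of the approach discussed here, the sketch below pairs an `edge_ngram` tokenizer in the index analyzer with a search analyzer that uses the `truncate` token filter, so query terms longer than `max_gram` are cut down to the same length. The index, analyzer, tokenizer, and field names are placeholders, not the exact snippet from the docs change:

```console
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete_edge",
          "filter": [ "lowercase" ]
        },
        "autocomplete_search": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "truncate_to_max_gram" ]
        }
      },
      "tokenizer": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "filter": {
        "truncate_to_max_gram": {
          "type": "truncate",
          "length": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
```

With this mapping, a query term longer than 10 characters is truncated at search time, so it can still match the indexed edge n-grams instead of returning no results. The trade-off this introduces is discussed in the review comments below.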
Pinging @elastic/es-search (:Search/Analysis)

Pinging @elastic/es-docs (>docs)
cbuescher left a comment:
@jrodewig I took a quick look at the issue yesterday; there is a downside to the suggested approach as well, which I pointed out in the comment. I'm afraid there might not be a perfect, simple solution here, so I don't want to block this docs change. Maybe we should just clearly point out the caveats here; this documentation should really just explain the filter functionality in a clear way. Using it as a drop-in search-as-you-type replacement seems to have its issues, which might be better treated somewhere else in more depth.
> We also recommend using the <<analysis-truncate-tokenfilter,`truncate` token
> filter>> with a custom search analyzer to truncate tokens to the `max_gram`
> character length. Otherwise, searches for longer terms could return no results.
I looked at the issue briefly yesterday and think that while the truncate filter helps with avoiding empty results, it creates other problems. When doing this, we now get the same result set for every search term that exceeds the truncate length, as long as the prefix matches. This might be unexpected for users looking for true search-as-you-type functionality as well.

More generally speaking: if `max_gram` and the truncate length are set to 5 and we have the terms "abcdefgh" and "abcdexyz" in the index, searching for anything up to "abcde" will return both docs as expected, but e.g. searching for "abcdexy" will also return both documents, which might be a bit odd. I'm not against adding this filter here, but we should probably explain this trade-off. I'm not sure what a better solution would be, tbh, other than pointing to the new `search_as_you_type` datatype, which hopefully does a better job with this.
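To make that trade-off concrete, assuming an index like the earlier sketch but with `max_gram` and the truncate filter's `length` both set to 5, the search analyzer reduces the longer query term to its first five characters:

```console
GET /my-index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "abcdexy"
}
```

This emits the single token `abcde`, which is also an indexed edge n-gram of both `abcdefgh` and `abcdexyz`, so a match query for `abcdexy` returns both documents even though only one of them starts with that prefix.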
Thanks for taking a look @cbuescher. I'll amend the docs further to better explain the tradeoffs.
cbuescher left a comment:
Absolutely great wording, thanks for the update. LGTM
@cbuescher I added a new section to outline some of the limitations of `max_gram` for index analyzers. I also decided to revert my changes to the example and add an advisory paragraph instead. Let me know if this fits what you had in mind. Thanks again for looking this over!
Changed the title: …truncate filter to edge_ngram tokenizer example → [DOCS] Note limitations of max_gram parm in edge_ngram tokenizer for index analyzers
[DOCS] Note limitations of max_gram parm in edge_ngram tokenizer for index analyzers (#49007)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character length. Autocomplete searches for terms longer than this limit return no results. To prevent this, you can use the `truncate` token filter to truncate tokens to the `max_gram` character length. However, this could return irrelevant results.

This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.

Closes #48956.