
Conversation

@jrodewig (Contributor)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character length. Autocomplete searches for terms longer than this limit return no results.

To prevent this, you can use the `truncate` token filter to truncate tokens to the `max_gram` character length.

This adds some advisory text and updates an example snippet in the edge ngram docs.

Closes #48956.
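
For readers coming from the issue, here is a minimal sketch of the failure mode being described, assuming an `edge_ngram` analyzer at index time paired with the default `standard` analyzer at search time. All index, analyzer, and field names are illustrative, not taken from the docs change:

```console
# Hypothetical index: edge_ngram analyzer at index time, default
# standard analyzer at search time. All names are illustrative.
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_tok": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [ "letter" ]
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "edge_tok",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

# The longest gram indexed for "abcdefgh" is the 5-character "abcde".
PUT my-index/_doc/1?refresh
{ "title": "abcdefgh" }

# The 7-character query term is left intact by the standard search
# analyzer, so it matches nothing.
GET my-index/_search
{ "query": { "match": { "title": "abcdefg" } } }
```

Because `max_gram` is 5, the index only ever contains edge grams up to five characters, so the untruncated seven-character query term has nothing to match against.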

@jrodewig jrodewig added the >docs (General docs changes), :Search Relevance/Analysis (How text is split into tokens), v8.0.0, v7.5.0, v7.6.0, and v7.4.3 labels on Nov 12, 2019
@jrodewig jrodewig requested a review from jtibshirani November 12, 2019 20:23
@elasticmachine (Collaborator)

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine (Collaborator)

Pinging @elastic/es-docs (>docs)

@cbuescher cbuescher (Member) left a comment

@jrodewig I took a quick look at the issue yesterday; there is a downside to the suggested approach as well, which I pointed out in the comment. I'm afraid there might not be a perfect, simple solution here, so I don't want to block this docs change. Maybe we should just clearly point out the caveats: this documentation should really just explain the filter's functionality in a clear way. Using it as a drop-in search-as-you-type replacement has its own issues, which might be better treated somewhere else in more depth.


We also recommend using the <<analysis-truncate-tokenfilter,`truncate` token
filter>> with a custom search analyzer to truncate tokens to the `max_gram`
character length. Otherwise, searches for longer terms could return no results.
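
For context, a rough sketch of the kind of setup this added text recommends: a search analyzer whose `truncate` filter length matches `max_gram`. The analyzer and filter names here are illustrative, not part of the docs change:

```console
# Hypothetical fix: a search analyzer that truncates query terms to the
# same length as max_gram (5 here). Names are illustrative.
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_tok": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [ "letter" ]
        }
      },
      "filter": {
        "truncate_to_gram": {
          "type": "truncate",
          "length": 5
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "edge_tok",
          "filter": [ "lowercase" ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase",
          "filter": [ "truncate_to_gram" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
```

With the `truncate` filter's `length` matching `max_gram`, query terms longer than five characters are cut down to a prefix that can still match an indexed edge gram.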
@cbuescher cbuescher (Member) left a comment

I looked at the issue briefly yesterday and think that while the `truncate` filter helps avoid empty results, it creates other problems. With this setup, we now get the same result set for every search term that exceeds the truncate length, as long as the prefix matches. This might also be unexpected for users looking for true search-as-you-type functionality.

More generally speaking: if `max_gram` and the truncate length are set to 5 and we have the terms "abcdefgh" and "abcdexyz" in the index, searching for anything up to "abcde" will return both docs as expected, but searching for e.g. "abcdexy" will also return both documents, which might be a bit odd. I'm not against adding this filter here, but we should probably explain this trade-off. I'm not sure what a better solution would be, to be honest, other than pointing to the new `search_as_you_type` datatype, which hopefully does a better job with this.
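
To make that trade-off concrete, here is a hypothetical walk-through using the illustrative five-character index and analyzers sketched earlier in this thread (all names and documents are made up):

```console
# Both documents index the edge grams a, ab, abc, abcd, abcde.
PUT my-index/_doc/1?refresh
{ "title": "abcdefgh" }

PUT my-index/_doc/2?refresh
{ "title": "abcdexyz" }

# The search analyzer truncates "abcdexy" to the single token "abcde".
GET my-index/_analyze
{ "analyzer": "autocomplete_search", "text": "abcdexy" }

# Because both documents contain the gram "abcde", this query returns
# both of them, even though only document 2 starts with "abcdexy".
GET my-index/_search
{ "query": { "match": { "title": "abcdexy" } } }
```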

@jrodewig (Contributor, Author)

Thanks for taking a look @cbuescher. I'll amend the docs further to better explain the tradeoffs.

@jrodewig jrodewig added the WIP label Nov 13, 2019
@cbuescher cbuescher (Member) left a comment

Absolutely great wording, thanks for the update. LGTM

@jrodewig jrodewig requested a review from cbuescher November 13, 2019 17:41
@jrodewig (Contributor, Author)

@cbuescher I added a new section to outline some of the limitations of `max_gram` and to better document the tradeoffs you mentioned.

I also decided to revert my changes to the example and add an advisory paragraph instead.

Let me know if this fits what you had in mind. Thanks again for looking this over!

@jrodewig jrodewig changed the title from "[DOCS] Add truncate filter to edge_ngram tokenizer example" to "[DOCS] Note limitations of max_gram parm in edge_ngram tokenizer for index analyzers" on Nov 13, 2019
@jrodewig jrodewig merged commit 2fe9ba5 into elastic:master Nov 13, 2019
@jrodewig jrodewig deleted the add-truncate-filter-to-ngram-ex branch November 13, 2019 19:27
jrodewig added a commit that referenced this pull request Nov 13, 2019
…for index analyzers (#49007)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.

To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.

This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.

Closes #48956.
@jrodewig jrodewig added v6.8.5 and removed WIP labels Nov 13, 2019