[DOCS] Reformat word_delimiter_graph token filter
#53170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOCS] Reformat word_delimiter_graph token filter
#53170
Conversation
Makes the following changes to the `word_delimiter_graph` token filter docs:

* Updates the Lucene experimental admonition
* Updates description
* Adds analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter list
* Expands and updates section re: differences between `word_delimiter` and `word_delimiter_graph`

Also updates the `trim` filter docs to note that the `trim` filter does not change token offsets. (Moved to #53220.)

Preview: http://elasticsearch_53170.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html
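For context, the kind of analyze snippet being added might look like the following. This is a hedged sketch: the sample text is illustrative and not necessarily the exact snippet used in the PR.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter_graph" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
```

With default settings, the filter splits on non-alphanumeric characters, case changes, and letter-number transitions, and strips the English possessive, producing tokens such as `Neil`, `Super`, `Duper`, `XL`, `500`, `42`, `Auto`, `Coder`.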
Pinging @elastic/es-search (:Search/Analysis)

Pinging @elastic/es-docs (>docs)
**romseygeek** left a comment
I left a bit of a rambly comment on when to use this filter. It's very easy to misuse it - @jimczi may have an opinion here as well.
> [ `the`, **`wifi`**, `wi`, `fi`, `is`, `enabled` ]
>
> This better preserves the token stream's original sequence and doesn't usually interfere with `match_phrase` or similar queries.
I don't think this is true? `catenate_X` parameters break phrase searching in general. For example, searching for the exact phrase `the wifi is enabled` won't match against the token stream above because `fi` introduces an extra position, so `is` is indexed as if it were two positions away from `wifi`. This is a hard problem in Lucene - we don't want to start indexing position lengths because that would make phrase queries much slower.
The advantage of the `_graph` variant is that it produces graphs which can be used at query time to generate several queries, so a query for `the wi-fi is enabled` will produce two phrase queries, `the wi fi is enabled` and `the wifi is enabled`. All good if you've indexed the phrase `the wi-fi is enabled`, as the first query will match. However, searching for `the wifi is enabled` won't match - it's all lowercase in the query, so the filter doesn't recognise the need to break it up, and in the index `wifi` is two positions away from `is`.
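To make the position gap concrete, an analyze request along these lines (the filter settings here are illustrative, not from the PR) reproduces the token stream being discussed:

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "catenate_words": true
    }
  ],
  "text": "the wi-fi is enabled"
}
```

In the response, `wifi` and `wi` share position 1 while `fi` occupies position 2, so `is` lands at position 3: two positions after `wifi`, which is exactly why the exact-phrase query fails at search time.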
Breaking up hyphenated words is tricky because of the possibility that people will try to search for the word without the hyphen; I think these are probably better dealt with via synonyms. A better use case for removing punctuation is for things like part numbers, where you only really want phrase searching within a multi-part token, so you use WDGF with a keyword tokenizer.
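A minimal sketch of that part-number setup, assuming a hypothetical index and analyzer name:

```console
PUT /part-numbers
{
  "settings": {
    "analysis": {
      "analyzer": {
        "part_number_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  }
}
```

Because the `keyword` tokenizer emits the whole input as a single token, the filter only splits within each part number (e.g. `XL500--42` into `XL`, `500`, `42`), so matching stays scoped to the multi-part token.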
Aha! The query time bit makes sense to me. I can also add some warnings to the catenate parameters so users know they'll break phrase searches. I'll also amend the intro a bit to cover the product/part number use case.
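For instance, the kind of custom-filter snippet in question (all names here are hypothetical) might pair such a warning with a configuration like:

```console
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_word_delimiter" ]
        }
      }
    }
  }
}
```

The accompanying warning would note that `catenate_all`, `catenate_words`, and `catenate_numbers` can break `match_phrase` queries, for the reasons discussed above.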
Thanks again for your feedback @romseygeek. I've made some adjustments throughout the page.

To take a closer look at the diagrams, you can use this preview link: http://elasticsearch_53170.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html
**romseygeek** left a comment
Thanks @jrodewig, this looks much better - the token graph diagrams especially are very helpful.
Thanks @romseygeek!