[DOCS] Reformat word_delimiter_graph token filter
#53170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOCS] Reformat word_delimiter_graph token filter
#53170
Conversation
Makes the following changes to the `word_delimiter_graph` token filter docs:

* Updates the Lucene experimental admonition
* Updates description
* Adds analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter list
* Expands and updates section re: differences between `word_delimiter` and `word_delimiter_graph`

Also updates the `trim` filter docs to note that the `trim` filter does not change token offsets. (Moved to #53220.)

Preview: http://elasticsearch_53170.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html
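For context, the kind of analyze snippet being added might look like the following. This is a hedged sketch: the sample text is illustrative and not necessarily the exact snippet used in the PR.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter_graph" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
```

With default settings, the filter splits on non-alphanumeric characters, case changes, and letter-number transitions, and strips the English possessive, producing tokens such as `Neil`, `Super`, `Duper`, `XL`, `500`, `42`, `Auto`, `Coder`.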
Pinging @elastic/es-search (:Search/Analysis)

Pinging @elastic/es-docs (>docs)
**romseygeek** left a comment
I left a bit of a rambly comment on when to use this filter. It's very easy to misuse it - @jimczi may have an opinion here as well.
> [ `the`, **`wifi`**, `wi`, `fi`, `is`, `enabled` ]
>
> This better preserves the token stream's original sequence and doesn't usually interfere with `match_phrase` or similar queries.
I don't think this is true? `catenate_X` parameters break phrase searching in general. For example, searching for the exact phrase `the wifi is enabled` won't match against the token stream above because `fi` introduces an extra position, so `is` is indexed as if it were two positions away from `wifi`. This is a hard problem in Lucene - we don't want to start indexing position lengths because that would make phrase queries much slower.
The advantage of the `_graph` variant is that it produces graphs which can be used at query time to generate several queries, so a query for `the wi-fi is enabled` will produce two phrase queries, `the wi fi is enabled` and `the wifi is enabled`. All good if you've indexed the phrase `the wi-fi is enabled`, as the first query will match. However, searching for `the wifi is enabled` won't match - it's all lowercase in the query, so the filter doesn't recognise the need to break it up, and in the index `wifi` is two positions away from `is`.
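To make the position gap concrete, an analyze request along these lines (the filter settings here are illustrative, not from the PR) reproduces the token stream being discussed:

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "catenate_words": true
    }
  ],
  "text": "the wi-fi is enabled"
}
```

In the response, `wifi` and `wi` share position 1 while `fi` occupies position 2, so `is` lands at position 3: two positions after `wifi`, which is exactly why the exact-phrase query fails at search time.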
Breaking up hyphenated words is tricky because of the possibility that people will try to search for the word without the hyphen; I think these are probably better dealt with via synonyms. A better use case for removing punctuation is for things like part numbers, where you only really want phrase searching within a multi-part token, so you use WDGF with a keyword tokenizer.
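A minimal sketch of that part-number setup, assuming a hypothetical index and analyzer name:

```console
PUT /part-numbers
{
  "settings": {
    "analysis": {
      "analyzer": {
        "part_number_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  }
}
```

Because the `keyword` tokenizer emits the whole input as a single token, the filter only splits within each part number (e.g. `XL500--42` into `XL`, `500`, `42`), so matching stays scoped to the multi-part token.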
Aha! The query time bit makes sense to me. I can also add some warnings to the catenate parameters so users know they'll break phrase searches. I'll also amend the intro a bit to cover the product/part number use case.
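For instance, the kind of custom-filter snippet in question (all names here are hypothetical) might pair such a warning with a configuration like:

```console
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_word_delimiter" ]
        }
      }
    }
  }
}
```

The accompanying warning would note that `catenate_all`, `catenate_words`, and `catenate_numbers` can break `match_phrase` queries, for the reasons discussed above.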
Thanks again for your feedback @romseygeek. I've made some adjustments throughout the page.

To take a closer look at the diagrams, you can use this preview link: http://elasticsearch_53170.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html
**romseygeek** left a comment
Thanks @jrodewig, this looks much better - the token graph diagrams especially are very helpful.
Thanks @romseygeek!