| 1 | +[[token-graphs]] |
| 2 | +=== Token graphs |
| 3 | + |
| 4 | +When a <<analyzer-anatomy-tokenizer,tokenizer>> converts a text into a stream of |
| 5 | +tokens, it also records the following: |
| 6 | + |
| 7 | +* The `position` of each token in the stream |
| 8 | +* The `positionLength`, the number of positions that a token spans |
| 9 | + |
| 10 | +Using these, you can create a |
| 11 | +https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph], |
| 12 | +called a _token graph_, for a stream. In a token graph, each position represents |
| 13 | +a node. Each token represents an edge or arc, pointing to the next position. |
| 14 | + |
| 15 | +image::images/analysis/token-graph-qbf-ex.svg[align="center"] |
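The node-and-edge description above can be sketched in a few lines of Python. This is only an illustrative model, not how Lucene or Elasticsearch stores token streams: the `Token` type, the `build_token_graph` helper, and the sample `quick brown fox` stream are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token record holding the two attributes described above."""
    text: str
    position: int
    position_length: int = 1  # default span of one position


def build_token_graph(tokens):
    """Map each start position (node) to its outgoing (token, end position) arcs."""
    graph = defaultdict(list)
    for token in tokens:
        # A token is an arc from its position to position + positionLength.
        graph[token.position].append((token.text, token.position + token.position_length))
    return dict(graph)


# Assumed sample stream: every token spans exactly one position.
tokens = [Token("quick", 0), Token("brown", 1), Token("fox", 2)]
graph = build_token_graph(tokens)
```

In this sketch, `graph[0]` holds the single arc `("quick", 1)`, meaning `quick` starts at position `0` and ends at position `1`.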
| 16 | + |
| 17 | +[[token-graphs-synonyms]] |
| 18 | +==== Synonyms |
| 19 | + |
| 20 | +Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like |
| 21 | +synonyms, to an existing token stream. These synonyms often span the same |
| 22 | +positions as existing tokens. |
| 23 | + |
| 24 | +In the following graph, `quick` and its synonym `fast` both have a position of |
| 25 | +`0`. They span the same positions. |
| 26 | + |
| 27 | +image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"] |
| 28 | + |
| 29 | +[[token-graphs-multi-position-tokens]] |
| 30 | +==== Multi-position tokens |
| 31 | + |
| 32 | +Some token filters can add tokens that span multiple positions. These can |
| 33 | +include tokens for multi-word synonyms, such as using "atm" as a synonym for |
| 34 | +"automatic teller machine." |
| 35 | + |
| 36 | +However, only some token filters, known as _graph token filters_, accurately |
| 37 | +record the `positionLength` for multi-position tokens. These filters include: |
| 38 | + |
| 39 | +* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>> |
| 40 | +* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>> |
| 41 | + |
| 42 | +In the following graph, `domain name system` and its synonym, `dns`, both have a |
| 43 | +position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in |
| 44 | +the graph have a default `positionLength` of `1`. |
| 45 | + |
| 46 | +image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"] |
| 47 | + |
| 48 | +[[token-graphs-token-graphs-search]] |
| 49 | +===== Using token graphs for search |
| 50 | + |
| 51 | +<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute |
| 52 | +and does not support token graphs containing multi-position tokens. |
| 53 | + |
| 54 | +However, queries, such as the <<query-dsl-match-query,`match`>> or |
| 55 | +<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to |
| 56 | +generate multiple sub-queries from a single query string. |
| 57 | + |
| 58 | +.*Example* |
| 59 | +[%collapsible] |
| 60 | +==== |
| 61 | + |
| 62 | +A user runs a search for the following phrase using the `match_phrase` query: |
| 63 | + |
| 64 | +`domain name system is fragile` |
| 65 | + |
| 66 | +During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for |
| 67 | +`domain name system`, is added to the query string's token stream. The `dns` |
| 68 | +token has a `positionLength` of `3`. |
| 69 | + |
| 70 | +image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"] |
| 71 | + |
| 72 | +The `match_phrase` query uses this graph to generate sub-queries for the |
| 73 | +following phrases: |
| 74 | + |
| 75 | +[source,text] |
| 76 | +------ |
| 77 | +dns is fragile |
| 78 | +domain name system is fragile |
| 79 | +------ |
| 80 | + |
| 81 | +This means the query matches documents containing either `dns is fragile` _or_ |
| 82 | +`domain name system is fragile`. |
| 83 | +==== |
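One way to picture how a single query string yields several phrases is to walk every path through the token graph. The sketch below is a simplified model under stated assumptions, not the query-building code Elasticsearch actually runs: `enumerate_phrases` is a hypothetical helper, and tokens are plain `(text, position, positionLength)` tuples.

```python
def enumerate_phrases(tokens):
    """Return every phrase spelled by a start-to-end path through the token graph."""
    # Build arcs: start position -> list of (token text, end position).
    edges = {}
    for text, position, position_length in tokens:
        edges.setdefault(position, []).append((text, position + position_length))
    # The final node is the furthest end position reached by any token.
    end = max(target for arcs in edges.values() for _, target in arcs)

    def walk(node):
        if node == end:
            return [[]]  # one complete path with nothing left to add
        return [[text] + rest
                for text, target in edges.get(node, [])
                for rest in walk(target)]

    return [" ".join(path) for path in walk(0)]


# Token stream from the example: `dns` spans three positions.
query_tokens = [
    ("dns", 0, 3),
    ("domain", 0, 1), ("name", 1, 1), ("system", 2, 1),
    ("is", 3, 1), ("fragile", 4, 1),
]
phrases = enumerate_phrases(query_tokens)
```

Here `phrases` contains `dns is fragile` and `domain name system is fragile`, the same two phrases listed in the example.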
| 84 | + |
| 85 | +[[token-graphs-invalid-token-graphs]] |
| 86 | +===== Invalid token graphs |
| 87 | + |
| 88 | +The following token filters can add tokens that span multiple positions but |
| 89 | +only record a default `positionLength` of `1`: |
| 90 | + |
| 91 | +* <<analysis-synonym-tokenfilter,`synonym`>> |
| 92 | +* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>> |
| 93 | + |
| 94 | +This means these filters will produce invalid token graphs for streams |
| 95 | +containing such tokens. |
| 96 | + |
| 97 | +In the following graph, `dns` is a multi-position synonym for `domain name |
| 98 | +system`. However, `dns` has the default `positionLength` value of `1`, resulting |
| 99 | +in an invalid graph. |
| 100 | + |
| 101 | +image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"] |
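To see why the default `positionLength` causes trouble, it can help to walk the paths of such a graph. The following is a simplified sketch, not Elasticsearch's own logic: `enumerate_phrases` is a hypothetical helper, tokens are `(text, position, positionLength)` tuples, and the trailing `is fragile` tokens are assumed for illustration.

```python
def enumerate_phrases(tokens):
    """Return every phrase spelled by a start-to-end path through the token graph."""
    # Build arcs: start position -> list of (token text, end position).
    edges = {}
    for text, position, position_length in tokens:
        edges.setdefault(position, []).append((text, position + position_length))
    end = max(target for arcs in edges.values() for _, target in arcs)

    def walk(node):
        if node == end:
            return [[]]
        return [[text] + rest
                for text, target in edges.get(node, [])
                for rest in walk(target)]

    return [" ".join(path) for path in walk(0)]


# `dns` is recorded with the default positionLength of 1 instead of 3,
# so its arc ends at position 1 rather than position 3.
invalid_tokens = [
    ("dns", 0, 1),
    ("domain", 0, 1), ("name", 1, 1), ("system", 2, 1),
    ("is", 3, 1), ("fragile", 4, 1),
]
invalid_phrases = enumerate_phrases(invalid_tokens)
```

In this model the walk produces `dns name system is fragile` and never produces `dns is fragile`, which is one way an invalid graph can lead to unexpected matches.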
| 102 | + |
| 103 | +Avoid using invalid token graphs for search. Invalid graphs can cause unexpected |
| 104 | +search results. |