Commit 8d5478f

[DOCS] Add token graph concept docs (#53339)
Adds conceptual docs for token graphs.

These docs cover:

* How a token graph is constructed from a token stream
* How synonyms and multi-position tokens impact token graphs
* How token graphs are used during search
* Why some token filters produce invalid token graphs

Also makes the following supporting changes:

* Adds anchors to the 'Anatomy of an Analyzer' docs for cross-linking
* Adds several SVGs for token graph diagrams

1 parent 7636930 commit 8d5478f

10 files changed: +420 -5 lines changed

docs/reference/analysis/anatomy.asciidoc

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,7 @@ blocks into analyzers suitable for different languages and types of text.
 Elasticsearch also exposes the individual building blocks so that they can be
 combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
 
+[[analyzer-anatomy-character-filters]]
 ==== Character filters
 
 A _character filter_ receives the original text as a stream of characters and
@@ -21,6 +22,7 @@ elements like `<b>` from the stream.
 An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
 which are applied in order.
 
+[[analyzer-anatomy-tokenizer]]
 ==== Tokenizer
 
 A _tokenizer_ receives a stream of characters, breaks it up into individual
@@ -35,6 +37,7 @@ the term represents.
 
 An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
 
+[[analyzer-anatomy-token-filters]]
 ==== Token filters
 
 A _token filter_ receives the token stream and may add, remove, or change

docs/reference/analysis/concepts.asciidoc

Lines changed: 3 additions & 1 deletion
@@ -8,6 +8,8 @@ This section explains the fundamental concepts of text analysis in {es}.
 
 * <<analyzer-anatomy>>
 * <<analysis-index-search-time>>
+* <<token-graphs>>
 
 include::anatomy.asciidoc[]
-include::index-search-time.asciidoc[]
+include::index-search-time.asciidoc[]
+include::token-graphs.asciidoc[]

docs/reference/analysis/token-graphs.asciidoc

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
+[[token-graphs]]
+=== Token graphs
+
+When a <<analyzer-anatomy-tokenizer,tokenizer>> converts a text into a stream of
+tokens, it also records the following:
+
+* The `position` of each token in the stream
+* The `positionLength`, the number of positions that a token spans
+
+Using these, you can create a
+https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph],
+called a _token graph_, for a stream. In a token graph, each position represents
+a node. Each token represents an edge or arc, pointing to the next position.
+
+image::images/analysis/token-graph-qbf-ex.svg[align="center"]
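
As a rough illustration of the mapping described above, the following Python sketch builds a token graph from a list of tokens. The `Token` type, the `build_token_graph` helper, and the `brown` and `fox` terms are assumptions made for this example. They are not part of {es} or Lucene.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token record: a term plus its position attributes."""
    term: str
    position: int
    position_length: int = 1  # a token spans one position by default


def build_token_graph(tokens):
    """Map each start position (node) to its outgoing (term, end position) arcs."""
    graph = defaultdict(list)
    for token in tokens:
        end = token.position + token.position_length
        graph[token.position].append((token.term, end))
    return dict(graph)


# Assumed example stream: three single-position tokens.
tokens = [Token("quick", 0), Token("brown", 1), Token("fox", 2)]
graph = build_token_graph(tokens)
```

Each key in `graph` is a node (a position), and each entry in its list is an arc labelled with a term and pointing to the position where the token ends.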
+
+[[token-graphs-synonyms]]
+==== Synonyms
+
+Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like
+synonyms, to an existing token stream. These synonyms often span the same
+positions as existing tokens.
+
+In the following graph, `quick` and its synonym `fast` both have a position of
+`0`. They span the same positions.
+
+image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"]
+
+[[token-graphs-multi-position-tokens]]
+==== Multi-position tokens
+
+Some token filters can add tokens that span multiple positions. These can
+include tokens for multi-word synonyms, such as using "atm" as a synonym for
+"automatic teller machine."
+
+However, only some token filters, known as _graph token filters_, accurately
+record the `positionLength` for multi-position tokens. These filters include:
+
+* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>>
+* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>>
+
+In the following graph, `domain name system` and its synonym, `dns`, both have a
+position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in
+the graph have a default `positionLength` of `1`.
+
+image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]
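
The `dns` example can be sketched in Python to show where each arc ends. The tuple layout `(term, position, positionLength)` is an assumption made for this illustration, not a format used by {es}.

```python
# Hypothetical (term, position, positionLength) triples for the stream
# "domain name system" with the multi-position synonym "dns".
tokens = [
    ("domain", 0, 1),
    ("dns", 0, 3),     # spans the same three positions as "domain name system"
    ("name", 1, 1),
    ("system", 2, 1),
]

# The node each token's arc points to: start position plus positionLength.
end_positions = {term: position + length for term, position, length in tokens}
```

Because `dns` starts at position `0` and has a `positionLength` of `3`, its arc ends at the same node as the arc for `system`.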
+
+[[token-graphs-token-graphs-search]]
+===== Using token graphs for search
+
+<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute
+and does not support token graphs containing multi-position tokens.
+
+However, queries, such as the <<query-dsl-match-query,`match`>> or
+<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to
+generate multiple sub-queries from a single query string.
+
+.*Example*
+[%collapsible]
+====
+
+A user runs a search for the following phrase using the `match_phrase` query:
+
+`domain name system is fragile`
+
+During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for
+`domain name system`, is added to the query string's token stream. The `dns`
+token has a `positionLength` of `3`.
+
+image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]
+
+The `match_phrase` query uses this graph to generate sub-queries for the
+following phrases:
+
+[source,text]
+------
+dns is fragile
+domain name system is fragile
+------
+
+This means the query matches documents containing either `dns is fragile` _or_
+`domain name system is fragile`.
+====
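
One way to picture how the phrases in the example are derived is to walk every path through the token graph. The following Python sketch does this with a simple recursive traversal. The `phrases` helper and the tuple layout are assumptions made for illustration, and the sketch does not claim to mirror how {es} or Lucene build sub-queries internally.

```python
def phrases(tokens):
    """Enumerate every path through a token graph as a space-joined phrase.

    `tokens` is a list of hypothetical (term, position, positionLength) triples.
    """
    arcs = {}
    for term, position, length in tokens:
        arcs.setdefault(position, []).append((term, position + length))
    last = max(position + length for _, position, length in tokens)

    def walk(node):
        if node == last:  # reached the final node: one complete path
            yield []
            return
        for term, next_node in arcs.get(node, []):
            for rest in walk(next_node):
                yield [term] + rest

    return sorted(" ".join(path) for path in walk(0))


query_tokens = [
    ("domain", 0, 1), ("dns", 0, 3), ("name", 1, 1), ("system", 2, 1),
    ("is", 3, 1), ("fragile", 4, 1),
]
result = phrases(query_tokens)
# result == ["dns is fragile", "domain name system is fragile"]
```

The two paths correspond to the two phrases listed in the example.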
+
+[[token-graphs-invalid-token-graphs]]
+===== Invalid token graphs
+
+The following token filters can add tokens that span multiple positions but
+only record a default `positionLength` of `1`:
+
+* <<analysis-synonym-tokenfilter,`synonym`>>
+* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>>
+
+This means these filters will produce invalid token graphs for streams
+containing such tokens.
+
+In the following graph, `dns` is a multi-position synonym for `domain name
+system`. However, `dns` has the default `positionLength` value of `1`, resulting
+in an invalid graph.
+
+image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"]
+
+Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
+search results.
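
To see why an incorrect `positionLength` matters, consider walking the paths of a graph in which `dns` records the default `positionLength` of `1`. The Python sketch below uses a hypothetical `phrases` helper and tuple layout. It is a simplified model for illustration only and does not reproduce what {es} actually does with such a stream.

```python
def phrases(tokens):
    """Enumerate every path through a token graph as a space-joined phrase.

    `tokens` is a list of hypothetical (term, position, positionLength) triples.
    """
    arcs = {}
    for term, position, length in tokens:
        arcs.setdefault(position, []).append((term, position + length))
    last = max(position + length for _, position, length in tokens)

    def walk(node):
        if node == last:  # reached the final node: one complete path
            yield []
            return
        for term, next_node in arcs.get(node, []):
            for rest in walk(next_node):
                yield [term] + rest

    return sorted(" ".join(path) for path in walk(0))


# "dns" should span three positions, but only the default length is recorded.
invalid_tokens = [
    ("domain", 0, 1), ("dns", 0, 1), ("name", 1, 1), ("system", 2, 1),
    ("is", 3, 1), ("fragile", 4, 1),
]
result = phrases(invalid_tokens)
# result == ["dns name system is fragile", "domain name system is fragile"]
```

In this model, the intended phrase `dns is fragile` never appears. Instead, `dns` is treated as an alternative to `domain` alone, which yields the nonsensical phrase `dns name system is fragile`.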

docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@ The `synonym_graph` token filter allows to easily handle synonyms,
 including multi-word synonyms correctly during the analysis process.
 
 In order to properly handle multi-word synonyms this token filter
-creates a "graph token stream" during processing. For more information
-on this topic and its various complexities, please read the
+creates a <<token-graphs,graph token stream>> during processing. For more
+information on this topic and its various complexities, please read the
 http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html[Lucene's TokenStreams are actually graphs] blog post.
 
 ["NOTE",id="synonym-graph-index-note"]

docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc

Lines changed: 2 additions & 2 deletions
@@ -440,8 +440,8 @@ that span multiple positions when any of the following parameters are `true`:
 
 However, only the `word_delimiter_graph` filter assigns multi-position tokens a
 `positionLength` attribute, which indicates the number of positions a token
-spans. This ensures the `word_delimiter_graph` filter always produces valid token
-https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs].
+spans. This ensures the `word_delimiter_graph` filter always produces valid
+<<token-graphs,token graphs>>.
 
 The `word_delimiter` filter does not assign multi-position tokens a
 `positionLength` attribute. This means it produces invalid graphs for streams

docs/reference/images/analysis/token-graph-dns-ex.svg

Lines changed: 65 additions & 0 deletions

docs/reference/images/analysis/token-graph-dns-invalid-ex.svg

Lines changed: 72 additions & 0 deletions

docs/reference/images/analysis/token-graph-dns-synonym-ex.svg

Lines changed: 72 additions & 0 deletions

docs/reference/images/analysis/token-graph-qbf-ex.svg

Lines changed: 45 additions & 0 deletions

docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg

Lines changed: 52 additions & 0 deletions
