Commit e1eebea

[DOCS] Reformat remove_duplicates token filter (#53608)
Makes the following changes to the `remove_duplicates` token filter docs:

* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds custom analyzer example
1 parent 2c74f3e commit e1eebea

File tree

1 file changed: +148 -2 lines changed

docs/reference/analysis/tokenfilters/remove-duplicates-tokenfilter.asciidoc

Lines changed: 148 additions & 2 deletions
@@ -4,5 +4,151 @@
 <titleabbrev>Remove duplicates</titleabbrev>
 ++++
 
-A token filter of type `remove_duplicates` that drops identical tokens at the
-same position.
+Removes duplicate tokens in the same position.
+
+The `remove_duplicates` filter uses Lucene's
+{lucene-analysis-docs}/miscellaneous/RemoveDuplicatesTokenFilter.html[RemoveDuplicatesTokenFilter].
+
+[[analysis-remove-duplicates-tokenfilter-analyze-ex]]
+==== Example
+
+To see how the `remove_duplicates` filter works, you first need to produce a
+token stream containing duplicate tokens in the same position.
+
+The following <<indices-analyze,analyze API>> request uses the
+<<analysis-keyword-repeat-tokenfilter,`keyword_repeat`>> and
+<<analysis-stemmer-tokenfilter,`stemmer`>> filters to create stemmed and
+unstemmed tokens for `jumping dog`.
+
+[source,console]
+----
+GET _analyze
+{
+  "tokenizer": "whitespace",
+  "filter": [
+    "keyword_repeat",
+    "stemmer"
+  ],
+  "text": "jumping dog"
+}
+----
+
+The API returns the following response. Note that the `dog` token in position
+`1` is duplicated.
+
+[source,console-result]
+----
+{
+  "tokens": [
+    {
+      "token": "jumping",
+      "start_offset": 0,
+      "end_offset": 7,
+      "type": "word",
+      "position": 0
+    },
+    {
+      "token": "jump",
+      "start_offset": 0,
+      "end_offset": 7,
+      "type": "word",
+      "position": 0
+    },
+    {
+      "token": "dog",
+      "start_offset": 8,
+      "end_offset": 11,
+      "type": "word",
+      "position": 1
+    },
+    {
+      "token": "dog",
+      "start_offset": 8,
+      "end_offset": 11,
+      "type": "word",
+      "position": 1
+    }
+  ]
+}
+----
+
+To remove one of the duplicate `dog` tokens, add the `remove_duplicates` filter
+to the previous analyze API request.
+
+[source,console]
+----
+GET _analyze
+{
+  "tokenizer": "whitespace",
+  "filter": [
+    "keyword_repeat",
+    "stemmer",
+    "remove_duplicates"
+  ],
+  "text": "jumping dog"
+}
+----
+
+The API returns the following response. There is now only one `dog` token in
+position `1`.
+
+[source,console-result]
+----
+{
+  "tokens": [
+    {
+      "token": "jumping",
+      "start_offset": 0,
+      "end_offset": 7,
+      "type": "word",
+      "position": 0
+    },
+    {
+      "token": "jump",
+      "start_offset": 0,
+      "end_offset": 7,
+      "type": "word",
+      "position": 0
+    },
+    {
+      "token": "dog",
+      "start_offset": 8,
+      "end_offset": 11,
+      "type": "word",
+      "position": 1
+    }
+  ]
+}
+----
+
+[[analysis-remove-duplicates-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`remove_duplicates` filter to configure a new <<analysis-custom-analyzer,custom
+analyzer>>.
+
+This custom analyzer uses the `keyword_repeat` and `stemmer` filters to create a
+stemmed and unstemmed version of each token in a stream. The `remove_duplicates`
+filter then removes any duplicate tokens in the same position.
+
+[source,console]
+----
+PUT my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_custom_analyzer": {
+          "tokenizer": "standard",
+          "filter": [
+            "keyword_repeat",
+            "stemmer",
+            "remove_duplicates"
+          ]
+        }
+      }
+    }
+  }
+}
+----
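As an aside, not part of the diff above: one way to check that the new custom analyzer behaves as described is to run the analyze API against the index created by the `PUT my_index` request. This is a minimal sketch that reuses the `my_index` and `my_custom_analyzer` names from that request:

[source,console]
----
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "jumping dog"
}
----

With the `standard` tokenizer plus the `keyword_repeat`, `stemmer`, and `remove_duplicates` filters, this should return `jumping` and `jump` at position `0` and a single `dog` token at position `1`, matching the second analyze example in the docs.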
