Commit 767836c

[DOCS] Reformat kstem token filter (#55823)
Makes the following changes to the `kstem` token filter docs:

* Rewrites the description and adds a Lucene link
* Adds a detailed analyze example
* Adds an analyzer example
1 parent 6a0e1e1 commit 767836c

1 file changed: +109 -3 lines changed

docs/reference/analysis/tokenfilters/kstem-tokenfilter.asciidoc

Lines changed: 109 additions & 3 deletions
@@ -4,6 +4,112 @@
 <titleabbrev>KStem</titleabbrev>
 ++++
 
-The `kstem` token filter is a high performance filter for english. All
-terms must already be lowercased (use `lowercase` filter) for this
-filter to work correctly.
+Provides http://ciir.cs.umass.edu/pubfiles/ir-35.pdf[KStem]-based stemming for
+the English language. The `kstem` filter combines
+<<algorithmic-stemmers,algorithmic stemming>> with a built-in
+<<dictionary-stemmers,dictionary>>.
+
+The `kstem` filter tends to stem less aggressively than other English stemmer
+filters, such as the <<analysis-porterstem-tokenfilter,`porter_stem`>> filter.
+
+The `kstem` filter is equivalent to the
+<<analysis-stemmer-tokenfilter,`stemmer`>> filter's
+<<analysis-stemmer-tokenfilter-language-parm,`light_english`>> variant.
+
+This filter uses Lucene's
+{lucene-analysis-docs}s/en/KStemFilter.html[KStemFilter].
+
+[[analysis-kstem-tokenfilter-analyze-ex]]
+==== Example
+
+The following analyze API request uses the `kstem` filter to stem `the foxes
+jumping quickly` to `the fox jump quick`:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "standard",
+  "filter": [ "kstem" ],
+  "text": "the foxes jumping quickly"
+}
+----
+
+The filter produces the following tokens:
+
+[source,text]
+----
+[ the, fox, jump, quick ]
+----
+
+////
+[source,console-result]
+----
+{
+  "tokens": [
+    {
+      "token": "the",
+      "start_offset": 0,
+      "end_offset": 3,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "fox",
+      "start_offset": 4,
+      "end_offset": 9,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "jump",
+      "start_offset": 10,
+      "end_offset": 17,
+      "type": "<ALPHANUM>",
+      "position": 2
+    },
+    {
+      "token": "quick",
+      "start_offset": 18,
+      "end_offset": 25,
+      "type": "<ALPHANUM>",
+      "position": 3
+    }
+  ]
+}
+----
+////
+
+[[analysis-kstem-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`kstem` filter to configure a new <<analysis-custom-analyzer,custom
+analyzer>>.
+
+[IMPORTANT]
+====
+To work properly, the `kstem` filter requires lowercase tokens. To ensure tokens
+are lowercased, add the <<analysis-lowercase-tokenfilter,`lowercase`>> filter
+before the `kstem` filter in the analyzer configuration.
+====
+
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [
+            "lowercase",
+            "kstem"
+          ]
+        }
+      }
+    }
+  }
+}
+----
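
One way to exercise the analyzer added in this change is to run mixed-case text through the analyze API. The request below is a minimal sketch and is not part of the commit: it assumes the `my_index` index from the create index request has already been created, and the sample text is an illustrative choice.

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The Foxes Jumping Quickly"
}
----

Because `lowercase` comes before `kstem` in the filter list, the stemmer only receives lowercased tokens, which is what the `[IMPORTANT]` note in the diff requires.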
