Commit 721af72

[DOCS] Add full-width char section to kuromoji analyzer docs (#60317) (#60327)
1 parent 5670998 commit 721af72

1 file changed: +61 -1 lines changed

docs/plugins/analysis-kuromoji.asciidoc

Lines changed: 61 additions & 1 deletion
@@ -2,7 +2,7 @@
 === Japanese (kuromoji) Analysis Plugin
 
 The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis
-module into elasticsearch.
+module into {es}.
 
 :plugin_name: analysis-kuromoji
 include::install_remove.asciidoc[]
@@ -23,6 +23,62 @@ The `kuromoji` analyzer consists of the following tokenizer and token filters:
 It supports the `mode` and `user_dictionary` settings from
 <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.
 
+[discrete]
+[[kuromoji-analyzer-normalize-full-width-characters]]
+==== Normalize full-width characters
+
+The `kuromoji_tokenizer` tokenizer uses characters from the MeCab-IPADIC
+dictionary to split text into tokens. The dictionary includes some full-width
+characters, such as `ｏ` and `ｆ`. If a text contains full-width characters,
+the tokenizer can produce unexpected tokens.
+
+For example, the `kuromoji_tokenizer` tokenizer converts the text
+`Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ` to the tokens `[ culture, o, f, japan ]` by
+default. However, a user may expect the tokenizer to instead produce
+`[ culture, of, japan ]`.
+
+To avoid this, add the <<analysis-icu-normalization-charfilter,`icu_normalizer`
+character filter>> to a custom analyzer based on the `kuromoji` analyzer. The
+`icu_normalizer` character filter converts full-width characters to their normal
+equivalents.
+
+First, duplicate the `kuromoji` analyzer to create the basis for a custom
+analyzer. Then add the `icu_normalizer` character filter to the custom analyzer.
+For example:
+
+[source,console]
+----
+PUT index-00001
+{
+  "settings": {
+    "index": {
+      "analysis": {
+        "analyzer": {
+          "kuromoji_normalize": { <1>
+            "char_filter": [
+              "icu_normalizer" <2>
+            ],
+            "tokenizer": "kuromoji_tokenizer",
+            "filter": [
+              "kuromoji_baseform",
+              "kuromoji_part_of_speech",
+              "cjk_width",
+              "ja_stop",
+              "kuromoji_stemmer",
+              "lowercase"
+            ]
+          }
+        }
+      }
+    }
+  }
+}
+----
+<1> Creates a new custom analyzer, `kuromoji_normalize`, based on the `kuromoji`
+analyzer.
+<2> Adds the `icu_normalizer` character filter to the analyzer.
+
+
 [[analysis-kuromoji-charfilter]]
 ==== `kuromoji_iteration_mark` character filter
 
@@ -208,6 +264,10 @@ The above `analyze` request returns the following:
 }
 --------------------------------------------------
 
+NOTE: If a text contains full-width characters, the `kuromoji_tokenizer`
+tokenizer can produce unexpected tokens. To avoid this, add the
+<<analysis-icu-normalization-charfilter,`icu_normalizer` character filter>> to
+your analyzer. See <<kuromoji-analyzer-normalize-full-width-characters>>.
 
 [[analysis-kuromoji-baseform]]
 ==== `kuromoji_baseform` token filter
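
As a rough illustration of what the `icu_normalizer` character filter added in the second hunk does to full-width input such as `Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ`, the Python sketch below applies NFKC normalization followed by case folding, using only the standard library. This is an approximation of the filter's default `nfkc_cf` mode, not the ICU implementation, and it does not run the kuromoji tokenizer. The function name `normalize_full_width` is hypothetical and exists only for this example.

```python
import unicodedata


def normalize_full_width(text: str) -> str:
    """Approximate nfkc_cf: NFKC compatibility normalization, then case folding.

    Assumption: this mirrors the effect of the icu_normalizer character filter
    on full-width Latin letters and the ideographic space; it is not ICU itself.
    """
    return unicodedata.normalize("NFKC", text).casefold()


# Full-width Latin letters separated by ideographic spaces (U+3000).
full_width = "Ｃｕｌｔｕｒｅ\u3000ｏｆ\u3000Ｊａｐａｎ"

normalized = normalize_full_width(full_width)
print(normalized)          # culture of japan
print(normalized.split())  # ['culture', 'of', 'japan']
```

Because normalization happens before tokenization, the tokenizer sees ordinary half-width `of` rather than the two full-width characters `ｏ` and `ｆ`, which is why the character filter is placed ahead of `kuromoji_tokenizer` in the custom analyzer.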
