 === Japanese (kuromoji) Analysis Plugin

 The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis
-module into elasticsearch.
+module into {es}.

 :plugin_name: analysis-kuromoji
 include::install_remove.asciidoc[]
@@ -23,6 +23,62 @@ The `kuromoji` analyzer consists of the following tokenizer and token filters:
 It supports the `mode` and `user_dictionary` settings from
 <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.

+[discrete]
+[[kuromoji-analyzer-normalize-full-width-characters]]
+==== Normalize full-width characters
+
+The `kuromoji_tokenizer` tokenizer uses characters from the MeCab-IPADIC
+dictionary to split text into tokens. The dictionary includes some full-width
+characters, such as `ｏ` and `ｆ`. If a text contains full-width characters,
+the `kuromoji_tokenizer` tokenizer can produce unexpected tokens.
+
+For example, the `kuromoji_tokenizer` tokenizer converts the text
+`Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ` to the tokens `[ culture, o, f, japan ]` by
+default. However, a user may expect the tokenizer to instead produce
+`[ culture, of, japan ]`.
+
+To avoid this, add the <<analysis-icu-normalization-charfilter,`icu_normalizer`
+character filter>> to a custom analyzer based on the `kuromoji` analyzer. The
+`icu_normalizer` character filter converts full-width characters to their normal
+equivalents.
+
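The conversion described above, folding full-width characters into their ASCII equivalents, is essentially Unicode NFKC compatibility normalization. As a rough illustration (this Python sketch is not part of the plugin; `icu_normalizer` performs the real work via ICU):

```python
import unicodedata

# Full-width text similar to the example above.
text = "Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ"

# NFKC compatibility normalization folds full-width Latin letters and
# the ideographic space (U+3000) into their ASCII equivalents, roughly
# what the icu_normalizer character filter does before tokenization.
normalized = unicodedata.normalize("NFKC", text)
print(normalized)  # Culture of Japan
```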
+First, duplicate the `kuromoji` analyzer to create the basis for a custom
+analyzer. Then add the `icu_normalizer` character filter to the custom analyzer.
+For example:
+
+[source,console]
+----
+PUT index-00001
+{
+  "settings": {
+    "index": {
+      "analysis": {
+        "analyzer": {
+          "kuromoji_normalize": { <1>
+            "char_filter": [
+              "icu_normalizer" <2>
+            ],
+            "tokenizer": "kuromoji_tokenizer",
+            "filter": [
+              "kuromoji_baseform",
+              "kuromoji_part_of_speech",
+              "cjk_width",
+              "ja_stop",
+              "kuromoji_stemmer",
+              "lowercase"
+            ]
+          }
+        }
+      }
+    }
+  }
+}
+----
+<1> Creates a new custom analyzer, `kuromoji_normalize`, based on the `kuromoji`
+analyzer.
+<2> Adds the `icu_normalizer` character filter to the analyzer.
+
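Once the index exists, the custom analyzer can be exercised with the `_analyze` API. For example, the following request (using the `index-00001` index created above) should return the tokens `[ culture, of, japan ]` rather than splitting the full-width `ｏｆ`:

[source,console]
----
GET index-00001/_analyze
{
  "analyzer": "kuromoji_normalize",
  "text": "Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ"
}
----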
+
 [[analysis-kuromoji-charfilter]]
 ==== `kuromoji_iteration_mark` character filter

@@ -208,6 +264,10 @@ The above `analyze` request returns the following:
 }
 --------------------------------------------------

+NOTE: If a text contains full-width characters, the `kuromoji_tokenizer`
+tokenizer can produce unexpected tokens. To avoid this, add the
+<<analysis-icu-normalization-charfilter,`icu_normalizer` character filter>> to
+your analyzer. See <<kuromoji-analyzer-normalize-full-width-characters>>.

 [[analysis-kuromoji-baseform]]
 ==== `kuromoji_baseform` token filter