Elasticsearch version (bin/elasticsearch --version): 5.5.2
Plugins installed: [analysis-icu, analysis-smartcn, ingest-geoip, x-pack, analysis-kuromoji, analysis-stempel, ingest-user-agent]
JVM version (java -version):
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux 4.9.47-1-lts #1 SMP Sat Sep 2 09:26:00 CEST 2017 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
I am migrating from Elasticsearch 2.4 to 5.x. Everything works as expected except for the kuromoji part-of-speech filter: when no stoptags are specified, it no longer removes the tokens matching the default stoptags, which worked correctly in 2.4.
Steps to reproduce:
- create an index with the kuromoji tokenizer and a part-of-speech filter
$ http PUT :32769/kuromoji_sample <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}'
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
  "acknowledged": true,
  "shards_acknowledged": true
}
- analyze the text "寿司がおいしいね"
$ http :32769/kuromoji_sample/_analyze analyzer=my_analyzer text="寿司がおいしいね"
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
  "tokens": [
    {
      "end_offset": 2,
      "position": 0,
      "start_offset": 0,
      "token": "寿司",
      "type": "word"
    },
    {
      "end_offset": 7,
      "position": 2,
      "start_offset": 3,
      "token": "おいしい",
      "type": "word"
    }
  ]
}
Here the "が" and "ね" tokens are correctly removed.
- create an index the same way as in step 1, but do not specify the stoptags:
$ http PUT :32769/kuromoji_sample_2 <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech"
          }
        }
      }
    }
  }
}'
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
  "acknowledged": true,
  "shards_acknowledged": true
}
- analyze the text "寿司がおいしいね" again
$ http :32769/kuromoji_sample_2/_analyze analyzer=my_analyzer text="寿司がおいしいね"
HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked
{
  "tokens": [
    {
      "end_offset": 2,
      "position": 0,
      "start_offset": 0,
      "token": "寿司",
      "type": "word"
    },
    {
      "end_offset": 3,
      "position": 1,
      "start_offset": 2,
      "token": "が",
      "type": "word"
    },
    {
      "end_offset": 7,
      "position": 2,
      "start_offset": 3,
      "token": "おいしい",
      "type": "word"
    },
    {
      "end_offset": 8,
      "position": 3,
      "start_offset": 7,
      "token": "ね",
      "type": "word"
    }
  ]
}
This example is taken from the documentation page here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-speech.html
That page says that stoptags is "An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar".
I have looked at the embedded file in that jar and could not find any difference from the version shipped with the 2.4 kuromoji plugin.
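For reference, the embedded file can be dumped like this (the jar name and resource path are my assumptions based on the Lucene source layout, so they may differ between versions):
$ unzip -p lucene-analyzers-kuromoji-*.jar \
    org/apache/lucene/analysis/ja/stoptags.txt | head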
I also tried defining stoptags as an empty array, and as tags written with Latin characters, but the analyzer always returns four tokens instead of two.
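The only workaround I have found so far is to spell out the tags explicitly, as in the first index above. A minimal sketch follows (the index name is arbitrary, and the two tags are just the ones from step 1; in practice the whole default list from stoptags.txt would have to be pasted in):
$ http PUT :32769/kuromoji_workaround <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}'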