
Kuromoji analysis part-of-speech filter not working #26519

@avdv

Description

Elasticsearch version (bin/elasticsearch --version): 5.5.2

Plugins installed: [analysis-icu, analysis-smartcn, ingest-geoip, x-pack, analysis-kuromoji, analysis-stempel, ingest-user-agent]

JVM version (java -version):

openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux 4.9.47-1-lts #1 SMP Sat Sep 2 09:26:00 CEST 2017 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I am trying to migrate from Elasticsearch 2.4 to 5.x. Basically everything works as expected, but the kuromoji_part_of_speech filter no longer removes the default stoptags when none are configured explicitly; this worked correctly in 2.4.

Steps to reproduce:

  1. Create an index with the kuromoji tokenizer and a part-of-speech filter with explicit stoptags:
$ http PUT :32769/kuromoji_sample <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
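For completeness, the same filter can also be tested directly against the index, without going through the custom analyzer. In 5.x the _analyze API should accept an explicit tokenizer and filter chain that references filters defined in the index settings (same HTTPie setup as above; filter:= passes the array as raw JSON):

$ http :32769/kuromoji_sample/_analyze tokenizer=kuromoji_tokenizer filter:='["my_posfilter"]' text="寿司がおいしいね"

This should return the same two tokens as the analyzer-based call in step 2.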
  2. Analyze the text "寿司がおいしいね":
$ http :32769/kuromoji_sample/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        }
    ]
}

Here the "が" and "ね" characters are correctly removed.

  3. Create an index the same way as in step 1, but do not specify the stoptags:
$ http PUT :32769/kuromoji_sample_2 <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech"
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
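Just as a sanity check before re-running the analysis, the stored index settings can be fetched to confirm that no stoptags were persisted for the filter:

$ http :32769/kuromoji_sample_2/_settings

The response should show my_posfilter with only "type": "kuromoji_part_of_speech" and no stoptags entry.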
  4. Analyze the text "寿司がおいしいね" again:
$ http :32769/kuromoji_sample_2/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 3,
            "position": 1,
            "start_offset": 2,
            "token": "",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        },
        {
            "end_offset": 8,
            "position": 3,
            "start_offset": 7,
            "token": "",
            "type": "word"
        }
    ]
}

Here the "が" and "ね" tokens are not removed, even though the default stoptags should filter them out. This example is taken from the documentation page: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-speech.html

That page says that stoptags is "An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar".

I have looked at the embedded file in that jar and could not find any difference from the version shipped with the 2.4 kuromoji plugin.
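In case someone wants to reproduce that comparison, the embedded file can be extracted straight from the plugin's Lucene jar (the plugin directory and jar name below are assumptions and may differ between installs):

$ unzip -p plugins/analysis-kuromoji/lucene-analyzers-kuromoji-*.jar org/apache/lucene/analysis/ja/stoptags.txt | grep -v '^#'

The non-commented lines are the default stoptags, and both 助詞-格助詞-一般 and 助詞-終助詞 should be among them.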

I also tried defining an empty stoptags array, and a combination of Latin-character tags, but it always returns four tokens instead of two.
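As a workaround, spelling out the default tags from stoptags.txt explicitly in the filter definition should behave like 2.4 did, given that the explicit configuration in step 1 works. Sketch only; the full tag list has to be copied from the file in the jar:

        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }

This is obviously not a fix, since the documented behaviour is that the embedded stoptags.txt is used whenever stoptags is omitted.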
