
Kuromoji analysis part-of-speech filter not working #26519

@avdv

Description

Elasticsearch version (bin/elasticsearch --version): 5.5.2

Plugins installed: [analysis-icu, analysis-smartcn, ingest-geoip, x-pack, analysis-kuromoji, analysis-stempel, ingest-user-agent]

JVM version (java -version):

openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux 4.9.47-1-lts #1 SMP Sat Sep 2 09:26:00 CEST 2017 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I am trying to migrate from Elasticsearch 2.4 to 5.x. Basically everything works as expected, but the kuromoji_part_of_speech filter no longer removes the default stoptags when none are configured explicitly; this worked correctly in 2.4.

Steps to reproduce:

  1. Create an index with the kuromoji tokenizer and a part-of-speech filter with explicit stoptags:
$ http PUT :32769/kuromoji_sample <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
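For completeness, the same filter can also be tested directly against the index, without going through the custom analyzer. In 5.x the _analyze API should accept an explicit tokenizer and filter chain that references filters defined in the index settings (same HTTPie setup as above; filter:= passes the array as raw JSON):

$ http :32769/kuromoji_sample/_analyze tokenizer=kuromoji_tokenizer filter:='["my_posfilter"]' text="寿司がおいしいね"

This should return the same two tokens as the analyzer-based call in step 2.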
  2. Analyze the text "寿司がおいしいね":
$ http :32769/kuromoji_sample/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        }
    ]
}

Here the "が" and "ね" characters are correctly removed.

  3. Create an index the same way as in step 1, but do not specify the stoptags:
$ http PUT :32769/kuromoji_sample_2 <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech"
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
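Just as a sanity check before re-running the analysis, the stored index settings can be fetched to confirm that no stoptags were persisted for the filter:

$ http :32769/kuromoji_sample_2/_settings

The response should show my_posfilter with only "type": "kuromoji_part_of_speech" and no stoptags entry.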
  4. Analyze the text "寿司がおいしいね" again:
$ http :32769/kuromoji_sample_2/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 3,
            "position": 1,
            "start_offset": 2,
            "token": "",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        },
        {
            "end_offset": 8,
            "position": 3,
            "start_offset": 7,
            "token": "",
            "type": "word"
        }
    ]
}

Here the "が" and "ね" tokens are not removed, even though the default stoptags should filter them out. This example is taken from the documentation page: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-speech.html

That page says that stoptags is "An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar".

I have looked at the embedded file in that jar and could not find any difference from the version shipped with the 2.4 kuromoji plugin.
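In case someone wants to reproduce that comparison, the embedded file can be extracted straight from the plugin's Lucene jar (the plugin directory and jar name below are assumptions and may differ between installs):

$ unzip -p plugins/analysis-kuromoji/lucene-analyzers-kuromoji-*.jar org/apache/lucene/analysis/ja/stoptags.txt | grep -v '^#'

The non-commented lines are the default stoptags, and both 助詞-格助詞-一般 and 助詞-終助詞 should be among them.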

I also tried defining an empty stoptags array, and a combination of Latin-character tags, but it always returns four tokens instead of two.
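As a workaround, spelling out the default tags from stoptags.txt explicitly in the filter definition should behave like 2.4 did, given that the explicit configuration in step 1 works. Sketch only; the full tag list has to be copied from the file in the jar:

        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }

This is obviously not a fix, since the documented behaviour is that the embedded stoptags.txt is used whenever stoptags is omitted.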
