Support kuromoji user dictionary set directly in the settings file #25343

@tatsuya

Description

It would be nice if kuromoji_tokenizer supported loading a user dictionary from an array of dictionary entries given directly in the settings JSON, not only from a file.

A current settings example looks like this:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

My suggestion is to add a new JSON property named user_dictionary_entries (or similar) at the same level as the current user_dictionary, accepting an array of dictionary entries. If both user_dictionary and user_dictionary_entries are given, the tokenizer has to either merge both inputs or use only one of them; I think simply prioritizing one of the inputs would be simpler. This is pretty similar to what the Synonym Token Filter already supports.
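
To make the "prioritize one input" option concrete, here is a rough sketch in Python (for brevity only; it is not the plugin's actual code). The function name resolve_user_dictionary and the read_lines callback are hypothetical, and the choice to let inline entries win over the file is an assumption made for illustration, not a decided behavior.

```python
from typing import Callable, List, Mapping


def resolve_user_dictionary(
    tokenizer_settings: Mapping[str, object],
    read_lines: Callable[[str], List[str]],
) -> List[str]:
    """Return the user dictionary entries for one tokenizer definition.

    Hypothetical precedence (an assumption, not decided behavior):
    inline entries win, and the file is only read when they are absent.
    """
    inline = tokenizer_settings.get("user_dictionary_entries")
    if inline is not None:
        # Inline entries are given: use them and ignore "user_dictionary".
        return [str(entry) for entry in inline]

    path = tokenizer_settings.get("user_dictionary")
    if path is not None:
        # Fall back to the existing file-based behavior.
        return read_lines(str(path))

    # Neither property is set: no user dictionary.
    return []
```

The alternative (merging both inputs) would need a rule for duplicate entries, which is why picking one input looks simpler.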

So the new json format would be:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary_entries": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
              "..."
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
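
Each array element would use the same comma-separated layout as a line in the user dictionary file: surface form, space-separated segmentation, space-separated readings, and a part-of-speech tag. As an illustration only, a minimal Python sketch of splitting one such entry could look like the following. UserDictEntry and parse_entry are hypothetical names, and the sketch assumes plain commas with no CSV quoting or escaping.

```python
from typing import List, NamedTuple


class UserDictEntry(NamedTuple):
    surface: str          # text to match, e.g. 東京スカイツリー
    segments: List[str]   # how the surface form is split into tokens
    readings: List[str]   # one reading per segment
    pos: str              # part-of-speech tag


def parse_entry(line: str) -> UserDictEntry:
    """Split one 'surface,segmentation,readings,pos' entry into its fields."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")

    surface, segmentation, readings, pos = fields
    segments = segmentation.split()
    reading_list = readings.split()
    if len(segments) != len(reading_list):
        # Each segment needs exactly one reading.
        raise ValueError("segmentation and readings differ in length")

    return UserDictEntry(surface, segments, reading_list, pos)


entry = parse_entry("東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞")
```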

If this sounds good to you, I can create a pull request anytime. Thank you!
