Description
It would be nice if kuromoji_tokenizer supported loading a user dictionary from an array of dictionary entries given directly in the settings JSON, not only from a file.
A current settings example looks like this:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
My suggestion is to add a new JSON property named user_dictionary_entries (or similar) at the same level as the current user_dictionary, accepting an array of dictionary entries. If both user_dictionary and user_dictionary_entries are given, the tokenizer has to either merge both inputs or use only one of them; I think simply prioritizing one of them would be simpler. This is quite similar to what the Synonym Token Filter already supports.
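The "prioritize one input" option could be sketched roughly as follows. This is only an illustration in Python, not the plugin's actual code: the function names resolve_user_dictionary and parse_entry, the read_file callback, and the choice to let inline entries win over the file are all assumptions made for the example (the proposal leaves open which input takes priority). The four-field layout (surface form, segmentation, readings, part of speech) is inferred from the sample entry in this issue, and skipping "#" comment lines is likewise an assumption.

```python
from typing import Callable, Dict, List


def resolve_user_dictionary(
    tokenizer_settings: Dict[str, object],
    read_file: Callable[[str], str],
) -> List[str]:
    """Return raw user dictionary lines from the tokenizer settings.

    Assumption: inline entries take priority; the file is read only
    when no inline entries are configured.
    """
    inline = tokenizer_settings.get("user_dictionary_entries")
    path = tokenizer_settings.get("user_dictionary")
    if inline is not None:
        lines = [str(entry) for entry in inline]
    elif path is not None:
        lines = read_file(str(path)).splitlines()
    else:
        lines = []
    # Drop blank lines and lines that start with '#' (assumed comment syntax).
    return [ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]


def parse_entry(line: str) -> Dict[str, object]:
    """Split one entry into its four comma-separated fields."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    surface, segmentation, readings, pos = fields
    segments = segmentation.split()
    reading_list = readings.split()
    # Each segment needs exactly one reading.
    if len(segments) != len(reading_list):
        raise ValueError(f"segment/reading count mismatch: {line!r}")
    return {
        "surface": surface,
        "segments": segments,
        "readings": reading_list,
        "pos": pos,
    }
```

With this sketch, settings that contain both properties would ignore the file entirely, which keeps the behavior easy to explain; merging both inputs would instead need a rule for duplicate surface forms.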
So the new JSON format would be:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary_entries": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
              "..."
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
If this sounds good to you, I can create a pull request anytime. Thank you!