-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Closed
Labels
:Search Relevance/AnalysisHow text is split into tokensHow text is split into tokens>bugTeam:Search RelevanceMeta label for the Search Relevance team in ElasticsearchMeta label for the Search Relevance team in Elasticsearchpriority:highA label for assessing bug priority to be used by ES engineersA label for assessing bug priority to be used by ES engineers
Description
6.5
To reproduce:
Create an index with a custom user dictionary for kuromoji_tokenizer:
{
"analysis": {
"tokenizer": {
"custom_user_dictionary": {
"type": "kuromoji_tokenizer",
"mode": "normal",
"discard_punctuation": "false",
"user_dictionary": "test.csv"
}
},
"analyzer": {
"japanese": {
"tokenizer": "custom_user_dictionary"
},
"japanese-search": {
"tokenizer": "custom_user_dictionary"
}
}
}
}
The above with the attached dictionary file test.txt results in an unsupported_operation_exception with no useful error messaging:
{
"error": {
"root_cause": [
{
"type": "unsupported_operation_exception",
"reason": null
}
],
"type": "unsupported_operation_exception",
"reason": null
},
"status": 500
}
Underlying Lucene/ES exception simply shows a null as well:
java.lang.UnsupportedOperationException: null
at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:97) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.apache.lucene.util.fst.Builder.add(Builder.java:445) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:131) ~[?:?]
at org.apache.lucene.analysis.ja.dict.UserDictionary.open(UserDictionary.java:81) ~[?:?]
at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.getUserDictionary(KuromojiTokenizerFactory.java:64) ~[?:?]
at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.<init>(KuromojiTokenizerFactory.java:50) ~[?:?]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:348) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:181) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:158) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.index.IndexService.<init>(IndexService.java:165) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:397) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:503) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:457) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:446) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.5.0.jar:6.5.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.5.0.jar:6.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
The issue here appears to be the duplicate entry in the user dictionary. Note that while I have already isolated this down to the duplication being an issue, for large user dictionaries it can be cumbersome to debug. It will be nice if we can handle duplicate entries automatically or provide a more useful message.
Metadata
Metadata
Assignees
Labels
:Search Relevance/AnalysisHow text is split into tokensHow text is split into tokens>bugTeam:Search RelevanceMeta label for the Search Relevance team in ElasticsearchMeta label for the Search Relevance team in Elasticsearchpriority:highA label for assessing bug priority to be used by ES engineersA label for assessing bug priority to be used by ES engineers