Multi-field is great, but searching with multiple analyzers against only one field is simpler/better.
If you have a multi-lingual index, where each document carries its source language, you can analyze the text fields with a specialized analyzer chosen according to the detected language (maybe even using the _analyzer.path functionality).
But what happens when you misdetect the language, either at index or at query time? Aggressive stemming can have devastating effects.
In such a scenario, having the original words indexed in parallel with the stemmed ones would help. Having them in the same field would even let phrase/slop queries work properly.
The only way to get multiple terms at the same position in ElasticSearch is the synonym token filter, which is useless for stemming.
I've been working on a way to merge the terms that multiple analyzers output.
Say you want to use both a simple analyzer and any of the specialized language-specific analyzers, or any other combination.
My plugin can make it as simple as the following index setting:
index:
  analysis:
    analyzer:
      # An analyzer using both the "simple" analyzer and the sophisticated "english" analyzer, combining the resulting terms
      combo_en:
        type: combo
        sub_analyzers: [simple, english]
Here is a simple example of what it does:
# What the "simple" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=simple' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  } ]
}
# What the "english" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=english' -d 'An example'
{
  "tokens" : [ {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
# Now what our combined analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=combo_en' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
Terms are sorted by position, then by start/end offset, so the output is easy to consume under the reasonable assumptions one makes of a classical analyzer.
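Conceptually this is a k-way merge of the sub-analyzers' token streams, each of which is already ordered. The following Python sketch is just an illustration of that ordering (not the plugin's actual Java/Lucene implementation); it reproduces the combined output above from the two sub-analyzer outputs:

```python
import heapq

# Token streams from the two sub-analyzers, as shown by the _analyze calls
# above, encoded as (position, start_offset, end_offset, token, type).
simple_tokens = [
    (1, 0, 2, "an", "word"),
    (2, 3, 10, "example", "word"),
]
english_tokens = [
    (2, 3, 10, "exampl", "<ALPHANUM>"),
]

# Each stream is already sorted by (position, start_offset, end_offset),
# so a streaming k-way merge on that key yields the combined stream;
# on equal keys, heapq.merge emits tokens from earlier streams first.
merged = list(heapq.merge(simple_tokens, english_tokens, key=lambda t: t[:3]))

for pos, start, end, token, ttype in merged:
    print(pos, token, ttype)
```

The plugin itself operates on Lucene TokenStreams, but the ordering guarantee it provides is the one sketched here.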
Here is the good news! You can find my implementation here: https://github.com/ofavre/elasticsearch/tree/combo-analyzer-v0.16.4 (based on released ElasticSearch version 0.16.4).
EDIT: It is finally available as a plugin, thanks to jprante: https://github.com/yakaz/elasticsearch-analysis-combo.