
Merging the terms from multiple sub-analyzers #1128

@ofavre


Multi-field is great, but searching with multiple analyzers against a single field is simpler and better.
If you have a multi-lingual index where each document records its source language, you can analyze the text fields with a specialized analyzer chosen by the detected language (maybe even using the _analyzer.path functionality).
But what happens when the language is misdetected somehow, either at index or at query time? Aggressive stemming can have devastating effects.

In such a scenario, having the original words indexed in parallel with the stemmed ones would help. Having them in the same field would even let phrase/slop queries work properly.
Currently, the only way to get multiple terms at the same position in ElasticSearch is the synonym token filter, which is of no use for stemming.

I've been working on a way to merge the terms that multiple analyzers output.
Say you want to use both a simple analyzer and any of the specialized language-specific analyzers, or any other combination.
My plugin can make it as simple as the following index setting:

index:
  analysis:
    analyzer:
      # An analyzer using both the "simple" analyzer and the sophisticated "english" analyzer, combining the resulting terms
      combo_en:
        type: combo
        sub_analyzers: [simple, english]
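For illustration, a mapping could then reference the combined analyzer like any other analyzer. The type and field names below are hypothetical, shown in the same YAML style as the settings above:

```yaml
# Hypothetical mapping applying the combo analyzer to a string field
mappings:
  document:
    properties:
      body:
        type: string
        analyzer: combo_en
```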

Here is a simple example of what it does:

# What the "simple" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=simple' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  } ]
}
# What the "english" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=english' -d 'An example'
{
  "tokens" : [ {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

# Now what our combined analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=combo_en' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

Terms are sorted by position, then by start/end offset, so that under reasonable assumptions consumers can treat the output like that of a classical analyzer.
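The merge itself amounts to an ordered union of the sub-analyzers' token streams. Here is a rough sketch of that ordering rule in Python (illustrative only; the plugin's actual implementation is Java working on Lucene TokenStreams):

```python
# Sketch of the combo analyzer's merge rule (illustrative, not the plugin's Java code).
# Each token is a tuple: (term, start_offset, end_offset, type, position).

def combo_merge(*token_streams):
    """Concatenate the sub-analyzers' tokens, then sort by position,
    then by start/end offset. Python's sort is stable, so tokens that
    tie on all keys keep their sub-analyzer order."""
    merged = [tok for stream in token_streams for tok in stream]
    merged.sort(key=lambda t: (t[4], t[1], t[2]))
    return merged

# Reproducing the "An example" walkthrough above:
simple_tokens = [("an", 0, 2, "word", 1), ("example", 3, 10, "word", 2)]
english_tokens = [("exampl", 3, 10, "<ALPHANUM>", 2)]

print(combo_merge(simple_tokens, english_tokens))
# [('an', 0, 2, 'word', 1), ('example', 3, 10, 'word', 2), ('exampl', 3, 10, '<ALPHANUM>', 2)]
```

Note that "example" and "exampl" share position 2 and the same offsets, which is exactly what makes phrase/slop queries keep working against the merged stream.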

Here is the good news: you can find my implementation at https://github.com/ofavre/elasticsearch/tree/combo-analyzer-v0.16.4 (based on the released ElasticSearch version 0.16.4).

EDIT: It is finally available as a plugin, thanks to jprante: https://github.com/yakaz/elasticsearch-analysis-combo.
